Job Twin brief — UpSkillZone AI
Job Twin 2 — RAG Evaluation Harness
Scenario
Your team ships a customer-support RAG. Last week a hallucination cost a sales call. Build the evaluation harness that would have caught it.
Time-box: 6 hours. Submit a runnable repository plus a tradeoff writeup.
Deliverables
- Eval suite — at least 12 graded test cases, half adversarial.
- Judge calibration — kappa report against the seed rubric.
- Tradeoff writeup — 300–600 words on what you chose NOT to test and why.
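For the judge calibration deliverable, one way to produce the kappa figure is sketched below. It assumes the report uses Cohen's kappa over categorical labels (the brief only says "kappa"), and that both the judge output and the seed rubric have already been reduced to one label per test case; the label values in the usage example are hypothetical.

```python
from collections import Counter


def cohen_kappa(judge_labels, rubric_labels):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(judge_labels) != len(rubric_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)

    # Fraction of cases where the judge and the rubric agree.
    observed = sum(j == r for j, r in zip(judge_labels, rubric_labels)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge_labels)
    rubric_counts = Counter(rubric_labels)
    expected = sum(judge_counts[c] * rubric_counts[c] for c in judge_counts) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four test cases.
judge = ["pass", "pass", "fail", "fail"]
rubric = ["pass", "fail", "fail", "fail"]
print(cohen_kappa(judge, rubric))
```

Raw agreement here is 0.75, but half of that is expected by chance, which is why a kappa report is more informative than an accuracy number when one label dominates.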
Materials
- Starter repo (repo)
- Seed rubric, CSV (dataset)
Time-box
6 hours
Server-authoritative clock. The deadline is hard; auto-save does not extend it.
Submission modes
- repo_url
- code_editor
The first mode listed is the default on the submit screen.
Rubric
Each dimension scored on [0.0, 1.0] in 0.05 increments. The overall score is the weighted average; pass at 70%.
| Dimension | Weight | What it tests |
|---|---|---|
| Problem decomposition (`problem_decomposition`) | 15% | Breaks the failure surface into testable pieces. |
| Programmatic safety checks (`programmatic_safety_checks`) | 15% | Hard checks before judge models run. |
| Regression coverage (`regression_testing`) | 15% | Catches the original incident class. |
| Retrieval grounding (`retrieval_grounding`) | 15% | Citations check; off-corpus claims fail. |
| Reasoned tradeoffs (`reasoned_tradeoffs`) | 15% | Writeup defends the cuts you made. |
| Code quality (`code_quality`) | 15% | Readable, deterministic, runs. |
| Docs quality (`docs_quality`) | 10% | README explains how to run + extend. |
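The "hard checks before judge models run" and "citations check" rows can be illustrated with a deterministic grounding check. This is a minimal sketch under assumptions the brief does not state: answers cite sources with bracketed ids such as `[kb-12]`, and the harness knows which document ids were retrieved for the question. The function name and the citation format are hypothetical.

```python
import re

# Assumed citation format: a bracketed document id, e.g. "[kb-12]".
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def citation_check(answer, retrieved_ids):
    """Return (ok, reason). Fails uncited answers and off-corpus citations."""
    cited = CITATION.findall(answer)
    if not cited:
        return False, "no citations"
    unknown = sorted(set(cited) - set(retrieved_ids))
    if unknown:
        return False, "cites ids not in retrieved set: " + ", ".join(unknown)
    return True, "ok"


print(citation_check("Refunds take 5 days [kb-12].", {"kb-12", "kb-40"}))
print(citation_check("Refunds are instant [kb-99].", {"kb-12", "kb-40"}))
```

A check like this is cheap and reproducible, so it can run before any judge model is called. It only verifies that citations point at retrieved documents, not that the cited text supports the claim; that second question is where a judge or a stricter entailment check would come in.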
Failure modes
Self-checks the learner answers before submitting. Critical checks block submission unless explicitly forced; the force flag is then surfaced to the mentor.
- F1 (critical): Did you include at least 4 adversarial cases?
- F2 (critical): Does `pytest` pass on a clean clone?
- F3 (reflective): Is the judge prompt versioned in the repo?
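The count-based self-checks can be automated inside the repository so they run with the rest of the suite. The sketch below is one possible shape, not the starter repo's actual layout: `EvalCase` and `check_suite` are hypothetical names. The default of 12 cases comes from the deliverables list and the default of 4 adversarial cases from F1; since the deliverables list asks for half the cases to be adversarial, raising `min_adversarial` is a reasonable choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    question: str
    adversarial: bool


def check_suite(cases, min_total=12, min_adversarial=4):
    """Return a list of problems; an empty list means the suite passes."""
    problems = []
    if len(cases) < min_total:
        problems.append(f"only {len(cases)} cases, need {min_total}")
    n_adversarial = sum(case.adversarial for case in cases)
    if n_adversarial < min_adversarial:
        problems.append(f"only {n_adversarial} adversarial cases, need {min_adversarial}")
    ids = [case.case_id for case in cases]
    if len(set(ids)) != len(ids):
        problems.append("duplicate case ids")
    return problems


# Hypothetical suite: 12 cases, the first 4 flagged adversarial.
suite = [EvalCase(f"c{i}", f"question {i}", adversarial=(i < 4)) for i in range(12)]
print(check_suite(suite))
```

Wrapping the call in a test function such as `def test_suite_shape(): assert check_suite(load_cases()) == []`, with a `load_cases` helper of your own, would make `pytest` fail on a clean clone whenever the suite drops below the stated minimums.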
Skill assertions on offer
On a passing review the mentor selects a subset of these to assert, with an asserted weight bounded by the per-skill ceiling shown below.
- LLM evals: dataset design (`llm.evals.dataset-design`), max weight 1.00
- RAG retrieval pipeline (`llm.rag.retrieval-pipeline`), max weight 0.80
Mentor SLA
72h
From mentor claim to signoff.
Pass threshold
70%
Weighted-average overall score.
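To see how the dimension scores combine, the arithmetic can be sketched as follows. The weights and the 0.05 step come from the rubric; treating exactly 70% as a pass is an assumption, and the function names are hypothetical. Integer arithmetic (scores in twentieths, weights in percent) avoids floating-point drift right at the threshold.

```python
# Rubric weights in percent, keyed by dimension id.
WEIGHTS_PCT = {
    "problem_decomposition": 15,
    "programmatic_safety_checks": 15,
    "regression_testing": 15,
    "retrieval_grounding": 15,
    "reasoned_tradeoffs": 15,
    "code_quality": 15,
    "docs_quality": 10,
}
PASS_PCT = 70  # assumption: a score of exactly 70% passes


def overall_percent(scores):
    """Weighted-average score in percent from per-dimension scores in [0.0, 1.0]."""
    if set(scores) != set(WEIGHTS_PCT):
        raise ValueError("scores must cover exactly the rubric dimensions")
    total = 0
    for name, value in scores.items():
        steps = round(value * 20)  # number of 0.05 increments
        if not 0 <= steps <= 20 or abs(value * 20 - steps) > 1e-9:
            raise ValueError(f"{name}: {value} is not a 0.05 step in [0.0, 1.0]")
        total += WEIGHTS_PCT[name] * steps
    return total / 20


def passes(scores):
    return overall_percent(scores) >= PASS_PCT


# Hypothetical review: full marks everywhere except docs, which scores zero.
review = {name: 1.0 for name in WEIGHTS_PCT}
review["docs_quality"] = 0.0
print(overall_percent(review), passes(review))
```

Because six of the seven dimensions carry equal weight, a zero on any single dimension still leaves a pass within reach, but two weak 15% dimensions make the threshold hard to clear.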
Re-attempts
1
Higher of the two scores flows to the credential.
Start this twin
The clock starts when you press start. Read the brief above first. You will be asked to sign in if you have not already.