Job Twin brief — UpSkillZone AI

Job Twin 2 — RAG Evaluation Harness

Day twin · 6 hours · pass 70%

Scenario

Your team ships a customer-support RAG. Last week a hallucination cost a sales call. Build the evaluation harness that would have caught it.

Time-box: 6 hours. Submit a runnable repository plus a tradeoff writeup.

Deliverables

Eval suite — at least 12 graded test cases, half adversarial.
Judge calibration — kappa report against the seed rubric.
Tradeoff writeup — 300–600 words on what you chose NOT to test and why.
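
For the judge calibration deliverable, a minimal sketch of an agreement statistic follows. The brief only says "kappa"; this example assumes Cohen's kappa between the judge model's labels and the seed rubric's labels, one label per test case. The function name and the label values are illustrative and do not come from the starter repo.

```python
from collections import Counter


def cohen_kappa(judge_labels, rubric_labels):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(judge_labels) != len(rubric_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)

    # Observed agreement: fraction of cases where both raters gave the same label.
    observed = sum(j == r for j, r in zip(judge_labels, rubric_labels)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    judge_counts = Counter(judge_labels)
    rubric_counts = Counter(rubric_labels)
    expected = sum(
        judge_counts[label] * rubric_counts[label] for label in judge_counts
    ) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    judge = ["pass", "pass", "fail", "fail"]    # hypothetical judge verdicts
    rubric = ["pass", "fail", "fail", "fail"]   # hypothetical seed-rubric verdicts
    print(f"kappa = {cohen_kappa(judge, rubric):.2f}")
```

Raw percent agreement would look flattering when one label dominates; kappa corrects for that chance agreement, which is why it is the more useful number to report against the seed rubric.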

Materials

Starter repo · repo
Seed rubric (CSV) · dataset

Time-box

6 hours

Server-authoritative clock. The deadline is hard; auto-save does not extend it.

Submission modes

repo_url
code_editor

The first mode listed is the default on the submit screen.

Rubric

Each dimension scored on [0.0, 1.0] in 0.05 increments. The overall score is the weighted average; pass at 70%.

Dimension	Key	Weight	What it tests
Problem decomposition	problem_decomposition	15%	Breaks the failure surface into testable pieces.
Programmatic safety checks	programmatic_safety_checks	15%	Hard checks before judge models run.
Regression coverage	regression_testing	15%	Catches the original incident class.
Retrieval grounding	retrieval_grounding	15%	Citations check; off-corpus claims fail.
Reasoned tradeoffs	reasoned_tradeoffs	15%	Writeup defends the cuts you made.
Code quality	code_quality	15%	Readable, deterministic, runs.
Docs quality	docs_quality	10%	README explains how to run + extend.
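
The scoring rule above can be sketched as follows. The weights and the 70% threshold are taken from the rubric; the function names, the input validation, and the small floating-point tolerance are assumptions made for illustration, not the platform's actual implementation.

```python
# Weights from the rubric table, keyed by dimension identifier.
WEIGHTS = {
    "problem_decomposition": 0.15,
    "programmatic_safety_checks": 0.15,
    "regression_testing": 0.15,
    "retrieval_grounding": 0.15,
    "reasoned_tradeoffs": 0.15,
    "code_quality": 0.15,
    "docs_quality": 0.10,
}
PASS_THRESHOLD = 0.70
EPS = 1e-9  # assumed tolerance for floating-point rounding


def overall_score(scores):
    """Weighted average of per-dimension scores, each in [0.0, 1.0] on a 0.05 grid."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for key, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key}: score {value} is outside [0.0, 1.0]")
        if abs(value * 20 - round(value * 20)) > EPS:
            raise ValueError(f"{key}: score {value} is not a multiple of 0.05")
    # The weights sum to 1.0, so the weighted sum is already the weighted average.
    return sum(WEIGHTS[key] * scores[key] for key in WEIGHTS)


def passes(scores):
    """True when the overall score reaches the 70% pass threshold."""
    return overall_score(scores) >= PASS_THRESHOLD - EPS


if __name__ == "__main__":
    example = {key: 0.80 for key in WEIGHTS}  # hypothetical mentor scores
    example["docs_quality"] = 0.50
    print(f"overall = {overall_score(example):.3f}, pass = {passes(example)}")
```

Because six of the seven dimensions carry equal weight, a zero on any one of them costs 15 points, so a submission can lose at most two such dimensions entirely and still reach 70%.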

Failure modes

Self-checks the learner answers before submitting. Critical checks block submission unless explicitly forced; the force flag is then surfaced to the mentor.

F1 (critical): Did you include at least 4 adversarial cases?
F2 (critical): Does `pytest` pass on a clean clone?
F3 (reflective): Is the judge prompt versioned in the repo?
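
The gating behaviour described for these self-checks might look roughly like the sketch below. The check ids, questions, and severities come from the list above; the `SelfCheck` type, the `gate_submission` function, and its return shape are hypothetical, and unanswered checks are assumed to count as "no".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfCheck:
    check_id: str
    question: str
    severity: str  # "critical" or "reflective"


CHECKS = [
    SelfCheck("F1", "Did you include at least 4 adversarial cases?", "critical"),
    SelfCheck("F2", "Does `pytest` pass on a clean clone?", "critical"),
    SelfCheck("F3", "Is the judge prompt versioned in the repo?", "reflective"),
]


def gate_submission(answers, force=False):
    """Decide whether a submission may proceed.

    answers maps check id -> True ("yes") or False ("no").
    Returns (allowed, failed_critical_ids, forced), where forced is True only
    when the force flag was actually needed to get past a failed critical check.
    """
    failed = [
        check.check_id
        for check in CHECKS
        if check.severity == "critical" and not answers.get(check.check_id, False)
    ]
    forced = bool(failed) and force
    allowed = not failed or force
    return allowed, failed, forced


if __name__ == "__main__":
    # A "no" on a reflective check never blocks; a "no" on a critical check does.
    print(gate_submission({"F1": True, "F2": True, "F3": False}))
    print(gate_submission({"F1": False, "F2": True, "F3": True}))
    print(gate_submission({"F1": False, "F2": True, "F3": True}, force=True))
```

Returning the `forced` flag separately from `allowed` keeps the information the mentor needs: a forced submission is permitted, but it is distinguishable from a clean one.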

Skill assertions on offer

On a passing review the mentor selects a subset of these to assert, with an asserted weight bounded by the per-skill ceiling shown below.

llm.evals.dataset-design · LLM evals — dataset design · max weight 1.00
llm.rag.retrieval-pipeline · RAG retrieval pipeline · max weight 0.80
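
As a rough illustration of the per-skill ceiling, the sketch below validates a mentor's selection. The ceilings are the ones listed above; the function name is hypothetical, and two behaviours are assumptions: an over-ceiling weight is rejected instead of being clamped, and the lower bound is taken to be 0.

```python
# Ceilings from the skill list above, keyed by skill identifier.
SKILL_CEILINGS = {
    "llm.evals.dataset-design": 1.00,
    "llm.rag.retrieval-pipeline": 0.80,
}


def validate_assertions(asserted):
    """Check a mentor's selected subset of skills and their asserted weights.

    asserted maps skill id -> asserted weight. Returns a copy of the mapping
    if every skill is on offer and every weight lies in [0, ceiling].
    """
    for skill, weight in asserted.items():
        if skill not in SKILL_CEILINGS:
            raise ValueError(f"{skill} is not on offer for this twin")
        ceiling = SKILL_CEILINGS[skill]
        if not 0.0 <= weight <= ceiling:
            raise ValueError(f"{skill}: weight {weight} exceeds ceiling {ceiling}")
    return dict(asserted)


if __name__ == "__main__":
    # A subset is fine: the mentor need not assert every skill on offer.
    print(validate_assertions({"llm.rag.retrieval-pipeline": 0.75}))
```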

Mentor SLA

72h

From mentor claim to signoff.

Pass threshold

70%

Weighted-average overall score.

Re-attempts

The higher of the two scores flows to the credential.

Start this twin

The clock starts when you press start. Read the brief above first. You will be asked to sign in if you have not already.

Open jt-rag-eval-1 in dashboard →