UpSkillZone AI

Job Twin brief — UpSkillZone AI

Job Twin 2 — RAG Evaluation Harness

Day twin · 6 hours · pass 70%

Scenario

Your team ships a customer-support RAG. Last week a hallucination cost a sales call. Build the evaluation harness that would have caught it.

Time-box: 6 hours. Submit a runnable repository plus a tradeoff writeup.

Deliverables

  1. Eval suite — at least 12 graded test cases, half adversarial.
  2. Judge calibration — kappa report against the seed rubric.
  3. Tradeoff writeup — 300–600 words on what you chose NOT to test and why.
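
For the judge calibration deliverable, one way to produce the kappa figure is Cohen's kappa between the judge model's labels and the seed rubric's labels on the same cases. The sketch below assumes both label sets are categorical (for example "pass"/"fail") and aligned by case; the function name `cohen_kappa` and the label values are illustrative, not part of the brief.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(judge: Sequence[Hashable], rubric: Sequence[Hashable]) -> float:
    """Cohen's kappa between judge labels and seed-rubric labels (hypothetical helper)."""
    if len(judge) != len(rubric) or len(judge) == 0:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    # Observed agreement: fraction of cases where the two labels match.
    p_o = sum(j == r for j, r in zip(judge, rubric)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    judge_counts = Counter(judge)
    rubric_counts = Counter(rubric)
    p_e = sum(judge_counts[lab] * rubric_counts[lab] for lab in judge_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is an assumption of this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


judge_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
rubric_labels = ["pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohen_kappa(judge_labels, rubric_labels)
```

With the six labels above, the raters agree on 4 of 6 cases and chance agreement is 0.5, so kappa comes out to 1/3. Raw agreement alone would overstate how well the judge tracks the rubric.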

Materials

Time-box

6 hours

Server-authoritative clock. The deadline is hard; auto-save does not extend it.

Submission modes

  • repo_url
  • code_editor

The first mode listed is the default on the submit screen.

Rubric

Each dimension is scored on [0.0, 1.0] in 0.05 increments. The overall score is the weighted average of the dimension scores; the pass threshold is 70%.

| Dimension | Key | Weight | What it tests |
|---|---|---|---|
| Problem decomposition | problem_decomposition | 15% | Breaks the failure surface into testable pieces. |
| Programmatic safety checks | programmatic_safety_checks | 15% | Hard checks before judge models run. |
| Regression coverage | regression_testing | 15% | Catches the original incident class. |
| Retrieval grounding | retrieval_grounding | 15% | Citations check; off-corpus claims fail. |
| Reasoned tradeoffs | reasoned_tradeoffs | 15% | Writeup defends the cuts you made. |
| Code quality | code_quality | 15% | Readable, deterministic, runs. |
| Docs quality | docs_quality | 10% | README explains how to run + extend. |
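
To see how the rubric weights combine into the overall score, here is a minimal sketch. It assumes the pass threshold is inclusive (a score of exactly 70% passes) and uses integer arithmetic so that boundary scores are not affected by floating-point rounding; the names `WEIGHT_PCT` and `overall_score` are hypothetical.

```python
from typing import Mapping, Tuple

# Rubric weights in percent, keyed by dimension key (from the rubric).
WEIGHT_PCT = {
    "problem_decomposition": 15,
    "programmatic_safety_checks": 15,
    "regression_testing": 15,
    "retrieval_grounding": 15,
    "reasoned_tradeoffs": 15,
    "code_quality": 15,
    "docs_quality": 10,
}
PASS_PCT = 70          # pass threshold in percent
STEPS_PER_UNIT = 20    # 0.05 increments means 20 steps across [0.0, 1.0]


def overall_score(scores: Mapping[str, float]) -> Tuple[float, bool]:
    """Return (weighted-average score, passed). Assumes an inclusive threshold."""
    if set(scores) != set(WEIGHT_PCT):
        raise ValueError("scores must cover exactly the rubric dimensions")
    total = 0  # accumulated in units of (percent * 0.05-step)
    for key, value in scores.items():
        steps = round(value * STEPS_PER_UNIT)
        on_grid = abs(value * STEPS_PER_UNIT - steps) < 1e-9
        if not (0 <= steps <= STEPS_PER_UNIT and on_grid):
            raise ValueError(f"{key}: score must be in [0.0, 1.0] in 0.05 increments")
        total += WEIGHT_PCT[key] * steps
    max_total = sum(WEIGHT_PCT.values()) * STEPS_PER_UNIT
    passed = total * 100 >= PASS_PCT * max_total
    return total / max_total, passed


score, passed = overall_score({key: 0.70 for key in WEIGHT_PCT})
```

Because docs_quality carries only 10%, a zero there still leaves 90% available, while a zero on any 15% dimension caps the overall score at 85%.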

Failure modes

Self-checks the learner answers before submit. Critical checks block submission unless explicitly forced; the force flag is then surfaced to the mentor.

  • F1 (critical): Did you include at least 4 adversarial cases?
  • F2 (critical): Does `pytest` pass on a clean clone?
  • F3 (reflective): Is the judge prompt versioned in the repo?
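
The gating rule for these self-checks can be sketched as follows. This is an illustration of the behaviour described above, not the platform's implementation: the types `SelfCheck` and `SubmitDecision` and the function `decide_submission` are hypothetical, and the sketch assumes the force flag is only recorded when it actually overrides a failed critical check.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class SelfCheck:
    check_id: str   # e.g. "F1"
    severity: str   # "critical" or "reflective"
    passed: bool    # the learner's answer


@dataclass(frozen=True)
class SubmitDecision:
    allowed: bool
    forced: bool                      # surfaced to the mentor when True
    failed_critical: Tuple[str, ...]  # ids of critical checks answered "no"


def decide_submission(checks: Iterable[SelfCheck], force: bool = False) -> SubmitDecision:
    """Block on failed critical checks unless the learner explicitly forces."""
    failed = tuple(
        c.check_id for c in checks if c.severity == "critical" and not c.passed
    )
    if not failed:
        # Reflective checks never block; nothing was overridden.
        return SubmitDecision(allowed=True, forced=False, failed_critical=())
    return SubmitDecision(allowed=force, forced=force, failed_critical=failed)


answers = [
    SelfCheck("F1", "critical", True),
    SelfCheck("F2", "critical", False),
    SelfCheck("F3", "reflective", False),
]
blocked = decide_submission(answers)
overridden = decide_submission(answers, force=True)
```

Here `blocked` disallows submission because F2 failed, while `overridden` allows it and carries `forced=True` so the mentor can see the override.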

Skill assertions on offer

On a passing review the mentor selects a subset of these to assert, with an asserted weight bounded by the per-skill ceiling shown below.

  • llm.evals.dataset-design · LLM evals: dataset design · max weight 1.00
  • llm.rag.retrieval-pipeline · RAG retrieval pipeline · max weight 0.80
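
A small sketch of the ceiling rule: each asserted weight is checked against the per-skill maximum listed above. Rejecting an out-of-range weight (rather than clamping it) and requiring a weight above zero are assumptions of this example; `validate_assertions` is a hypothetical name.

```python
from typing import Dict, Mapping

# Per-skill ceilings as listed in the brief.
CEILINGS = {
    "llm.evals.dataset-design": 1.00,
    "llm.rag.retrieval-pipeline": 0.80,
}


def validate_assertions(asserted: Mapping[str, float]) -> Dict[str, float]:
    """Check a mentor's selected subset against the per-skill ceilings."""
    for skill, weight in asserted.items():
        if skill not in CEILINGS:
            raise KeyError(f"{skill} is not on offer for this twin")
        # Assumption: a weight must be positive and no greater than the ceiling.
        if not 0.0 < weight <= CEILINGS[skill]:
            raise ValueError(f"{skill}: weight {weight} exceeds ceiling {CEILINGS[skill]}")
    return dict(asserted)


accepted = validate_assertions({"llm.rag.retrieval-pipeline": 0.80})
```

The mentor may assert only a subset, so a mapping with a single skill is valid.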

Mentor SLA

72h

From mentor claim to signoff.

Pass threshold

70%

Weighted-average overall score.

Re-attempts

1

Higher of the two scores flows to the credential.

Start this twin

The clock starts when you press start. Read the brief above first. You will be asked to sign in if you have not already.
