UpSkillZone

Program · 2026-05-06

What Is a Job Twin?

A graded simulation of real production AI work — same stack, same constraints, same adversarial eval a hiring manager would run. Not a quiz. Not a toy problem. Here is the full picture.


§1

The problem with toy projects

Every AI bootcamp tutorial produces the same RAG demo: 80 lines of LangChain against attention_is_all_you_need.pdf. The demo works. The production system built six months later does not — because production RAG fails on scanned PDFs, on multilingual corpora, on near-duplicate chunks, on latency budgets and citation contracts that survive a legal review.

Hiring managers have learned to discount tutorial projects accordingly. A GitHub repo called rag-demo and a Coursera certificate tell the interviewer the same thing: you have followed instructions. Neither tells them whether you can ship the thing under real pressure, with a real eval harness, against real data.

Job Twins are our answer. Each one is a compressed version of a real production sprint — the scenario, the constraints, the data, and the eval harness are drawn from actual engineering work. You ship the artifact. A calibrated mentor grades it against a public rubric. The score becomes a signed claim in a cryptographic credential that any employer can verify without going through us.


§2

What a Job Twin is, precisely

A Job Twin has four fixed components:

  • A scenario brief. A realistic engineering task in a fictional but plausible company. The brief includes the team context, the deadline, the acceptance criteria, and any constraints (latency budget, cost ceiling, data format). You receive it at the start of the attempt.
  • A dataset and scaffold. A corpus of real-world-shaped data (not curated-for-tutorials data) plus a repository scaffold with a Makefile, pyproject.toml, and a Docker stub. You are not expected to build infrastructure from scratch; you are expected to ship working software.
  • An eval harness. The exact scorer we run against your submission — open source, runnable locally. You pass or fail the quantitative bar before you submit, not after. (A sketch of such a scorer follows this list.)
  • A public rubric. Five to seven grading dimensions, each with a 1–5 scale and anchored descriptors. You see the rubric before you start. We do not believe in surprise rubrics.
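
The eval harness is the load-bearing component, so here is a minimal sketch of what one of its scorers can look like. The eval-set format is a hypothetical stand-in (one JSON line per query, carrying the relevant document IDs and the IDs your service ranked); the real harness ships with each Job Twin.

```python
# Minimal recall@5 scorer sketch, assuming a hypothetical JSONL eval set:
# {"query": ..., "relevant_ids": [...], "ranked_ids": [...]} per line.
import json

K = 5
BAR = 0.75  # the quantitative bar stated in the JT1 brief

def recall_at_k(relevant: set[str], ranked: list[str], k: int = K) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)

def score(eval_path: str) -> float:
    scores = []
    with open(eval_path) as f:
        for line in f:
            row = json.loads(line)
            scores.append(recall_at_k(set(row["relevant_ids"]), row["ranked_ids"]))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    mean_recall = score("eval_set.jsonl")  # hypothetical filename
    verdict = "PASS" if mean_recall >= BAR else "FAIL"
    print(f"recall@{K} = {mean_recall:.3f}  ({verdict})")
```

Because the scorer is runnable locally, "did I clear the bar?" is a command you run before submitting, not a surprise afterwards.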

The time box is 8 hours of focused work, self-reported. The rubric grades the artifact — not the hours, not the commit history, not whether you used an AI assistant. Modern AI engineering is AI-assisted; what the rubric measures is judgment: did you make defensible choices, document your tradeoffs, and hit the quantitative bar?


§3

The five Job Twins

The Production AI Engineer Track has five Job Twins, spaced across the 14-week cohort. Together they cover the five skills hiring managers most commonly test in AI engineering final rounds.

Week 6 · JT1 · RAG

RAG against messy documents

Build a production RAG service against a 200-PDF corpus of scanned, multilingual, and near-duplicate documents. Ship a deployed URL that answers queries with citations, passing recall@5 ≥ 75% on a held-out eval set.

Week 8 · JT2 · Evals

LLM evaluation harness

Design and ship an offline and online eval harness for an existing LLM-powered feature. Demonstrate measurement of regression, latency budget, and output quality across a controlled set of adversarial prompts.

Week 10 · JT3 · Latency & Cost

Latency and cost engineering

Optimize an existing LLM service to hit a strict p99 latency target and a per-query cost budget without degrading output quality. Deliver a before/after instrumentation report with the production trace. (A sketch of this aggregation follows the five twins.)

Week 12 · JT4 · Incident

AI production incident response

Respond to a staged production incident in a simulated on-call environment. Diagnose root cause, implement a hotfix, write a postmortem, and propose a monitoring change to prevent recurrence.

Week 14 · JT5 · Safety

Prompt injection and output safety

Harden an existing LLM API endpoint against a provided adversarial prompt battery. Implement defense layers, pass the attack suite, and document the tradeoffs between security and latency.
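
To make JT3's deliverable concrete: the before/after numbers in an instrumentation report reduce to aggregations like the following. The trace format here is a hypothetical stand-in, one JSON line per request with a latency and a token-metered cost field.

```python
# Sketch of the aggregation behind a JT3 instrumentation report, assuming
# a hypothetical JSONL trace: {"latency_ms": ..., "cost_usd": ...} per line.
import json
import math

def p99(values: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered)) - 1
    return ordered[rank]

def summarize(trace_path: str) -> dict:
    latencies, costs = [], []
    with open(trace_path) as f:
        for line in f:
            rec = json.loads(line)
            latencies.append(rec["latency_ms"])
            costs.append(rec["cost_usd"])
    return {
        "requests": len(latencies),
        "p99_latency_ms": p99(latencies),
        "mean_cost_per_query_usd": sum(costs) / len(costs),
    }

if __name__ == "__main__":
    print(summarize("trace.jsonl"))  # hypothetical trace file
```

Run it once against the baseline trace and once against the optimized service, and the report writes itself.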


§4

How a Job Twin is graded

Every submission is graded by a mentor — a practicing AI engineer, not a teaching assistant or an automated system. Mentors join UpSkillZone by passing a calibration check: they grade a hidden gold set of pre-scored submissions, and their Cohen's kappa agreement against the reference scores must clear 0.7. If it drops below that floor, they recalibrate before further grading counts toward a credential.
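
A minimal sketch of that calibration gate, using scikit-learn's cohen_kappa_score. The gold-set format and the scores below are illustrative, not our real schema.

```python
# Calibration gate sketch: compare a mentor's rubric scores on a gold set
# of pre-scored submissions against the reference scores.
from sklearn.metrics import cohen_kappa_score

KAPPA_FLOOR = 0.7

def mentor_is_calibrated(reference: list[int], mentor: list[int]) -> bool:
    """True if the mentor's agreement with the gold set clears the floor."""
    kappa = cohen_kappa_score(reference, mentor)
    print(f"Cohen's kappa = {kappa:.2f}")
    return kappa >= KAPPA_FLOOR

# Illustrative rubric scores (1-5) on ten gold submissions.
reference = [4, 3, 5, 2, 4, 3, 3, 5, 1, 4]
mentor    = [4, 3, 5, 3, 4, 3, 2, 5, 1, 4]
assert mentor_is_calibrated(reference, mentor)
```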

The mentor grades against the public rubric, attaches inline comments to your code and writeup, and submits a dimension-by-dimension score. The platform aggregates the scores, checks whether you cleared the bar, and either mints the credential or queues you for a retake with full feedback.

The grading queue is blinded: a mentor cannot grade a learner they have a prior relationship with. Self-review attempts are blocked at the queue layer. The per-mentor kappa history is visible on every credential page, so employers can see not just the score but the calibration of the person who assigned it.


§5 · Sample

A look inside Job Twin 1

The following is an excerpt from the Job Twin 1 brief — the exact scenario text a learner receives at the start of the attempt.

You are an AI engineer hired into Acme Corp, a 600-person business-software company with a customer-support knowledge base of 12,000 PDFs accumulated over 11 years. The corpus is messy in specific, real-world ways: ~4,200 are clean digital PDFs; ~3,800 are scanned-then-OCR'd with variable quality; ~1,900 are mostly-image PDFs where text extraction returns a few stray words or nothing; ~1,400 are multilingual; ~700 are near-duplicates.

Your manager pulls you in on Day 4: “Build us a RAG service. Agents type a customer question, your service returns an answer with citations into our PDFs so the agent can verify. You need to pass our retrieval-quality bar (≥75% recall@5 on the held-out eval set) and ship a working URL we can hit. You have 8 hours of focused work. Show me Friday.”

You hand back: a deployed service at a public URL; source code in a GitHub repo with a make eval target to reproduce your retrieval numbers; a retrieval-quality report showing recall@5, MRR, and citation accuracy; and a 1-page tradeoff write-up explaining your chunking, embedding, and re-ranking choices.

Excerpt from Job Twin 1 brief, v0.1. Prompt seed varies per attempt.
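
The ~700 near-duplicates are the kind of edge a submission has to handle deliberately. One illustrative triage, not prescribed by the brief: shingle each document's extracted text into word n-grams and flag pairs with high Jaccard overlap. The 0.85 threshold is an assumption you would tune on samples.

```python
# Illustrative near-duplicate triage: word 5-gram shingles + Jaccard overlap.
def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs: dict[str, str], threshold: float = 0.85):
    """Yield pairs of doc IDs whose shingle sets overlap above the threshold."""
    keys = list(docs)
    sets = {k: shingles(docs[k]) for k in keys}
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if jaccard(sets[a], sets[b]) >= threshold:
                yield a, b
```

Note the pairwise loop is O(n²); at 12,000 documents a real submission would reach for MinHash or LSH, which is exactly the kind of tradeoff the 1-page write-up should defend.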


§6

What a Job Twin is not

  • Not a quiz. There are no multiple-choice questions. The artifact is runnable software that passes or fails a mechanical eval harness, plus a written tradeoff document. You cannot guess your way through.
  • Not AI-graded. The quantitative bar (recall@5, latency, cost) is measured mechanically, but the rubric scoring is human. A calibrated mentor reads your code, your write-up, and your tradeoff choices. Auto-grading-only credentials tell you whether the regex matched, not whether you can ship.
  • Not a live interview. You work asynchronously, on your own machine, with full access to documentation, libraries, and AI assistants. The setup mirrors how you actually work in production — not how you perform under a timer with a stranger watching.
  • Not peer-reviewed. A calibrated mentor with a kappa on file grades your submission. Peer review is useful as a second pass; it is not a substitute for a grader who has been calibrated against a gold set.

§7

How your score becomes a credential

When you clear the bar on a Job Twin, the platform mints a W3C Verifiable Credential in Open Badges 3.0 form. The credential is a JSON-LD document at a stable URL. It contains the skill assertions you earned, the graded artifacts behind each claim, the mentor key ID and their kappa at grading time, and an Ed25519 signature from the UpSkillZone issuer key.

To verify it, an employer pastes the credential URL into any W3C VC verifier — including the one at /verify. The verifier fetches the credential, checks the Ed25519 signature against the JWKS endpoint, and checks the revocation status. No login, no API key, no platform mediation. The credential verifies against a public key that will outlive any platform.
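
For the flavor of that check, here is a simplified sketch of the raw Ed25519 verification against a JWKS key. A production W3C VC verifier also canonicalizes the JSON-LD document and processes the proof block; the URL and field names below are illustrative assumptions.

```python
# Simplified sketch of the Ed25519 signature check at the heart of
# verification. Payload bytes, field names, and the JWKS URL are all
# illustrative; real VC verification involves the full proof suite.
import base64
import requests
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
from cryptography.exceptions import InvalidSignature

JWKS_URL = "https://upskillzone.example/.well-known/jwks.json"  # hypothetical

def b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def verify(payload: bytes, signature_b64: str, kid: str) -> bool:
    jwks = requests.get(JWKS_URL, timeout=10).json()
    key = next(k for k in jwks["keys"] if k["kid"] == kid)  # OKP / Ed25519
    pub = Ed25519PublicKey.from_public_bytes(b64url_decode(key["x"]))
    try:
        pub.verify(b64url_decode(signature_b64), payload)
        return True
    except InvalidSignature:
        return False
```

The key property is in the last line of the paragraph above: nothing in this check requires UpSkillZone to be online or cooperative.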

After all five Job Twins are cleared, the full track credential mints — a composite claim that includes every skill assertion from all five twins, each with its graded artifact, mentor kappa, and 90-day outcome attestation (once a hire is made).


FAQ

Frequently asked questions

What is a Job Twin?
A Job Twin is a time-boxed simulation of a real production AI engineering task. You receive the same brief, codebase scaffold, and eval harness a new hire would receive on their first production sprint. A mentor grades your submission against a rubric that has been calibrated against a gold set. Your score binds to a cryptographic credential if you clear the bar.
How long does a Job Twin take to complete?
Each Job Twin is designed for 8 hours of focused work. You may spread that over multiple days. The clock is self-reported; the rubric grades the artifact, not the hours.
Can I retake a Job Twin?
Yes. Retakes use a different prompt seed so you cannot memorize the answer. Each attempt is timestamped on your learning record; only the highest calibrated score binds to the credential.
Are Job Twin briefs public before I attempt them?
The rubric and the five dimensions you are graded against are shared upfront. The specific prompt seed (the exact data, the exact adversarial edges) is private until you start the attempt, to preserve the signal.
What happens if I do not clear the bar on a Job Twin?
You receive full mentor feedback — the graded rubric, inline comments on your code, and a score breakdown. You may retake immediately or after addressing the feedback. There is no penalty for retakes; only the highest calibrated score counts.
