Job Twin brief — UpSkillZone AI
Capstone — Production AI service
Scenario
Ship an end-to-end production AI service of your choosing. Two-week build window. Two mentor reviewers (separate ledger). Public artifact at the end.
Deliverables
- Repo — production-grade, with CI, tests, and a deployable container.
- Evals — domain-specific suite with adversarial coverage.
- Security review — threat model and at least one mitigated risk.
- Reflection — what you'd do differently with another two weeks.
- Public artifact — demo, write-up, or talk.
Materials
Time-box
14 days
Server-authoritative clock. The deadline is hard; auto-save does not extend it.
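The time-box rule can be sketched as below. This is a minimal illustration, not the platform's implementation: the function names are hypothetical, and treating a submission that lands exactly on the deadline as accepted is an assumption the brief does not confirm.

```python
from datetime import datetime, timedelta, timezone

TIME_BOX = timedelta(days=14)  # build window stated in the brief


def deadline_for(started_at: datetime) -> datetime:
    """Return the hard deadline (UTC) for a twin started at `started_at`."""
    if started_at.tzinfo is None:
        raise ValueError("started_at must be timezone-aware")
    return started_at.astimezone(timezone.utc) + TIME_BOX


def accepts_submission(started_at: datetime, server_now: datetime) -> bool:
    """True while the window is open.

    Only the server's clock is consulted. Client timestamps and auto-save
    times are deliberately not parameters, so they cannot extend the window.
    Accepting a submission exactly at the deadline is an assumption.
    """
    return server_now.astimezone(timezone.utc) <= deadline_for(started_at)
```

For example, a twin started at 09:00 UTC on 1 January would close at 09:00 UTC on 15 January under this sketch.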
Submission modes
- repo_url
The first mode listed is the default on the submit screen.
Rubric
Each dimension is scored on [0.0, 1.0] in 0.05 increments. The overall score is the weighted average of the dimension scores; the pass threshold is 75%.
| Dimension | ID | Weight | What it tests |
|---|---|---|---|
| Problem framing | `problem_framing` | 15% | Clear user, clear value, clear scope. |
| System design | `system_design` | 20% | Architecture matches the constraints; tradeoffs named. |
| Production quality | `production_quality` | 20% | CI, container, observability, runbook. |
| Evals coverage | `evals_coverage` | 15% | Domain-specific suite with adversarial cases. |
| Security posture | `security_posture` | 10% | Threat model plus at least one mitigated risk. |
| Reflection | `reflection` | 10% | Honest account of what you'd do differently. |
| Polish | `polish` | 10% | Public artifact is something you'd link from a resume. |
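The scoring arithmetic can be checked with a short sketch. The weights and dimension IDs come from the rubric; the function names, the rounding to four decimals, and the reading of "pass at 75%" as "at least 0.75" are assumptions made for illustration.

```python
WEIGHTS = {
    "problem_framing": 0.15,
    "system_design": 0.20,
    "production_quality": 0.20,
    "evals_coverage": 0.15,
    "security_posture": 0.10,
    "reflection": 0.10,
    "polish": 0.10,
}
PASS_THRESHOLD = 0.75  # assumed inclusive


def overall_score(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, rounded to 4 decimals."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}: score must lie in [0.0, 1.0]")
        # Scores move in 0.05 steps, so value * 20 should be a whole number.
        if abs(value * 20 - round(value * 20)) > 1e-9:
            raise ValueError(f"{name}: score must be a multiple of 0.05")
    # The weights sum to 1.0, so the weighted sum is the weighted average.
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 4)


def passes(scores: dict[str, float]) -> bool:
    return overall_score(scores) >= PASS_THRESHOLD


example = {
    "problem_framing": 0.80,
    "system_design": 0.80,
    "production_quality": 0.75,
    "evals_coverage": 0.80,
    "security_posture": 0.60,
    "reflection": 0.90,
    "polish": 0.70,
}
print(overall_score(example), passes(example))  # 0.77 True
```

In this example the weak security score (0.60) costs only 0.04 of the total because that dimension carries 10%, while the same shortfall in system design or production quality would cost twice as much.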
Failure modes
Self-checks the learner answers before submitting. Critical checks block submission unless explicitly forced; the force flag is then surfaced to the mentor.
- F1 (critical): Does `pytest` pass on a clean clone?
- F2 (critical): Does a deployable container exist and start?
- F3 (reflective): Is the evals suite runnable end-to-end?
- F4 (reflective): Is the public artifact actually public?
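The gating rule for these self-checks can be sketched as follows. The check IDs, questions, and severities come from the list above; the class and function names are hypothetical, and treating an unanswered critical check as failing is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfCheck:
    id: str
    question: str
    critical: bool


CHECKS = [
    SelfCheck("F1", "Does `pytest` pass on a clean clone?", True),
    SelfCheck("F2", "Does a deployable container exist and start?", True),
    SelfCheck("F3", "Is the evals suite runnable end-to-end?", False),
    SelfCheck("F4", "Is the public artifact actually public?", False),
]


@dataclass(frozen=True)
class SubmitDecision:
    allowed: bool
    forced: bool  # when True, this flag is what the mentor would see
    failed_critical: tuple[str, ...]


def decide(answers: dict[str, bool], force: bool = False) -> SubmitDecision:
    """Block on any failed critical check unless the learner forces it."""
    failed = tuple(
        c.id for c in CHECKS if c.critical and not answers.get(c.id, False)
    )
    if not failed:
        # Reflective checks never block, whatever their answers.
        return SubmitDecision(allowed=True, forced=False, failed_critical=())
    return SubmitDecision(allowed=force, forced=force, failed_critical=failed)
```

Under this sketch, `decide({"F1": False, "F2": True}, force=True)` is allowed but carries `forced=True` and `failed_critical=("F1",)`, which is the information the brief says reaches the mentor.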
Skill assertions on offer
On a passing review the mentor selects a subset of these to assert, with an asserted weight bounded by the per-skill ceiling shown below.
- `llm.ops.system-design` (LLM ops: system design), max weight 1.00
- `llm.evals.dataset-design` (LLM evals: dataset design), max weight 1.00
- `llm.safety.security-review` (LLM safety: security review), max weight 0.90
- `llm.api.production-readiness` (LLM API: production readiness), max weight 1.00
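The ceiling rule can be illustrated with a small validator. The skill IDs and ceilings are the ones listed above; `validate_assertions` is a hypothetical name, and requiring a strictly positive weight is an assumption, since the brief states only the upper bound.

```python
MAX_WEIGHT = {
    "llm.ops.system-design": 1.00,
    "llm.evals.dataset-design": 1.00,
    "llm.safety.security-review": 0.90,
    "llm.api.production-readiness": 1.00,
}


def validate_assertions(asserted: dict[str, float], passed: bool) -> dict[str, float]:
    """Check a mentor's chosen subset of skills against the per-skill ceilings."""
    if not passed:
        raise ValueError("skills can only be asserted on a passing review")
    for skill, weight in asserted.items():
        if skill not in MAX_WEIGHT:
            raise ValueError(f"{skill} is not on offer for this twin")
        # Lower bound of 0 (exclusive) is an assumption; the ceiling is from the brief.
        if not 0.0 < weight <= MAX_WEIGHT[skill]:
            raise ValueError(
                f"{skill}: weight {weight} outside (0, {MAX_WEIGHT[skill]}]"
            )
    return dict(asserted)
```

A mentor asserting only `llm.ops.system-design` at 0.8 would pass this check, while `llm.safety.security-review` at 0.95 would be rejected because its ceiling is 0.90.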
Mentor SLA
168h
From mentor claim to signoff.
Pass threshold
75%
Weighted-average overall score.
Re-attempts
none
The capstone is the final exam.
Start this twin
The clock starts when you press start. Read the brief above first.