Entrance assessment
The gate is honest. So is the feedback.
A 25-minute, 20-item assessment. Free. The result is yours whether you enrol or not. We target a 40–55% pass rate on candidates who attempt — anything higher and we're not filtering, anything lower and we're rejecting the engineers we built the track for.
The assessment is hard so the cohort is tight. The bar filters you in or out by the same rule we apply to ourselves: have you shipped LLM-shaped product code under real constraints?
A working software engineer who is genuinely ready scores ≥ 70% (PASS). Someone who has watched 80 hours of LLM tutorials but hasn't shipped scores 40–60% (BORDERLINE) and receives a preparation path. Someone with no production engineering experience scores < 40% (BLOCKED) and is told honestly.
- Format: static form (v0); adaptive in Phase 2.
- Time: hard 25-minute cap; autosaves every keystroke.
- Results: the result is yours; prep links included.
- Retakes: parallel form; one identity per attempt.
What we test, what we don't
- We test: production-AI judgment, RAG and evals patterns, agent failure modes, cost / latency tradeoffs, safety thinking.
- We don't test: Python syntax, transformer math, tool memorization, or anti-cheat theatrics. ChatGPT-pasting is allowed; items are calibrated so judgment, not recall, is what scores.
What success looks like
- You finish all 20 items inside 25 minutes. Most candidates who pass leave 3–6 minutes on the clock.
- You name a concrete first action on each tradeoff item, not a list of considerations.
- You spot the production failure modes — cache invalidation, tool-output grounding, prompt injection — without prompting from the question stem.
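The cache-invalidation failure mode is the kind of thing we expect you to recognize on sight: if an embedding cache is keyed by document ID alone, edited documents keep serving stale vectors. A minimal sketch of the structural fix, keying by a hash of content (names are hypothetical, not from any grading rubric):

```python
import hashlib

def embedding_cache_key(document_id: str, content: str) -> str:
    """Key the embedding cache by a hash of the content, not by
    document_id alone, so editing a document invalidates its
    cached embedding automatically on the next lookup."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"{document_id}:{content_hash}"

# Same document, updated content -> different key -> cache miss -> re-embed.
old_key = embedding_cache_key("doc-42", "Pricing: $10/seat")
new_key = embedding_cache_key("doc-42", "Pricing: $12/seat")
assert old_key != new_key
```

The tradeoff is deliberate: stale entries are never served, at the cost of orphaned cache entries that need a separate eviction policy.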
Five public sample items
Representative of the bar, but not in the locked bank used for scoring. Click Reveal answer on each item to see the answer and reasoning. Treat these as a calibration of whether the actual assessment is for you.
Sample 1 · Multiple choice · RAG diagnostics

Users complain a RAG-powered product gives stale answers despite documents being updated weekly. The retrieval layer uses dense embeddings refreshed hourly. The LLM is Claude Sonnet 4.6. Which of the following is the MOST LIKELY cause?

(a) Embedding model is too small
(b) Embedding cache is keyed by document_id and is not invalidated when document content changes
(c) Vector index dimensionality is too low
(d) The LLM's knowledge cutoff predates the document update
(e) Need to add a re-ranking layer

Sample 2 · Code review (multi-select) · Production hygiene

A junior engineer ships this code to production. Which of the following are problems? (Select all that apply.)

```python
def chat(user_id, message):
    response = anthropic.messages.create(
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": f"User {user_id} asks: {message}"}],
        max_tokens=4000,
    )
    return response.content[0].text
```

(a) No system prompt — no role, boundaries, safety framing
(b) Prompt-injection vulnerability via message interpolation
(c) user_id is leaked into the prompt unnecessarily, with PII risk
(d) No error handling for API failures, timeouts, or rate limits
(e) response.content[0].text is fragile and assumes structure
(f) max_tokens=4000 is excessive for typical chat and inflates cost
(g) No streaming — poor UX
(h) No observability or cost tracking per call

Sample 3 · Tradeoff scenario (short answer) · Cost discipline

Your LLM-based feature costs $1,200/day at 50k DAU. The roadmap requires it to scale to 500k DAU within 90 days at the same per-DAU cost (i.e., roughly $12,000/day). Without significantly degrading quality: name your first three interventions, in priority order, with a one-sentence justification each.
Sample 4 · Conceptual (short answer) · Evals

Your LLM-as-judge agrees with human labels at Cohen's kappa = 0.42. What does this mean, and what should you do? (≤100 words.)
Sample 5 · Failure mode (short answer) · Agents

A multi-agent product is occasionally fabricating tool outputs that don't match the actual tool's response. The user-facing answer is wrong as a result. What's most likely happening, and how would you debug? (≤150 words.)
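Sample 4 hinges on knowing what Cohen's kappa actually measures: agreement between two raters after subtracting the agreement you'd expect from chance alone. This is not an answer key, just the definition in code, with toy labels (not drawn from any real eval):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    Chance agreement comes from each rater's marginal label frequencies.
    Edge case of perfect chance agreement (expected == 1) is not handled here."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(judge, human), 2))  # → 0.47: well above chance, well below reliable
```

Raw percent agreement can look flattering when one label dominates; kappa's chance correction is exactly why a 0.42 deserves scrutiny rather than celebration.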
Start the attempt
The timer starts the instant you click. Anonymous attempts are allowed — no account required. An account lets you save the result, re-attempt on a parallel form after 14 days, and apply to a track if you pass.
Anti-cheat policy: results from sessions with strong evidence of identity fraud or shared-account use are voided. We do not run face-tracking proctoring; the Job Twins catch cheaters expensively enough that the entrance assessment doesn't need to.