UpSkillZone

Fundamentals · 2026-05-06

RAG System Architecture: A Production Engineer's Guide

Retrieval-augmented generation is a simple concept with a complex production reality. This guide covers the full architecture: ingestion, retrieval, augmentation, and generation — with the tradeoffs that matter.


§1

The RAG Concept in One Paragraph

RAG solves a specific problem: language models have a training cutoff and a fixed parameter set, so they cannot know about your proprietary documents, your product's current state, or events after their training data ends. RAG fixes this by retrieving relevant documents at query time and injecting them into the prompt before generation. The model then uses those documents as grounding context, which reduces hallucination and enables knowledge-base-driven applications without fine-tuning.

The "simple" version of RAG is: embed the query, find the nearest vectors in your store, stuff the documents into the prompt, generate. This works in demos. The production version adds chunk quality management, hybrid retrieval, re-ranking, metadata filtering, context window budgeting, faithfulness evaluation, and a dozen other components that each have their own failure modes.
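
For concreteness, a minimal sketch of the demo-grade version described above, assuming embed(text) and generate(prompt) wrap your embedding and chat APIs and index is a small in-memory list of pre-embedded chunks (all names are illustrative):

# Naive RAG: embed the query, take the nearest chunks, stuff them into the prompt.
import numpy as np

def answer(query, index, embed, generate, k=4):
    q = np.array(embed(query))
    q = q / np.linalg.norm(q)                      # normalize for cosine similarity
    scored = []
    for text, vec in index:                        # index: list of (chunk_text, vector)
        v = np.array(vec)
        scored.append((float(q @ (v / np.linalg.norm(v))), text))
    top = [text for _, text in sorted(scored, reverse=True)[:k]]
    prompt = ("Answer using only the context below.\n\n"
              + "\n\n".join(top)
              + f"\n\nQuestion: {query}")
    return generate(prompt)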

Understanding the concept is easy. Understanding why a specific RAG deployment is returning irrelevant documents, generating unfaithful answers, or costing three times what it should — that is the engineering work.


§2

Ingestion Pipeline Design

The ingestion pipeline converts raw documents into retrievable chunks. It consists of: document loading (PDF, DOCX, HTML, Markdown parsers), preprocessing (cleaning, deduplication, normalization), chunking (splitting into retrieval units), embedding (converting chunks to vectors), and indexing (storing in a vector database with metadata).

Document loading is more complex than it appears. PDFs have layout information, tables, and images that naive parsers lose. HTML documents have navigation elements and ads that pollute chunks. DOCX files have tracked changes and hidden text. Production ingestion pipelines need document-type-specific parsing logic and validation that checks parse quality before embedding.

The ingestion pipeline runs in two modes: batch (initial load of a large corpus) and incremental (ongoing updates as documents change). Incremental ingestion requires document fingerprinting to detect changes, tombstoning to remove deleted documents from the index, and idempotency so re-runs do not create duplicate chunks. Most production teams underinvest in incremental ingestion and end up with stale indexes that users distrust.
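
A minimal sketch of the incremental side, assuming a vector store client with upsert and delete_by_doc methods and a persisted map of previously seen fingerprints (all names here are illustrative):

# Incremental ingestion: fingerprint documents, skip unchanged ones, tombstone
# deletions, and re-chunk only what actually changed so re-runs stay idempotent.
import hashlib

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_ingest(documents, seen, store, chunk, embed):
    """documents: {doc_id: text}; seen: {doc_id: fingerprint} from the last run."""
    for doc_id, text in documents.items():
        fp = fingerprint(text)
        if seen.get(doc_id) == fp:
            continue                               # unchanged: skip re-embedding
        store.delete_by_doc(doc_id)                # drop old chunks before re-inserting
        for i, piece in enumerate(chunk(text)):
            store.upsert(f"{doc_id}:{i}", embed(piece), {"doc_id": doc_id})
        seen[doc_id] = fp
    for doc_id in set(seen) - set(documents):      # tombstone deleted documents
        store.delete_by_doc(doc_id)
        del seen[doc_id]
    return seen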


§3

Chunking and Embedding

Chunking is the highest-leverage decision in a RAG pipeline because it determines the granularity of retrieval. Chunks that are too large return imprecise matches with lots of irrelevant content in each chunk. Chunks that are too small return precise matches but miss context that spans multiple sentences. The right chunk size depends on your document type, your query distribution, and your model's context window.
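
As a baseline, a fixed-size splitter with overlap is a few lines; the overlap keeps content that straddles a boundary retrievable from both sides (sizes here are in characters and purely illustrative):

# Fixed-size chunking with overlap: simple, predictable, and a reasonable baseline
# before investing in structure-aware (heading- or sentence-based) splitting.
def chunk_text(text: str, size: int = 1000, overlap: int = 200):
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap      # step back so boundary content appears in both chunks
    return chunks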

Embedding converts each chunk into a dense vector that captures semantic meaning. The embedding model choice determines your retrieval ceiling — a weak embedding model means even perfect retrieval logic cannot surface the right chunks. OpenAI's text-embedding-3-small is the safe default. BGE-M3 and E5-mistral are competitive open-source alternatives with strong multilingual support.

A frequently missed production concern is embedding normalization. Cosine similarity is the normalized dot product, and many vector indexes compute it as a plain inner product on the assumption that the stored vectors are already unit-length. If your embedding model does not normalize its output by default (many do not), you must normalize before indexing and again at query time. Unnormalized vectors produce similarity rankings skewed by vector magnitude, and the failure is silent — the system returns results, just bad ones.
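
The fix is one line per batch; a sketch assuming numpy arrays of shape (n, dim):

# L2-normalize embeddings so inner-product search is equivalent to cosine similarity.
# Apply the same step to query vectors at search time.
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)   # guard against zero vectors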


§4

Vector Store Selection

The vector store is responsible for indexing vectors and serving approximate nearest neighbor (ANN) queries at low latency. The major options are: Pinecone (managed, expensive at scale), Weaviate (open-source, module ecosystem), Qdrant (Rust-based, fast self-hosting), pgvector (Postgres extension), and Chroma (local development). Each has different operational tradeoffs.

For most production applications with fewer than 10M vectors, pgvector running on a Postgres instance you already operate is the right answer. It eliminates a separate service, leverages existing operational knowledge, and supports exact nearest neighbor search for small datasets. The scaling ceiling is real — pgvector is not designed for billion-vector indexes — but most applications never approach it.
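
A sketch of a pgvector similarity query with the psycopg2 driver, assuming a chunks(id, content, embedding vector(1536)) table (table name and dimensions are illustrative):

# <=> is pgvector's cosine-distance operator, so ordering by ascending distance
# returns the closest chunks first.
import psycopg2

def search_chunks(conn, query_embedding, k=5):
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, content, embedding <=> %s::vector AS distance "
            "FROM chunks ORDER BY distance LIMIT %s",
            (vector_literal, k),
        )
        return cur.fetchall()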

For applications that need filtered vector search at scale, Qdrant is consistently the best performer. Its payload filtering is applied during index traversal rather than as a post-filter on the result set, which means filtered queries do not degrade in quality as the dataset grows. This is the correct architecture — post-filtering ANN search produces poor results when filters are selective.
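
A sketch of a filtered query with the qdrant-client Python package; the collection and payload field names are illustrative, and query_embedding is assumed to come from your embedding model:

# The filter is evaluated inside the index traversal, so selective filters
# do not hollow out the result set the way post-filtering does.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="docs",
    query_vector=query_embedding,          # list[float] from your embedding model
    query_filter=Filter(
        must=[FieldCondition(key="product", match=MatchValue(value="XR-2040"))]
    ),
    limit=5,
)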


§5

Retrieval: Dense, Sparse, Hybrid

Dense retrieval uses embedding similarity — the query and documents are embedded into the same vector space, and the nearest neighbors by cosine similarity are returned. Dense retrieval excels at semantic matching: "what medications interact with warfarin" matches "drugs contraindicated with anticoagulants" even without keyword overlap. But it fails on exact matching: "product model XR-2040" may not retrieve the spec sheet for the XR-2040 if the embedding model saw few alphanumeric product codes during training.

Sparse retrieval (BM25) uses keyword overlap. It is fast, interpretable, and handles exact terms well. It fails when the query and document use different vocabulary for the same concept. BM25 is the retrieval algorithm that every production system should have available, even if dense retrieval is primary.

Hybrid retrieval combines both signals using Reciprocal Rank Fusion (RRF) or a learned combiner. RRF is the practical default: for each document, sum 1 / (k + rank) across the dense and sparse result lists (k ≈ 60 is the usual damping constant), then re-rank by the combined score. In practice, hybrid retrieval consistently outperforms either method alone across diverse query types. The overhead is running two retrieval passes, which adds ~5–15ms depending on corpus size.
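
RRF itself is a few lines over two ranked lists of chunk ids; a sketch:

# Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document,
# and documents are re-ranked by the summed score.
from collections import defaultdict

def rrf(dense_ids, sparse_ids, k=60):
    scores = defaultdict(float)
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)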


§6

Augmentation and Prompt Construction

Augmentation is the step where retrieved chunks are assembled into the final prompt. This is more than concatenation. You need to: decide how many chunks to include (k), order them (most relevant first? chronological?), format them (with source citations? with headers?), handle truncation when the total token count exceeds the context budget, and inject metadata (document title, date, section) that helps the model weight sources correctly.

Context window budgeting is a critical production concern. If you retrieve k=10 chunks and each is 500 tokens, you have consumed 5,000 tokens of context before the system prompt or the conversation history. Models with 128k context windows make this feel like a non-issue — until you get the billing statement. Budget your context explicitly: system prompt N tokens, conversation history M tokens, retrieved context C tokens, generation buffer G tokens. The sum must stay below the model's limit.
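
A sketch of an explicit budget check, assuming a count_tokens helper (for OpenAI models, tiktoken provides one) and chunks already ordered by relevance:

# Fit retrieved chunks into whatever remains after the system prompt, the
# conversation history, and a reserved generation buffer.
def fit_chunks(chunks, count_tokens, model_limit=128_000,
               system_tokens=800, history_tokens=4_000, generation_buffer=1_024):
    budget = model_limit - system_tokens - history_tokens - generation_buffer
    selected, used = [], 0
    for chunk in chunks:                   # most relevant first
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected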

Source citation formatting matters for faithfulness. If you want the model to cite specific documents, you must include document identifiers in the context and instruct the model explicitly on citation format. Models will hallucinate citations if the instruction is ambiguous. A production prompt template for RAG should be tested against cases where the correct answer is "I don't know" — this is where most RAG systems fail.
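
A sketch of a template that makes source identifiers and the refusal case explicit; the wording is illustrative, not a benchmarked prompt:

# Prompt assembly with per-chunk source identifiers and an explicit refusal rule.
def build_prompt(question, chunks):
    context = "\n\n".join(
        f"[{c['doc_id']}] {c['title']} ({c['date']})\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources as [doc_id] after each claim. "
        'If the answer is not in the context, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )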


§7

Generation and Output Handling

Generation is the step most engineers focus on and the step that matters least if retrieval is broken. A good generator cannot compensate for retrieving the wrong documents. Debug retrieval first — measure recall@k before measuring answer quality.

Output handling for RAG includes: faithfulness checking (did the model use the retrieved context or did it hallucinate?), citation validation (are the cited sources real and were they actually retrieved?), answer completeness checking (did the model address the full question?), and refusal detection (did the model correctly say "I don't know" when the answer is not in the context?).
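
Citation validation is mechanical if citations follow a fixed format; a sketch assuming the [doc_id] convention from the previous section:

# Every cited identifier must correspond to a chunk that was actually retrieved
# for this query; anything else is a fabricated citation.
import re

def invalid_citations(answer: str, retrieved_ids: set) -> list:
    cited = set(re.findall(r"\[([^\]]+)\]", answer))
    return sorted(cited - retrieved_ids)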

A production RAG system should log every query with the retrieved chunks, the final prompt, and the generated output. This logging is essential for debugging failures and building new eval cases. Without it, you are debugging blind. Storage cost for this logging is real but small compared to the cost of shipping a system you cannot diagnose.
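
A sketch of that logging as one JSON record per query, which is enough to replay failures and mine new eval cases:

# Append the query, retrieved chunk ids, the exact prompt, and the answer.
import json, time

def log_rag_call(path, query, retrieved_ids, prompt, answer):
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved_ids": retrieved_ids,
        "prompt": prompt,
        "answer": answer,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")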


FAQ

Frequently asked questions

What is RAG in simple terms?
RAG (Retrieval-Augmented Generation) is a pattern where, instead of relying solely on what an LLM learned during training, you retrieve relevant documents from a database at query time and include them in the prompt. The model then generates an answer using those documents as context. This lets you build systems that answer questions about your specific data — product documentation, internal knowledge bases, recent news — without retraining the model. The tradeoff is that the quality of answers depends heavily on the quality of retrieval. If you retrieve the wrong documents, the model will answer based on wrong information or refuse to answer.
When should you use RAG vs. fine-tuning?
Use RAG when your primary need is access to external or updated knowledge — documents the model was not trained on, or information that changes frequently. Use fine-tuning when your primary need is a change in behavior — a different response style, a specialized output format, or improved performance on a specific task type where you have thousands of labeled examples. The common mistake is using fine-tuning to teach the model facts (it memorizes them poorly and they go stale) or using RAG to change response style (it requires constant prompt engineering and is brittle). When you need both knowledge grounding and behavior change, combine both: fine-tune for behavior, use RAG for knowledge.
What vector database should I use for RAG?
Start with pgvector if you already run Postgres. It handles up to a few million vectors with acceptable latency, requires no new operational expertise, and supports hybrid search when combined with Postgres full-text search or ParadeDB's pg_search extension. Move to Qdrant if you need high-performance filtered search at scale, or if you need self-hosted infrastructure with a strong Rust-based performance profile. Use Pinecone if you need a fully managed service and cost is not the primary constraint. Avoid Chroma in production — it is built for local development and lacks operational maturity. Weaviate is a reasonable choice if you need its module ecosystem (multimodal search, integrated re-ranking), but the operational overhead is higher than Qdrant.
How do you measure RAG quality?
Measure at two levels: retrieval quality and generation quality. Retrieval quality: Recall@k (what fraction of relevant documents are in the top k results?) and MRR (Mean Reciprocal Rank, where in the result list does the first relevant document appear?). Generation quality: faithfulness (does the answer use only information from the retrieved context?), answer relevance (does the answer address the question?), and answer completeness (does it address all parts of a multi-part question?). For automated measurement, RAGAS is the standard framework. It uses an LLM-as-judge approach to score faithfulness and relevance, which is imperfect but practical. Ground truth evaluation with human-labeled answer-context pairs is the gold standard but expensive to build.
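
Both retrieval metrics are simple to compute over a labeled set of (retrieved ids, relevant ids) pairs; a sketch:

# Recall@k: fraction of relevant documents that appear in the top-k results.
# MRR: mean of 1 / rank of the first relevant document, counting 0 when none appears.
def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
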
What are the most common RAG failures in production?
The five most common failures: (1) Retrieval miss — the relevant document is in the corpus but is not retrieved because the embedding does not capture the query-document relationship. Fix: hybrid retrieval, better chunking, or a domain-adapted embedding model. (2) Context window overflow — too many chunks are retrieved and the most relevant ones are truncated. Fix: reduce k or implement context compression. (3) Faithfulness failure — the model ignores retrieved context and generates from training knowledge. Fix: stronger system prompt instructions and faithfulness evals. (4) Stale index — documents were updated but the index was not re-ingested. Fix: incremental ingestion with change detection. (5) Refusal hallucination — the model says 'I don't know' when the answer is in the retrieved context. Fix: improve context formatting and system prompt instructions.
