UpSkillZone

Fundamentals · 2026-05-06

Chunking Strategies for RAG: Fixed, Semantic, Hierarchical, and When to Use Each

How you chunk documents is the single highest-leverage decision in a RAG system. This guide covers every major strategy, their tradeoffs, and how to decide without running 40 experiments.


§1

Why Chunking Matters More Than People Think

Chunking determines the fundamental unit of retrieval. If a relevant answer spans a 2,000-word section but your chunks are 200 tokens, you may retrieve 2–3 pieces of the answer but miss the connective logic that ties them together. If your chunks are 2,000 tokens and the relevant sentence is at position 1,800, the chunk is retrieved but the LLM must attend to a lot of noise to find it — and with the lost-in-the-middle effect, it may not.

The embedding model that encodes each chunk is trained to capture the semantic meaning of a text passage, and that fidelity degrades as passage length increases: very long chunks have averaged-out representations that match queries at a topical level but not at a specific-answer level. Very short chunks have high specificity but lose the surrounding context that disambiguates meaning. The optimal chunk length for your use case depends on how specific your queries are and how localized the answers are.

Chunking is also the decision you can change most easily after deployment. Unlike the embedding model or the vector store, chunking is a data transformation you can re-run. This makes it the first lever to tune when retrieval quality is poor — before changing the embedding model or the retrieval algorithm.


§2

Fixed-Size Chunking: Simple, Predictable, Often Wrong

Fixed-size chunking splits documents every N tokens (or characters) with an optional overlap. LangChain's RecursiveCharacterTextSplitter with chunk_size=1000, chunk_overlap=200 is the de facto default for tutorials, and it is the worst strategy for most production use cases. It splits mid-sentence, mid-table, and mid-code-block with no awareness of document structure.

Fixed chunking is appropriate when documents have no meaningful structure (raw transcripts, unstructured notes) or when you need absolute predictability for index management (each chunk is exactly N tokens, which simplifies cost forecasting and index size estimation). For structured documents — PDFs with sections, Markdown with headers, HTML with semantic elements — fixed chunking discards the structure that makes retrieval precise.

The overlap parameter is a rough band-aid for the boundary problem. A 200-token overlap means that a sentence that falls on a chunk boundary appears in two chunks. This increases recall at the cost of index size and potential duplicate context in retrieved results. If you must use fixed chunking, tune overlap based on your average sentence length — the overlap should be at least as long as your longest relevant sentence.
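To make the mechanics concrete, here is a minimal, dependency-free sketch of fixed-size chunking with overlap. It approximates tokens with whitespace-delimited words, which is an assumption for illustration only; production code would count tokens with the embedding model's own tokenizer, or use a library splitter such as the LangChain one mentioned above.

def fixed_size_chunks(text, chunk_size=1000, overlap=200):
    """Split text into chunks of roughly chunk_size tokens, repeating the
    last `overlap` tokens of each chunk at the start of the next one.
    Whitespace words stand in for real tokenizer tokens (an assumption)."""
    tokens = text.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # avoid a trailing chunk made entirely of overlap
    return chunks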


§3

Sentence and Paragraph Chunking

Sentence chunking uses sentence boundary detection (spaCy, NLTK, or a rule-based splitter) to create chunks that align with natural language units. Each chunk is one or more complete sentences. This eliminates the mid-sentence split problem and produces chunks that are more coherent than fixed-size alternatives. Paragraph chunking extends this to paragraph boundaries.

The challenge with sentence chunking is high variability in chunk size. A document with many short sentences produces chunks of 20–30 tokens. A document with complex multi-clause sentences produces chunks of 100+ tokens. This variability means your embedding representations are computed on texts of very different lengths, which can produce inconsistent similarity scores.

Paragraph chunking is a reasonable default for documents with clear paragraph structure (news articles, academic papers, documentation). It preserves the author's intentional grouping of related sentences. The downside is that paragraphs vary enormously in length — a one-sentence paragraph and a ten-sentence paragraph are both "chunks" and have very different information density. Combine paragraph chunking with a max-token guardrail: if a paragraph exceeds N tokens, split it at sentence boundaries.
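A minimal sketch of paragraph chunking with that max-token guardrail. It assumes blank lines separate paragraphs, uses a simple regex as the sentence splitter (a stand-in for spaCy or NLTK), and again uses whitespace words as a rough token proxy.

import re

def paragraph_chunks(text, max_tokens=400):
    """Chunk on blank-line paragraph boundaries; if a paragraph exceeds
    max_tokens, split it further at sentence boundaries."""
    chunks = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para.split()) <= max_tokens:
            chunks.append(para)
            continue
        # Guardrail: pack sentences of an oversized paragraph into sub-chunks.
        sentences = re.split(r"(?<=[.!?])\s+", para)
        current = []
        for sent in sentences:
            if current and len(" ".join(current + [sent]).split()) > max_tokens:
                chunks.append(" ".join(current))
                current = []
            current.append(sent)
        if current:
            chunks.append(" ".join(current))
    return chunks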


§4

Semantic Chunking With Embedding Similarity

Semantic chunking uses embedding similarity between adjacent sentences to detect topic boundaries. The algorithm: embed each sentence, compute cosine similarity between adjacent pairs, find breakpoints where similarity drops below a threshold (or where the drop is a local minimum), and create chunks at those breakpoints. Chunks produced this way are topically coherent — each chunk is about one thing.

The Greg Kamradt semantic chunking implementation uses a percentile threshold on similarity drops: breakpoints are placed where the similarity drop is above the 95th percentile of all drops in the document. This is adaptive to each document's internal coherence structure. Documents with sharp topic transitions produce many small chunks; documents with flowing prose produce fewer larger chunks.
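A minimal sketch of that percentile-threshold breakpoint logic. The embed function here is a placeholder for whatever sentence-embedding call you use (an assumption), and NumPy handles the cosine similarities.

import numpy as np

def semantic_chunks(sentences, embed, percentile=95):
    """Group sentences into chunks, breaking where the cosine similarity
    between adjacent sentence embeddings drops most sharply.

    `embed` is assumed to map a list of strings to a 2-D array of vectors."""
    vectors = np.asarray(embed(sentences))
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    # Cosine similarity between each sentence and the next.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    # Larger drop = sharper topic change; threshold at the chosen percentile.
    drops = 1.0 - sims
    threshold = np.percentile(drops, percentile)
    breakpoints = [i + 1 for i, d in enumerate(drops) if d > threshold]
    # Slice the sentence list at the breakpoints.
    chunks, start = [], 0
    for bp in breakpoints + [len(sentences)]:
        chunks.append(" ".join(sentences[start:bp]))
        start = bp
    return chunks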

Semantic chunking is most valuable for long, multi-topic documents where fixed or paragraph chunking would mix topics within a chunk. The tradeoff is computational cost: you must embed every sentence before chunking, which adds latency to the ingestion pipeline. For a 100-page PDF, semantic chunking adds ~2–5 seconds of processing versus milliseconds for fixed chunking. Acceptable for batch ingestion, not for real-time document processing.


§5

Hierarchical and Parent-Child Chunking

Parent-child chunking creates two levels of representation: large parent chunks for context and small child chunks for precise retrieval. At retrieval time, you retrieve small child chunks (high precision), but you return the parent chunk to the LLM (high context). This combines the precision of small-chunk retrieval with the context richness of large-chunk generation.

The implementation: split documents into parent chunks of 1000–2000 tokens. Split each parent chunk into child chunks of 100–300 tokens. Index the child chunks in the vector store with a pointer to the parent chunk ID. At query time, retrieve child chunks, dereference to their parent chunks, deduplicate parent chunks, and pass parents to the LLM. The retrieval precision comes from small child embeddings; the context quality comes from large parent text.
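A minimal sketch of that indexing and retrieval flow. The splitters are any callables that map text to a list of chunks, and vector_store.search stands in for whichever vector store client you use; both are assumptions, not a specific library's API.

import uuid

def build_parent_child_index(documents, parent_splitter, child_splitter):
    """Index small child chunks for retrieval while keeping large parent
    chunks for context. Child records carry a pointer to their parent."""
    parent_store = {}      # parent_id -> parent text
    child_records = []     # entries to embed and load into the vector store
    for doc in documents:
        for parent in parent_splitter(doc):
            parent_id = str(uuid.uuid4())
            parent_store[parent_id] = parent
            for child in child_splitter(parent):
                child_records.append({"text": child, "parent_id": parent_id})
    return parent_store, child_records

def parents_for_query(query, vector_store, parent_store, k=8):
    """Retrieve child chunks, then dereference and deduplicate their parents.
    `vector_store.search` is a placeholder interface (an assumption)."""
    hits = vector_store.search(query, top_k=k)
    seen, parents = set(), []
    for hit in hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(parent_store[pid])
    return parents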

Hierarchical chunking is the right pattern for documents where answers require context that is not localized to a single sentence: technical specifications where terms are defined earlier in the section, legal documents where clauses reference definitions from earlier pages, and narrative documents where answers depend on preceding context. The index size cost is low: child chunks are small, and you only store parent text once with a reference.


§6

Document-Structure-Aware Chunking

The most precise chunking strategy uses the document's own structure as chunk boundaries: Markdown headers, HTML section tags, PDF bookmarks, or DOCX section breaks. A section under an H2 heading becomes one chunk. A table becomes one chunk. A code block becomes one chunk. This preserves the author's intended organization and ensures that structurally coherent units are retrieved together.

Implementing structure-aware chunking requires document-type-specific parsers. For Markdown, splitting on heading levels (H1, H2, H3) with a max-size fallback is straightforward. For PDFs, you need a parser that preserves heading hierarchy (PyMuPDF with layout analysis, or Docling). For HTML, use the DOM structure with semantic element boundaries (article and section elements, excluding navigation elements).
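For the Markdown case, a minimal sketch of header splitting that keeps the heading as metadata might look like this. It assumes ATX-style # headings and omits the max-size fallback for brevity.

import re

def markdown_section_chunks(markdown_text, max_level=3):
    """Split a Markdown document at heading boundaries (H1-H3 by default)
    and keep the heading text as metadata for filtering and citation."""
    heading_re = re.compile(rf"^(#{{1,{max_level}}})\s+(.*)$")
    chunks, heading, lines = [], None, []

    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"heading": heading, "text": body})

    for line in markdown_text.splitlines():
        match = heading_re.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks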

A key advantage of structure-aware chunking is that the heading becomes metadata you can use for filtering and citation. When you retrieve a chunk from the section "Drug Interactions" of a medication guide, you can cite the section name along with the document name. This makes the system's citations verifiable by users, which builds trust.


§7

How to Choose a Chunking Strategy

Start with your document type and query distribution. If your documents are structured (Markdown, HTML, DOCX with headings), use structure-aware chunking. If your documents are unstructured long-form text (PDFs, transcripts), use semantic chunking. If your documents are mixed or you need simplicity, use paragraph chunking with a token-size guardrail. Only use fixed-size chunking if you have a specific reason — it should not be the default.

Layering matters: combine strategies based on query specificity. If users ask precise factual questions (who, what, when), smaller chunks (100–300 tokens) improve precision. If users ask for explanations or summaries (how, why), larger chunks (500–1000 tokens) improve completeness. Parent-child chunking handles both by using small chunks for retrieval and large chunks for context.

Evaluate your chunking strategy using Recall@5 on a held-out query set. Run the same queries against three different chunking strategies, measure which strategy finds the relevant document in the top 5 results most often, and use that strategy. Do not tune chunk size by intuition — it is one of the few RAG parameters you can measure directly against a clear metric.
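A minimal sketch of that comparison. The eval_set and retrieve interfaces are assumptions: pairs of (query, relevant chunk id) and a function returning ranked chunk ids, respectively.

def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of queries whose labeled relevant chunk appears in the
    top-k results. `eval_set` is a list of (query, relevant_chunk_id) pairs
    and `retrieve` maps a query to a ranked list of chunk ids."""
    hits = 0
    for query, relevant_id in eval_set:
        if relevant_id in retrieve(query)[:k]:
            hits += 1
    return hits / len(eval_set)

# Hold everything else constant and compare strategies, e.g.:
# for name, retrieve in retrievers_by_strategy.items():
#     print(name, recall_at_k(eval_set, retrieve, k=5))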


FAQ

Frequently asked questions

What chunk size should I use for RAG?
The right chunk size depends on your documents and queries. For precise factual queries (what is the value of X?), smaller chunks (100–300 tokens) produce better retrieval precision, because the relevant sentence is not buried in noise. For explanatory queries (how does X work?), larger chunks (500–1000 tokens) produce better context completeness. A practical starting point for most production applications is a parent-child setup with child chunks of 100–300 tokens and parent chunks of 1000–2000 tokens, matching the ranges in §5. Measure Recall@5 on your actual queries and tune from there. Chunk size is one of the few RAG parameters that responds predictably to systematic measurement.
What is semantic chunking?
Semantic chunking uses embedding similarity between adjacent sentences to find natural topic boundaries in a document. Instead of splitting at fixed intervals or paragraph breaks, it embeds each sentence and identifies where the topic changes by looking for drops in cosine similarity between adjacent sentence embeddings. Chunks created this way are topically coherent — each chunk is about one concept. The tradeoff is computational cost during ingestion (you must embed every sentence before you can chunk) and variability in chunk size (different topics have different lengths). Semantic chunking is best for long multi-topic documents where other strategies would mix topics within a single chunk.
What is parent-child chunking?
Parent-child chunking creates two levels of document representation: small child chunks (100–300 tokens) for high-precision retrieval, and large parent chunks (1000–2000 tokens) for rich context delivery to the LLM. During indexing, child chunks are embedded and stored in the vector store with a pointer to their parent chunk. During retrieval, you query using child chunk embeddings (which are precise), then dereference to the parent chunk (which has full context) before passing to the LLM. This gives you the best of both worlds: precise retrieval from small chunks and complete context from large chunks. It is the recommended default for production RAG systems that handle varied query types.
Does chunking strategy affect retrieval recall?
Significantly. Chunking strategy is often the largest single factor in retrieval recall for structured documents. A mismatch between chunk boundaries and answer boundaries means the answer is split across chunks and no single chunk retrieves well for the query. For example, if a drug interaction is described across two sentences that span a fixed-size chunk boundary, neither chunk embeds the full interaction and retrieval recall drops. Structure-aware chunking, which aligns chunk boundaries with document structure, consistently outperforms fixed-size chunking on Recall@5 for structured documents like documentation, specifications, and research papers. The improvement is typically 10–25 percentage points in Recall@5.
How do I evaluate which chunking strategy is best?
Build a small held-out evaluation set of 50–100 query-document pairs: real queries paired with the specific document chunk that contains the correct answer. Run your RAG pipeline with each chunking strategy, measure Recall@5 (what fraction of queries have the correct chunk in the top 5 results), and compare strategies. Use the same embedding model, vector store, and retrieval algorithm across strategies so chunking is the only variable. A 5-percentage-point difference in Recall@5 is typically significant — chunking strategies that differ by less than that are not meaningfully different for your use case. Total evaluation time for a 100-query set across three strategies is typically under one hour using a caching layer.
