Chunking Strategies

Making AI Useful 7 min read

In Short

Chunking is the process of splitting documents into smaller segments before embedding them for retrieval. The strategy you choose determines how well the retrieval system can find the right passage, because embeddings represent meaning at the chunk level. "Your chunking strategy is your retrieval strategy" (Redis Engineering, 2025).

A document is cleaned, split into chunks by a chosen strategy (fixed-size and recursive chunking use 10 to 20 percent overlap), tagged with metadata, embedded, and stored in a vector database. At query time, small leaf chunks are matched for precision while their larger parent chunks are returned to the language model.

01. What It Is

Before a document can be stored in a vector database, it must be split into segments called chunks. Each chunk is converted to a vector embedding by an embedding model, and that embedding represents the semantic content of the chunk. At query time, the user's query is also embedded, and the most similar chunk vectors are retrieved.
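The query-time matching step can be illustrated with a minimal sketch. The three-dimensional vectors and chunk ids below are made up for illustration; a real system would obtain vectors from an embedding model and would normally delegate the search to a vector database.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors standing in for real embeddings (assumption, not model output).
chunk_vecs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "warranty": [0.7, 0.2, 0.3],
}
query_vec = [1.0, 0.0, 0.1]
print(top_k(query_vec, chunk_vecs))  # ['refund-policy', 'warranty']
```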

Chunking is not a preprocessing detail. It is a core architectural decision that affects retrieval precision, context quality sent to the LLM, index size, ingestion cost, and latency. A 2025 Vectara study across 25 chunking configurations and 48 embedding models found that chunking configuration influences retrieval quality as much as embedding model choice.

02. Why It Matters

Too small: The embedding lacks sufficient context to represent a coherent idea. Retrieval becomes noisy and individual chunks are too narrow to be useful on their own.

Too large: The embedding averages over many ideas, diluting the signal. The chunk may match on one topic but carry irrelevant content that confuses the LLM.

No overlap: Context at chunk boundaries is lost. A sentence that spans two adjacent chunks may be retrieved without its completing thought.

No metadata: The chunk becomes a floating fragment with no location information, making citation impossible and post-retrieval filtering harder.

The specific failure mode depends on your query distribution. Fact-based queries need tight, specific chunks (64 to 256 tokens). Narrative comprehension queries need broader context (512 to 1024 tokens). There is no universally correct chunk size, only the right size for your workload.

03. How It Works

Chunking is applied during the ingestion pipeline, before embedding. The general sequence:

  1. Load the raw document.
  2. Clean formatting artifacts, remove boilerplate.
  3. Apply the chosen chunking strategy to produce segments.
  4. Attach metadata to each chunk (title, source, section heading, page number).
  5. Embed each chunk.
  6. Store vectors and original text in the vector database.
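The six steps above can be sketched end to end as follows. Everything here is a simplified stand-in: `toy_embed` is not a real embedding model, the in-memory list is not a vector database, and the function and field names are hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict
    vector: list[float] = field(default_factory=list)

def clean(raw: str) -> str:
    """Step 2: collapse runs of whitespace (a stand-in for real cleaning)."""
    return re.sub(r"\s+", " ", raw).strip()

def split_fixed(text: str, size: int) -> list[str]:
    """Step 3: the simplest strategy, fixed word windows without overlap."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def toy_embed(text: str) -> list[float]:
    """Step 5 placeholder: NOT a real embedding, just two cheap text statistics."""
    return [float(len(text)), float(len(text.split()))]

def ingest(raw: str, title: str, source: str, size: int = 5) -> list[Chunk]:
    """Steps 2 to 6; step 1 (loading) is assumed done, `raw` is the loaded text."""
    store = []  # Step 6: an in-memory list stands in for the vector database.
    for position, piece in enumerate(split_fixed(clean(raw), size)):
        meta = {"title": title, "source": source, "position": position}  # Step 4
        store.append(Chunk(text=piece, metadata=meta, vector=toy_embed(piece)))
    return store

store = ingest(
    "Chunking   splits documents\n into segments before they are embedded.",
    title="Chunking Strategies",
    source="notes.txt",
)
for chunk in store:
    print(chunk.metadata["position"], repr(chunk.text))
```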

04. Key Strategies

Fixed-Size Chunking

Splits text at a fixed character, word, or token count regardless of content structure. The simplest approach to implement.

Pros: Fast, predictable, no external calls at ingestion time, good baseline for prototyping.

Cons: Breaks semantic units mid-sentence or mid-paragraph, poor for structured documents.

Recommended for: Homogeneous plain-text corpora where ingestion speed matters more than precision. Use as a starting baseline before evaluating alternatives.

A 2025 Vectara/NAACL paper found that fixed 200-word chunks matched or outperformed semantic chunking on general-purpose retrieval and answer generation tasks, which is a useful reminder to benchmark before assuming complexity is needed.
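A minimal sketch of fixed-size chunking by word count, with optional overlap. Splitting on whitespace is an assumption made for simplicity; a production pipeline would more likely count tokens with the embedding model's tokenizer.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 0) -> list[str]:
    """Split text into windows of `chunk_size` words, repeating `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(str(i) for i in range(10))
print(fixed_size_chunks(sample, chunk_size=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Note that the window boundaries ignore sentences and paragraphs entirely, which is exactly the weakness listed under Cons.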

Recursive Character Chunking

Applies a prioritized list of separators in order: paragraph breaks, then line breaks, then spaces. If a resulting segment exceeds the target size, it is recursively split using the next delimiter in the list.

Pros: Respects natural language structure better than fixed-size while remaining simple. No embedding API calls at ingestion. Commonly used as the production default in LangChain and LlamaIndex.

Cons: Produces variable-length chunks. Does not detect topic shifts.

Recommended for: General-purpose RAG pipelines as the starting point. The benchmark-validated default is 512 tokens with 10 to 20 percent overlap (roughly 50 to 100 tokens).
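The recursive idea can be sketched in a few lines. This is a simplified illustration, not the LangChain or LlamaIndex implementation: it measures size in characters rather than tokens, adds no overlap, and falls back to a hard cut when no separator is left.

```python
def merge_small(pieces: list[str], max_chars: int, joiner: str) -> list[str]:
    """Greedily re-join neighbouring pieces while the result still fits."""
    merged: list[str] = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(joiner) + len(piece) <= max_chars:
            merged[-1] = merged[-1] + joiner + piece
        else:
            merged.append(piece)
    return merged

def recursive_split(text: str, max_chars: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split on the first separator, recursing with the rest where pieces are too long."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard cut at the size limit.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, rest = separators[0], separators[1:]
    pieces: list[str] = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, max_chars, rest))
    return merge_small(pieces, max_chars, head)

doc = "aaa bbb\n\nccc ddd eee\nfff"
print(recursive_split(doc, max_chars=8))
# ['aaa bbb', 'ccc ddd', 'eee\nfff']
```

The first paragraph fits and is kept whole; the second is too long, so it is split on the line break and then on spaces, and small neighbours are merged back together up to the limit.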

Semantic Chunking

Uses an embedding model during ingestion to detect topic shifts. Sentence-level embeddings are computed and cosine similarity between adjacent sentences is measured. Where similarity drops below a threshold, a chunk boundary is placed.

Pros: Chunks stay thematically coherent. Respects topic transitions that paragraph breaks do not capture.

Cons: Requires embedding API calls during ingestion (higher cost and latency). Research results are mixed: a NAACL 2025 paper found that computational costs are not consistently justified by retrieval gains across general corpora. A domain-specific clinical study (MDPI Bioengineering, November 2025) found that adaptive chunking aligned to logical topic boundaries hit 87% accuracy versus 13% for fixed-size, illustrating that the benefit is domain-dependent.

Recommended for: High-value corpora where semantic coherence is critical (clinical notes, scientific papers, legal documents) and where you can afford to benchmark the gain. Always compare against recursive chunking on your specific corpus before committing.
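The boundary rule can be illustrated with a small sketch. To stay self-contained it uses bag-of-words counts as a toy replacement for a sentence embedding model, and the 0.2 threshold is an arbitrary illustrative value; both are assumptions, and a real pipeline would call an embedding model and tune the threshold.

```python
import math
import re
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    """Bag-of-words counts; a placeholder for a real sentence embedding model."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append([cur])      # similarity dropped: place a boundary
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return chunks

sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Interest rates rose sharply.",
    "Banks raised interest rates again.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

With these toy vectors, the only boundary falls between the second and third sentences, where the two share no words.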

Document-Structure-Based Chunking

Derives boundaries from the document's own structure: headings, sections, paragraphs, HTML tags, markdown headers, or table rows. Each structural unit becomes a chunk, up to a maximum size.

Pros: Preserves semantic meaning from the document hierarchy. Keeps tables, code blocks, and numbered lists intact.

Cons: Requires structured input. Not applicable to raw unstructured text.

Recommended for: Legal filings, technical manuals, API documentation, financial reports, HTML pages with clear heading hierarchies.
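As an illustration for markdown input, the sketch below starts a new chunk at every heading and keeps the heading as a field on the chunk. It is deliberately minimal: it does not enforce a maximum chunk size, and it would misread a `#` comment inside a fenced code block as a heading. The function name and output shape are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_by_headings(markdown: str) -> list[dict]:
    """One chunk per heading; text before the first heading gets heading None."""
    chunks: list[dict] = []
    heading, lines = None, []

    def flush() -> None:
        text = "\n".join(lines).strip()
        if heading is not None or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that just ended
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()  # close the final section
    return chunks

doc = "# Install\nRun the installer.\n\n## Options\n| flag | meaning |\n| -v | verbose |\n"
for chunk in split_by_headings(doc):
    print(chunk)
```

Because the table rows stay inside the "Options" section, the table is never split mid-row.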

Sentence and Paragraph Chunking

Chunks are aligned to natural linguistic units: individual sentences or single paragraphs. Often used as a base level for hierarchical strategies.

Pros: Each chunk has natural grammatical completeness. Works well with sentence-transformer embedding models that were trained on sentence-length inputs.

Cons: Sentences in isolation may lose context from their surrounding paragraph. Paragraphs can vary widely in length.
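A rough sketch of both granularities using only regular expressions. The sentence rule (split after `.`, `!`, or `?` followed by whitespace) is a naive assumption that breaks on abbreviations such as "Dr." or "e.g."; a dedicated sentence segmenter is the safer choice in practice.

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Paragraphs are separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text: str) -> list[str]:
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

text = "Chunking matters. Does size matter? Yes!\n\nOverlap helps at boundaries."
print(split_paragraphs(text))
print(split_sentences(split_paragraphs(text)[0]))
```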

LLM-Driven (Agentic) Chunking

An LLM converts the document into atomic propositions (standalone statements that each convey a single fact), then groups those propositions into coherent chunks.

Pros: Highest semantic precision. Handles documents with irregular structure that confuse rule-based approaches.

Cons: Highest cost and latency. Requires a well-crafted extraction prompt. LLM output quality determines chunk quality.

Recommended for: High-value, low-volume corpora where ingestion cost is acceptable. Examples: financial earnings call transcripts, complex regulatory documents.

05. Overlap

Overlap is the number of tokens repeated between the end of one chunk and the start of the next. Its purpose is to preserve context at chunk boundaries so that a sentence split across two segments is readable in both.

A standard starting point is 10 to 20 percent overlap. For 512-token chunks, that means roughly 50 to 100 tokens of overlap. Higher overlap increases index size and may introduce retrieval duplication. Lower overlap risks missing boundary context. Treat overlap as a tunable hyperparameter alongside chunk size.
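The arithmetic behind that tradeoff can be worked through directly. The 10,000-token document length is an arbitrary example, and the helper name is hypothetical.

```python
import math

def overlap_plan(doc_tokens: int, chunk_size: int, overlap_ratio: float) -> dict:
    """Overlap in tokens, stride between chunk starts, and resulting chunk count."""
    overlap = round(chunk_size * overlap_ratio)
    stride = chunk_size - overlap  # how far each new chunk advances
    if doc_tokens <= chunk_size:
        chunks = 1
    else:
        chunks = 1 + math.ceil((doc_tokens - chunk_size) / stride)
    return {"overlap_tokens": overlap, "stride": stride, "chunks": chunks}

for ratio in (0.0, 0.10, 0.20):
    print(ratio, overlap_plan(doc_tokens=10_000, chunk_size=512, overlap_ratio=ratio))
# 0.0 -> 0 overlap tokens, 20 chunks
# 0.1 -> 51 overlap tokens, 22 chunks
# 0.2 -> 102 overlap tokens, 25 chunks
```

For this example, moving from no overlap to 20 percent overlap grows the index from 20 to 25 chunks, a 25 percent increase in vectors to embed and store.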

06. Chunk Size Tradeoffs

Query type | Recommended size | Rationale
Precise fact lookup | 64 to 256 tokens | Tight scope, improves fact recall by 10 to 15%
Narrative or comprehension | 512 to 1024 tokens | Broader context preserves reasoning flow
General baseline | 512 tokens | Balanced starting point for most pipelines
Context fed to LLM | Under 2500 tokens total | Beyond this, generation quality degrades

The best practice for production systems is to index small chunks for retrieval precision and return the surrounding larger parent chunk to the LLM for generation context. This is called parent-document retrieval.

07. Metadata

Every chunk should carry metadata fields attached at ingestion time:

  • Document title
  • Section heading the chunk falls under
  • Page number or character offset
  • Source URL or file path
  • Date or version of the source document
  • Any domain-specific identifiers (case number, product SKU, article ID)

Metadata serves two purposes. First, it enables hybrid filtering during retrieval (for example, retrieving only chunks from documents dated after a certain date, or from a specific product category). Second, it allows the system to return citations alongside generated answers. A 2025 study found that metadata enrichment on chunks boosts QA accuracy from roughly 50-60 percent to 72-75 percent without any change to the retrieval architecture.
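Both uses can be shown in a short sketch. The `Chunk` fields, the sample records, and the similarity scores are all invented for illustration; in a real system the scores would come from the vector search and the filter would usually be pushed down into the vector database query.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    title: str
    section: str
    source: str
    published: date
    score: float = 0.0  # similarity score, assumed to be filled in by the retriever

def filter_and_cite(chunks: list[Chunk], after: date, top_k: int = 2) -> list[tuple[str, str]]:
    """Keep chunks published after a date, rank by score, return (text, citation)."""
    eligible = [c for c in chunks if c.published > after]
    ranked = sorted(eligible, key=lambda c: c.score, reverse=True)[:top_k]
    return [(c.text, f"{c.title}, {c.section} ({c.source})") for c in ranked]

chunks = [
    Chunk("Refunds take 14 days.", "Returns Policy", "Refunds",
          "policies/returns.md", date(2025, 3, 1), 0.91),
    Chunk("Refunds take 30 days.", "Returns Policy (old)", "Refunds",
          "archive/returns-2023.md", date(2023, 1, 10), 0.95),
    Chunk("Shipping takes 5 days.", "Shipping Guide", "Delivery",
          "guides/shipping.md", date(2025, 2, 1), 0.40),
]
for text, citation in filter_and_cite(chunks, after=date(2024, 1, 1)):
    print(text, "<-", citation)
```

The outdated chunk has the highest similarity score, but the date filter removes it before ranking, which is the kind of correction that is impossible once metadata has been stripped.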

08. Parent-Document Retrieval

Parent-document retrieval is a pattern introduced in LangChain and LlamaIndex that addresses the tension between retrieval precision and generation context. Small leaf chunks (64 to 256 tokens) are embedded and indexed for retrieval. Each leaf chunk carries a reference to its parent chunk (512 to 2048 tokens), which is stored separately. At query time:

  1. Retrieve top-k small leaf chunks by vector similarity.
  2. Follow the parent reference for each matched leaf.
  3. Return the larger parent chunks to the LLM.

This gives retrieval the precision benefits of small chunks and gives generation the contextual breadth of large chunks. LlamaIndex's Auto-merging Retriever implements a variant: if enough leaf chunks from the same parent are retrieved, they are automatically merged into the parent before being sent to the LLM.
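The three query-time steps can be sketched without any framework. This is not the LangChain or LlamaIndex API: the data layout is hypothetical, and a toy word-overlap score stands in for vector similarity.

```python
import re

def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared words (stand-in for vector similarity)."""
    tokenize = lambda s: set(re.findall(r"\w+", s.lower()))
    return len(tokenize(query) & tokenize(text))

def retrieve_parents(query: str, leaves: list[dict], parents: dict[str, str], k: int = 2) -> list[str]:
    # Step 1: retrieve the top-k small leaf chunks.
    ranked = sorted(leaves, key=lambda leaf: word_overlap(query, leaf["text"]), reverse=True)[:k]
    # Steps 2 and 3: follow each parent reference, returning each parent once.
    seen, result = set(), []
    for leaf in ranked:
        parent_id = leaf["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result

parents = {
    "p1": "Refunds are issued within 14 days. Contact support to start a refund.",
    "p2": "Shipping takes 3 to 5 business days. Express shipping is available.",
}
leaves = [
    {"text": "Refunds are issued within 14 days.", "parent_id": "p1"},
    {"text": "Contact support to start a refund.", "parent_id": "p1"},
    {"text": "Shipping takes 3 to 5 business days.", "parent_id": "p2"},
    {"text": "Express shipping is available.", "parent_id": "p2"},
]
print(retrieve_parents("refunds within 14 days or contact support", leaves, parents))
```

Both top-ranked leaves point to the same parent, so the LLM receives that parent once rather than two overlapping fragments.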

09. Emerging Techniques

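Late chunking: The full document is first passed through a long-context embedding model, so every token embedding reflects the surrounding document (long-range context is preserved by the attention mechanism). Chunk vectors are then produced by pooling the token embeddings within each segment. Shows roughly 3 percent relative improvement on long-document retrieval. Constrained by the embedding model's context window.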
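The pooling step of late chunking can be sketched as follows, assuming the embedding model exposes one contextualized vector per token. The two-dimensional token vectors and the span boundaries are invented for illustration, and mean pooling is one common choice rather than the only option.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors, dimension by dimension."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def late_chunk(token_vectors: list[list[float]], spans: list[tuple[int, int]]) -> list[list[float]]:
    """Pool document-level token vectors into one vector per (start, end) token span."""
    return [mean_pool(token_vectors[start:end]) for start, end in spans]

# Pretend these came from ONE forward pass over the whole document (assumption).
token_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries chosen after embedding
print(late_chunk(token_vectors, spans))  # [[2.0, 1.0], [1.0, 3.0]]
```

The difference from conventional chunking is the order of operations: the boundaries are applied to token vectors that have already seen the full document, instead of embedding each segment in isolation.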

Contextual retrieval (Anthropic, 2024): A short LLM-generated description that situates the chunk within its source document is prepended to each chunk before embedding. This reduces retrieval failures by anchoring the chunk in its larger context. Cost: one LLM call per chunk at ingestion time.

Pseudo-instruction chunking (PIC): Uses a document-level summary to guide boundaries without per-chunk LLM calls. Across six QA datasets: PIC 58.4 hits@5 vs. semantic 56.0 vs. fixed-size 54.5.

10. Common Pitfalls

Chunks under 200 characters:
The embedding model has insufficient text to represent a coherent idea. Retrieval becomes noisy.

Chunks spanning multiple topics:
The vector representation is averaged over unrelated ideas. Retrieval matches on one topic but the chunk delivers mixed content to the LLM.

Ignoring document structure:
Splitting a table mid-row or separating a heading from its body paragraph destroys meaning that was structurally encoded.

Stripping metadata:
A chunk without source attribution cannot be cited and cannot be filtered by date, author, or category.

No benchmarking:
Choosing a chunking strategy without measuring its retrieval quality on representative queries is guessing. Implement a retrieval evaluation harness (Recall@k, MRR, answer faithfulness) and treat chunking configuration as a tunable parameter.
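A minimal evaluation harness for two of those metrics might look like the sketch below. The chunk ids and relevance labels are invented; in practice they would come from a labeled set of representative queries, and answer faithfulness would need a separate judge and is not covered here.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]], k: int) -> dict:
    """Average Recall@k and MRR over (retrieved ids, relevant ids) pairs."""
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(ret, rel, k) for ret, rel in runs) / n,
        "mrr": sum(reciprocal_rank(ret, rel) for ret, rel in runs) / n,
    }

runs = [
    (["c3", "c1", "c7"], {"c1"}),        # relevant chunk found at rank 2
    (["c2", "c9", "c4"], {"c4", "c8"}),  # first relevant chunk at rank 3
]
print(evaluate(runs, k=2))
```

Running the same query set against each candidate chunking configuration and comparing these numbers turns the choice of strategy into a measurement rather than a guess.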

Verified against primary sources

Every claim traces to a cited source below.

Key terms

Chunking
Splitting documents into smaller segments before embedding them for retrieval.
Overlap
Tokens repeated between the end of one chunk and the start of the next to preserve boundary context.
Parent-document retrieval
Index small chunks for precision, return the larger parent chunk to the LLM for context.
Semantic Chunking
Uses an embedding model to place boundaries where cosine similarity between adjacent sentences drops.
Metadata
Fields attached to each chunk at ingestion enabling hybrid filtering and citations.

Tags

#chunking #rag #retrieval #embeddings #vector-database #llm
