04. Key Strategies
Fixed-Size Chunking
Splits text at a fixed character, word, or token count regardless of content structure. The simplest approach to implement.
Pros: Fast, predictable, no external calls at ingestion time, good baseline for prototyping.
Cons: Breaks semantic units mid-sentence or mid-paragraph, poor for structured documents.
Recommended for: Homogeneous plain-text corpora where ingestion speed matters more than precision. Use as a starting baseline before evaluating alternatives.
A 2025 Vectara/NAACL paper found that fixed 200-word chunks matched or outperformed semantic chunking on general-purpose retrieval and answer generation tasks, which is a useful reminder to benchmark before assuming complexity is needed.
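As a minimal sketch of fixed-size chunking, the function below splits on whitespace and emits windows of a fixed word count. The name `fixed_size_chunks` and the optional `overlap` parameter are illustrative assumptions, not part of any particular library. The 200-word default mirrors the chunk size mentioned above.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `chunk_size` words (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Boundaries fall wherever the word count dictates, so a chunk can end mid-sentence. That is the trade-off described in the Cons above.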
Recursive Character Chunking
Applies a prioritized list of separators in order: paragraph breaks, then line breaks, then spaces. If a resulting segment exceeds the target size, it is recursively split using the next delimiter in the list.
Pros: Respects natural language structure better than fixed-size while remaining simple. No embedding API calls at ingestion. LangChain recommends its RecursiveCharacterTextSplitter for generic text, and LlamaIndex's default SentenceSplitter follows a similar separator-priority approach.
Cons: Produces variable-length chunks. Does not detect topic shifts.
Recommended for: General-purpose RAG pipelines as a starting point. The benchmark-validated default is 512 tokens per chunk with 10 to 20 percent overlap (roughly 50 to 100 tokens); tune both values against your own corpus.
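The recursive procedure can be sketched in plain Python as follows. This is a simplified illustration, not the LangChain or LlamaIndex implementation: it measures size in characters rather than tokens, applies no overlap, and `recursive_split` and `max_chars` are hypothetical names.

```python
def recursive_split(text: str, max_chars: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split on the highest-priority separator; recurse on pieces that are still too long."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut at max_chars.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: retry with the next separator in the list.
            chunks.extend(recursive_split(piece, max_chars, remaining))
    if current.strip():
        chunks.append(current)
    return chunks
```

Adjacent small pieces are merged back together up to the size limit, so a short paragraph stays whole while a long one is broken at line breaks and then at spaces.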
Semantic Chunking
Uses an embedding model during ingestion to detect topic shifts. Sentence-level embeddings are computed and cosine similarity between adjacent sentences is measured. Where similarity drops below a threshold, a chunk boundary is placed.
Pros: Chunks stay thematically coherent. Respects topic transitions that paragraph breaks do not capture.
Cons: Requires embedding API calls during ingestion, which raises cost and latency. Research results are mixed. A NAACL 2025 paper found that the computational cost is not consistently justified by retrieval gains on general corpora. By contrast, a domain-specific clinical study (MDPI Bioengineering, November 2025) found that adaptive chunking aligned to logical topic boundaries reached 87% accuracy versus 13% for fixed-size chunking, which suggests the benefit is domain-dependent.
Recommended for: High-value corpora where semantic coherence is critical (clinical notes, scientific papers, legal documents) and where you can afford to benchmark the gain. Always compare against recursive chunking on your specific corpus before committing.
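The boundary rule described above (start a new chunk where adjacent-sentence similarity drops below a threshold) can be sketched as follows. To keep the example runnable without an API key, `toy_embed` is a bag-of-words stand-in for a real embedding model. It, the `threshold` default, and the function names are all assumptions for illustration; in practice you would pass your own embedding function and tune the threshold.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]

def toy_embed(sentence: str) -> Vector:
    """Stand-in for a real embedding model: lowercase word counts."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2,
                    embed: Callable[[str], Vector] = toy_embed) -> list[str]:
    """Group consecutive sentences; break where adjacent similarity < threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            groups.append([sentence])   # similarity dropped: place a boundary
        else:
            groups[-1].append(sentence)
    return [" ".join(group) for group in groups]
```

Each sentence is embedded once at ingestion, which is where the extra cost and latency noted in the Cons come from.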
Document-Structure-Based Chunking
Derives boundaries from the document's own structure: headings, sections, paragraphs, HTML tags, markdown headers, or table rows. Each structural unit becomes a chunk, up to a maximum size.
Pros: Preserves semantic meaning from the document hierarchy. Keeps tables, code blocks, and numbered lists intact.
Cons: Requires structured input. Not applicable to raw unstructured text.
Recommended for: Legal filings, technical manuals, API documentation, financial reports, HTML pages with clear heading hierarchies.
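For markdown input, structure-based chunking might look like the sketch below. It assumes ATX-style headings (`#` through `######`), keeps fenced code blocks intact, and records the heading path of each section as metadata. `markdown_sections` is a hypothetical helper, and the sketch does not enforce a maximum size; an oversized section would need a secondary split.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def markdown_sections(markdown: str) -> list[dict]:
    """Return one chunk per heading-delimited section, with its heading path."""
    sections: list[dict] = []
    path: list[tuple[int, str]] = []   # (level, title) of the enclosing headings
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({"headings": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence    # never treat '#' inside a code fence as a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                    # close the previous section under its own path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections
```

Storing the heading path alongside the text lets you prepend it at embedding time, so a chunk from a deep subsection still carries the context of its parent headings.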
Sentence and Paragraph Chunking
Chunks are aligned to natural linguistic units: individual sentences or single paragraphs. Often used as a base level for hierarchical strategies.
Pros: Each chunk has natural grammatical completeness. Works well with sentence-transformer embedding models that were trained on sentence-length inputs.
Cons: Sentences in isolation may lose context from their surrounding paragraph. Paragraphs can vary widely in length.
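A minimal sketch of both granularities follows. Paragraphs are split on blank lines, and sentences are split with a naive punctuation regex, which is an assumption made for brevity: it will mis-split abbreviations such as "Dr." and a production pipeline would likely use a proper sentence tokenizer. The function names are hypothetical.

```python
import re

_PARAGRAPH_BREAK = re.compile(r"\n\s*\n")
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")   # naive: whitespace after ., ! or ?

def paragraph_chunks(text: str) -> list[str]:
    """One chunk per blank-line-separated paragraph."""
    return [p.strip() for p in _PARAGRAPH_BREAK.split(text) if p.strip()]

def sentence_chunks(text: str) -> list[dict]:
    """One chunk per sentence, tagged with the index of its parent paragraph."""
    chunks = []
    for index, paragraph in enumerate(paragraph_chunks(text)):
        for sentence in _SENTENCE_END.split(paragraph):
            sentence = sentence.strip()
            if sentence:
                chunks.append({"paragraph": index, "sentence": sentence})
    return chunks
```

Keeping the parent paragraph index on each sentence is one way to recover surrounding context at query time: retrieve on the sentence, then return the paragraph it belongs to.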
LLM-Driven (Agentic) Chunking
An LLM converts the document into atomic propositions (standalone statements that each convey a single fact), then groups those propositions into coherent chunks.
Pros: Highest semantic precision. Handles documents with irregular structure that confuse rule-based approaches.
Cons: Highest cost and latency. Requires a well-crafted extraction prompt. LLM output quality determines chunk quality.
Recommended for: High-value, low-volume corpora where ingestion cost is acceptable. Examples: financial earnings call transcripts, complex regulatory documents.
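The two steps (extract propositions, then group them) can be sketched as below. Everything here is an assumption for illustration: the prompts, the JSON response format, the label-based grouping, and the `complete` callable, which stands in for whatever LLM client you use. `fake_llm` returns made-up canned responses so the example runs offline.

```python
import json
from typing import Callable

EXTRACT_PROMPT = (
    "Rewrite the passage as a JSON array of standalone, single-fact statements. "
    "Resolve pronouns so each statement can be read alone.\n\nPassage:\n{passage}"
)
GROUP_PROMPT = (
    "Assign a short topic label to each statement. Return a JSON array of labels, "
    "one per statement, in the same order.\n\nStatements:\n{statements}"
)

def propositional_chunks(passage: str, complete: Callable[[str], str]) -> list[str]:
    """Extract atomic propositions with an LLM, then group them by LLM-assigned topic."""
    propositions = json.loads(complete(EXTRACT_PROMPT.format(passage=passage)))
    labels = json.loads(complete(GROUP_PROMPT.format(statements=json.dumps(propositions))))
    if len(labels) != len(propositions):
        raise ValueError("model returned a different number of labels than statements")

    groups: dict[str, list[str]] = {}
    for label, proposition in zip(labels, propositions):
        groups.setdefault(label, []).append(proposition)
    return [" ".join(items) for items in groups.values()]

def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call; the responses are invented sample data."""
    if prompt.startswith("Rewrite"):
        return json.dumps([
            "Acme revenue grew in the third quarter.",
            "Acme opened an office in Berlin.",
            "Acme operating margin improved in the third quarter.",
        ])
    return json.dumps(["financials", "expansion", "financials"])
```

A call such as `propositional_chunks(transcript, fake_llm)` exercises the flow end to end. With a real model, the `json.loads` calls are the fragile point, since chunk quality and even parseability depend on the model's output, as noted in the Cons.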