04. Key Terms and Variants
Top-k Retrieval
The simplest retrieval policy: return the k chunks with the highest similarity scores. k is a hyperparameter. Common values are 5, 10, or 20.
Higher k improves recall (the relevant passage is more likely to be included) but lengthens the prompt, raises generation cost, and may introduce irrelevant passages that confuse the LLM.
A practical pattern: retrieve top-20 from the index, rerank to top-5, send top-5 to the LLM.
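As a rough illustration of the basic policy, top-k selection over cosine similarity might look like the sketch below. The function names and the toy 2-d vectors are assumptions made for this example; a real system would take embeddings from a model and run the search inside a vector index rather than a Python loop.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Return (chunk_id, score) pairs for the k highest-scoring chunks."""
    scored = ((cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-d embeddings (hypothetical); real ones come from an embedding model.
chunks = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
top2 = top_k([1.0, 0.1], chunks, k=2)
print([cid for cid, _ in top2])
```

Raising k in this sketch only changes how many pairs come back; the cost shows up downstream, in the length of the prompt sent to the LLM.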
Maximal Marginal Relevance (MMR)
MMR is a post-retrieval diversity algorithm introduced by Carbonell and Goldstein (1998). It selects documents iteratively, balancing two objectives: relevance to the query and diversity from already-selected documents. A document has high marginal relevance if it is both relevant to the query and substantially different from the documents already in the result set.
This prevents retrieving five chunks that all say the same thing from different sections of the same source document. MMR is implemented in LangChain, Azure AI Search, and Elasticsearch.
MMR has a lambda parameter controlling the relevance-diversity tradeoff. Lambda = 1 is pure relevance (same as standard top-k). Lambda = 0 is pure diversity. Values around 0.5 to 0.7 are typical in RAG applications.
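The greedy selection loop can be sketched as follows. This is a minimal version that assumes cosine similarity for both the relevance and the redundancy term; the function names and toy vectors are illustrative and do not reflect any particular library's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=5, lam=0.7):
    """Greedily pick up to k doc ids, trading relevance against redundancy."""
    relevance = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}
    selected = []
    remaining = list(doc_vecs)
    while remaining and len(selected) < k:
        def score(d):
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[d], doc_vecs[s]) for s in selected), default=0.0
            )
            return lam * relevance[d] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors (hypothetical): a1 and a2 are near-duplicates, b is different.
docs = {"a1": [1.0, 0.0], "a2": [0.99, 0.05], "b": [0.6, 0.8]}
query = [1.0, 0.2]
print(mmr(query, docs, k=2, lam=1.0))  # pure relevance
print(mmr(query, docs, k=2, lam=0.5))  # balanced
```

With lam=1.0 the two near-duplicates are returned together; with lam=0.5 the second slot goes to the less relevant but different document.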
Reranking (Cross-Encoders)
First-stage retrieval (BM25, vector, or hybrid) relies on term-frequency statistics or bi-encoders, both of which process the query and the document independently. This is fast but imprecise: the query and document never attend to each other during scoring.
A cross-encoder reranker takes the query and each candidate document together as a single input, allowing the model to attend to the relationship between them. This produces more accurate relevance scores at the cost of higher latency.
The standard production pattern is a two-stage pipeline:
- First-stage retrieval: hybrid search over the full index, returning top-50 candidates (fast, approximate)
- Second-stage reranking: cross-encoder scores each of the top-50 against the query, promoting the best 5 (slower, precise)
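The two stages above can be sketched with pluggable scoring functions. Everything named here is an assumption for illustration: keyword_overlap stands in for the first-stage retriever and toy_reranker stands in for a cross-encoder, which in practice would be a model call that reads the query and document together.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]  # (query, document text) -> score


def two_stage_search(query: str, docs: Dict[str, str], first_stage: Scorer,
                     reranker: Scorer, n_candidates: int = 50,
                     n_final: int = 5) -> List[str]:
    """Cheap scorer over all docs, then an expensive scorer over the survivors."""
    # Stage 1: score every document with the fast scorer, keep the best few.
    candidates = sorted(
        docs, key=lambda d: first_stage(query, docs[d]), reverse=True
    )[:n_candidates]
    # Stage 2: rescore only the candidates with the slower, more precise scorer.
    reranked = sorted(
        candidates, key=lambda d: reranker(query, docs[d]), reverse=True
    )
    return reranked[:n_final]


def keyword_overlap(query: str, doc: str) -> float:
    """Stand-in first stage: number of shared lowercase terms."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_reranker(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: overlap normalized by document length."""
    return keyword_overlap(query, doc) / max(len(doc.split()), 1)


docs = {
    "d1": "reset your password from the account settings page",
    "d2": "password policy requires twelve characters",
    "d3": "shipping times for international orders",
}
best = two_stage_search("how do I reset my password", docs,
                        keyword_overlap, toy_reranker, n_candidates=2, n_final=1)
print(best)
```

The point of the split is that the reranker runs n_candidates times per query rather than once per document in the index, which keeps its cost bounded regardless of index size.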
Latency overhead for reranking is typically 50 to 200ms. Leading reranker models include Cohere Rerank v3.5, Voyage rerank-2.5, and open-weight models like bge-reranker-v2-m3.
As of August 2025, Voyage rerank-2.5 supports instruction-following reranking: you can prepend a natural-language instruction to steer the relevance judgment (for example, "prefer results with regulatory compliance citations"), which adds task-specific precision without retraining.
Learned Sparse Models
A middle ground between BM25 and dense retrieval. Models like SPLADE (Sparse Lexical and Expansion) use a transformer to assign term weights, enabling semantic expansion while maintaining sparse representation. SPLADE can match "car" to "automobile" via learned term weights while still using an inverted index for fast lookup.
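To make the "car" and "automobile" example concrete, the sketch below scores sparse term-weight vectors with a dot product over shared terms. The weights are made up for illustration; in a real learned sparse model they would be produced by the transformer, and the lookup would go through an inverted index rather than plain dictionaries.

```python
def sparse_score(query_weights, doc_weights):
    """Dot product over the terms that appear in both sparse vectors."""
    return sum(
        weight * doc_weights[term]
        for term, weight in query_weights.items()
        if term in doc_weights
    )


# Hypothetical expansion of the query "car": the model also assigns weight
# to related terms that never appeared in the query text.
query = {"car": 1.2, "automobile": 0.6, "vehicle": 0.4}

doc_auto = {"automobile": 1.0, "dealer": 0.8}  # never mentions "car"
doc_cat = {"cat": 1.1, "food": 0.9}

print(sparse_score(query, doc_auto))
print(sparse_score(query, doc_cat))
```

Plain BM25 would give doc_auto a score of zero for the query "car"; the expanded term weights are what let it match.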
ColBERT is a related but distinct approach: a late-interaction dense model rather than a sparse one. It stores one vector per token and scores with MaxSim: for each query token, take the maximum similarity over all document tokens, then sum those maxima. This fine-grained matching outperforms standard bi-encoders on some benchmarks. ColBERT is used in production by several enterprise search systems.
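MaxSim scoring can be sketched in a few lines: each query token vector is matched to its most similar document token vector, and the per-token maxima are summed. The sketch assumes the token vectors are already normalized, so a dot product serves as the similarity; the toy vectors are illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best match among the document tokens."""
    return sum(
        max(dot(q_vec, d_vec) for d_vec in doc_tokens)
        for q_vec in query_tokens
    )


# Toy per-token embeddings (hypothetical, unit-length or close to it).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0], [0.9, 0.1]]              # covers only the first

print(maxsim(query_tokens, doc_a))
print(maxsim(query_tokens, doc_b))
```

Because every query token must find its own match, a document that covers only part of the query scores lower than one that covers all of it.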
Query Expansion and Rewriting
Pre-retrieval transformations applied to the user query before it reaches the retrieval layer.
Query rewriting: An LLM rephrases the query to be more precise, resolve ambiguity, or use domain-specific terminology that better matches indexed document language.
Query expansion: The original query is supplemented with synonyms, related terms, or alternate phrasings. In multi-query retrieval (available in LangChain and LlamaIndex), the LLM generates 3 to 5 different query formulations, each is run through retrieval independently, and the results are merged by union or by reciprocal rank fusion (RRF).
A 2025 ACM paper on multi-query rewriting found that generating multiple expressions covering different potential interpretations of the query expanded search scope and improved comprehensiveness significantly over single-query retrieval.
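The merge step in multi-query retrieval can be sketched with reciprocal rank fusion. The ranked lists below are hypothetical results for three reformulations of one query, and k=60 is the constant commonly used with RRF, assumed here rather than taken from this text.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids using reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes more the higher the document ranks.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical retrieval results for three query formulations.
results_per_query = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d5", "d1"],
]
merged = rrf_merge(results_per_query)
print(merged[:3])
```

RRF only looks at ranks, not raw scores, so it can fuse lists from retrievers whose scores are not on a comparable scale.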
Hypothetical Document Embeddings (HyDE): The LLM generates a hypothetical answer to the query. That answer, rather than the query itself, is embedded and used for retrieval. The hypothetical answer is often a better match for the language of the indexed documents than the bare question.
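The HyDE flow can be sketched with three pluggable steps. All helper names below (toy_generate, toy_embed, toy_search) are hypothetical stand-ins: a real system would call an LLM, an embedding model, and a vector index in their place, and the prompt wording is also an assumption.

```python
def hyde_retrieve(query, generate, embed, search, k=5):
    """Embed an LLM-written hypothetical answer instead of the raw query."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), k)


VOCAB = ["refund", "days", "shipping", "policy"]


def toy_embed(text):
    """Stand-in embedding: term counts over a tiny fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def toy_generate(prompt):
    """Stand-in for an LLM call; returns a canned hypothetical answer."""
    return "refund requests are accepted within 30 days"


INDEX = {
    "refunds": toy_embed("refund policy allows returns within 30 days"),
    "shipping": toy_embed("shipping policy for international orders"),
}


def toy_search(vec, k):
    """Stand-in vector search: rank indexed docs by dot product."""
    def score(doc_id):
        return sum(x * y for x, y in zip(vec, INDEX[doc_id]))
    return sorted(INDEX, key=score, reverse=True)[:k]


question = "can I get my money back"
print(toy_embed(question))  # shares no vocabulary with the index
print(hyde_retrieve(question, toy_generate, toy_embed, toy_search, k=1))
```

In this toy setup the question shares no terms with the indexed text, while the hypothetical answer does, which is the gap HyDE is meant to close.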