Retrieval Methods

Making AI Useful 8 min read

In Short

Retrieval is the step in a RAG pipeline that finds the passages most likely to contain the answer to a query. There are two fundamental approaches: sparse retrieval (keyword matching via BM25) and dense retrieval (semantic similarity via vectors). Each fails where the other succeeds. Hybrid search combines both, and cross-encoder reranking applies a final precision pass. Together these form the current production standard.

01. What It Is

Retrieval is the bridge between a user query and the knowledge base. Its job is to return the top-k most relevant documents or chunks from an index, given a query, in milliseconds. The quality of the final LLM answer is bounded by retrieval quality. If the right passage is not retrieved, no amount of generation sophistication can recover the correct answer.

The two dominant retrieval paradigms are:

  • Sparse retrieval (lexical, keyword-based): represents documents as term-frequency vectors in a vocabulary-sized space. BM25 is the standard algorithm.
  • Dense retrieval (semantic, vector-based): represents documents and queries as dense, comparatively low-dimensional vectors produced by a neural encoder. Similarity is measured as cosine similarity or dot product.

02. Why It Matters

The failure modes of sparse and dense retrieval are complementary, which is why combining them works. BM25 excels at exact terminology and fails on paraphrased meaning. Dense retrieval excels at semantic similarity and fails on rare named entities and precise technical terms. A hybrid system captures both.

For production RAG systems, retrieval quality sets the ceiling on answer quality. A common measure is Recall@5, the fraction of relevant documents that appear in the top 5 results. A 2025 arXiv benchmark study ("From BM25 to Corrective RAG") found that a two-stage hybrid pipeline with neural reranking achieved Recall@5 of 0.816 and MRR@3 of 0.605, outperforming all single-stage methods by a large margin.

03. How It Works

Sparse Retrieval (BM25)

BM25 (Best Match 25) is a probabilistic ranking function from the Okapi family, developed in the 1990s. It scores documents by the overlap between query terms and document terms, weighting rare terms more heavily (inverse document frequency) and adjusting for term frequency saturation and document length.

Conceptually, it asks: "How often does this query term appear in this document, how rare is the term across the corpus, and is the document unusually long?"

Documents are stored in an inverted index: a mapping from each term in the vocabulary to the list of documents containing it, along with positional and frequency data. Query execution involves looking up each query term, retrieving its posting list, and computing BM25 scores in parallel.
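The inverted index and scoring loop described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: whitespace tokenization, the function names, and the toy corpus are assumptions, and the IDF form and defaults (k1 = 1.2, b = 0.75) follow one common variant of BM25.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    doc_len = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # assumption: naive whitespace tokenizer
        doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, doc_len


def bm25_search(query, index, doc_len, k1=1.2, b=0.75, top_k=5):
    """Score only documents that share at least one term with the query."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Rare terms (short posting lists) get a higher IDF weight.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer-than-average documents are discounted.
            norm = 1 - b + b * doc_len[doc_id] / avg_len
            # The tf term saturates: repeated occurrences add less and less.
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


docs = {
    "d1": "error XRT-4421 raised when the cache is full",
    "d2": "the car would not start this morning",
    "d3": "automobile maintenance guide for cold weather",
}
index, doc_len = build_index(docs)
results = bm25_search("XRT-4421 error", index, doc_len)
print(results)
```

Searching for "car" with this sketch returns only d2 and never d3, which is the vocabulary mismatch that the weaknesses list refers to.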

Strengths:

  • Exact term matching: precise on product names, error codes, function signatures, rare proper nouns
  • Extremely fast: sub-millisecond lookup on commodity hardware
  • Scales to billions of documents
  • Simple real-time updates without re-embedding
  • No GPU required

Weaknesses:

  • No semantic understanding: "automobile" and "car" have zero lexical overlap
  • Fails on paraphrases, synonyms, and conceptual queries
  • Sensitive to query vocabulary matching document vocabulary

Dense Retrieval (Vector Search)

An encoder model (typically a bi-encoder transformer, such as Cohere Embed v3, OpenAI text-embedding-3-large, or bge-m3) converts both documents and queries into dense vectors, usually several hundred to a few thousand dimensions (1024 for bge-m3, 3072 by default for text-embedding-3-large). Documents are embedded and indexed offline at ingestion time. Queries are encoded at request time.

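Retrieval is performed as an approximate nearest-neighbor (ANN) search over the vector index. Common ANN index structures include HNSW (Hierarchical Navigable Small World graphs) and IVF (inverted file) indexes, often combined with product quantization. FAISS and ScaNN are libraries that implement ANN search. Vector databases such as Pinecone, Weaviate, Chroma, Milvus, and Qdrant, as well as the pgvector extension for PostgreSQL, all support ANN search.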
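To make the search step concrete, here is a sketch of exact (brute-force) nearest-neighbor search by cosine similarity, which is the result an ANN index approximates at much lower cost. The three-dimensional vectors and document names are hypothetical stand-ins for real encoder output.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query_vec, doc_vecs, k=2):
    """Compare the query against every document vector: O(N) per query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical 3-dimensional embeddings; real encoders emit far more dimensions.
doc_vecs = {
    "car_repair": [0.9, 0.1, 0.0],
    "automobile_guide": [0.8, 0.2, 0.1],
    "cooking_tips": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
top = exact_top_k(query_vec, doc_vecs, k=2)
print([doc_id for doc_id, _ in top])
```

The linear scan is fine for a few thousand vectors. ANN indexes exist because it becomes too slow at millions of vectors, and they trade a small amount of recall for that speed.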

Strengths:

  • Captures semantic meaning: handles paraphrases, synonyms, and conceptual queries
  • Works in multilingual and cross-lingual settings
  • Generalizes across phrasings of the same underlying question

Weaknesses:

  • Fails on rare named entities and precise technical terms (the vector for "XRT-4421 error code" is nearly meaningless)
  • Black-box embeddings are hard to debug
  • Requires substantial RAM for ANN index serving, and usually a GPU (or a hosted API) for embedding at scale
  • Index rebuilds are expensive (hours for tens of millions of documents)
  • Model lock-in: vectors from different embedding models are not comparable, so switching models means re-embedding every document

Hybrid Search

Hybrid retrieval runs BM25 and dense vector search in parallel on the same query. Results from both are merged into a single ranked list.

The standard fusion algorithm is Reciprocal Rank Fusion (RRF). Each document receives a score from each retriever of 1 / (k + rank), where k is a smoothing constant (typically 60) and rank is the document's 1-based position in that retriever's result list. Scores are summed across retrievers. RRF works because it operates on ranks rather than raw scores, which avoids having to normalize BM25 scores (unbounded and corpus-dependent) against cosine similarity values (bounded between -1 and 1).
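The formula is short enough to show directly. The following is a minimal sketch that assumes each retriever returns an ordered list of document ids; the function name and sample rankings are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids; ranks start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd4', 'd2']
```

Here d1 wins because it ranks near the top of both lists (1/62 + 1/61), narrowly ahead of d3 (1/61 + 1/63). Documents seen by only one retriever still appear, but lower down.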

Hybrid search adds roughly 5 to 20ms of latency over dense-only retrieval and typically outperforms either method alone.

Practical guidance:

  • Technical corpora (code, product documentation, legal citations): use equal-weight hybrid
  • Prose-heavy corpora (narrative articles, general knowledge): use dense-dominant weighting with sparse as a safety net
  • When you observe retrieval failures on exact product names or codes: that is the signal to add BM25 if you have not already
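One way to express equal-weight versus dense-dominant fusion is to scale each retriever's reciprocal-rank contribution by a weight. This is a sketch of a weighted RRF variant, not a prescribed method: the function name and the 0.3 sparse weight are assumptions chosen for illustration and should be tuned on your own evaluation set.

```python
def weighted_rrf(rankings_with_weights, k=60):
    """Each entry is (ranked list of doc ids, weight for that retriever)."""
    scores = {}
    for ranking, weight in rankings_with_weights:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


sparse = ["code_ref", "faq", "blog"]   # hypothetical BM25 ranking
dense = ["blog", "code_ref", "faq"]    # hypothetical vector ranking

# Technical corpus: both retrievers count equally.
equal = weighted_rrf([(sparse, 1.0), (dense, 1.0)])
# Prose-heavy corpus: dense dominates, sparse acts as a safety net.
dense_dominant = weighted_rrf([(sparse, 0.3), (dense, 1.0)])

print(equal)           # ['code_ref', 'blog', 'faq']
print(dense_dominant)  # ['blog', 'code_ref', 'faq']
```

With equal weights the exact-match hit stays on top. Once the sparse weight is reduced, the dense retriever's first choice takes over while the sparse result remains in the list.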

04. Key Terms and Variants

Top-k Retrieval

The simplest retrieval policy: return the k chunks with the highest similarity scores. k is a hyperparameter. Common values are 5, 10, or 20.

Higher k improves recall (the relevant passage is more likely to be included) but lengthens the prompt, raises generation cost, and may introduce irrelevant passages that confuse the LLM.

A practical pattern: retrieve top-20 from the index, rerank to top-5, send top-5 to the LLM.

Maximum Marginal Relevance (MMR)

MMR is a post-retrieval diversity algorithm introduced by Carbonell and Goldstein (1998). It selects documents iteratively, balancing two objectives: relevance to the query and diversity from already-selected documents. A document has high marginal relevance if it is both relevant to the query and substantially different from the documents already in the result set.

This prevents retrieving five chunks that all say the same thing from different sections of the same source document. MMR is implemented in LangChain, Azure AI Search, and Elasticsearch.

MMR has a lambda parameter controlling the relevance-diversity tradeoff. Lambda = 1 is pure relevance (same as standard top-k). Lambda = 0 is pure diversity. Values around 0.5 to 0.7 are typical in RAG applications.
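The iterative selection can be sketched as follows. This is a minimal illustration that assumes precomputed embedding vectors and cosine similarity for both relevance and redundancy; the two-dimensional vectors and chunk names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lam=0.5):
    """Greedily pick k documents, trading relevance against redundancy."""
    selected = []
    candidates = list(doc_vecs)
    while candidates and len(selected) < k:
        def marginal(doc_id):
            relevance = cosine(query_vec, doc_vecs[doc_id])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=marginal)
        selected.append(best)
        candidates.remove(best)
    return selected


doc_vecs = {
    "intro": [1.0, 0.0],
    "intro_paraphrase": [0.99, 0.05],  # near-duplicate of "intro"
    "appendix": [0.6, 0.8],            # less relevant but different
}
query_vec = [1.0, 0.3]

print(mmr(query_vec, doc_vecs, k=2, lam=1.0))  # ['intro_paraphrase', 'intro']
print(mmr(query_vec, doc_vecs, k=2, lam=0.5))  # ['intro_paraphrase', 'appendix']
```

With lambda = 1 the two near-duplicates fill both slots. At lambda = 0.5 the second slot goes to the less relevant but non-redundant chunk.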

Reranking (Cross-Encoders)

The first-stage retrieval (BM25, vector, or hybrid) uses bi-encoders or term-frequency statistics that process query and document independently. This is fast but imprecise: the query and document are not attending to each other during scoring.

A cross-encoder reranker takes the query and each candidate document together as a single input, allowing the model to attend to the relationship between them. This produces more accurate relevance scores at the cost of higher latency.

The standard production pattern is a two-stage pipeline:

  1. First-stage retrieval: hybrid search over the full index, returning top-50 candidates (fast, approximate)
  2. Second-stage reranking: cross-encoder scores each of the top-50 against the query, promoting the best 5 (slower, precise)
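The shape of the two-stage pipeline can be sketched with pluggable scoring functions. Everything here is a toy under stated assumptions: `term_overlap` stands in for first-stage retrieval (a real system would query an index rather than scan the corpus), and `toy_pair_score` is a placeholder for a cross-encoder model, not a real reranker.

```python
def two_stage_search(query, corpus, first_stage_score, rerank_score,
                     n_candidates=50, top_k=5):
    """corpus maps doc_id -> text; both scorers take (query, text)."""
    # Stage 1: cheap score over every document, keep a wide candidate pool.
    candidates = sorted(
        corpus, key=lambda d: first_stage_score(query, corpus[d]), reverse=True
    )[:n_candidates]
    # Stage 2: expensive score over the candidates only, keep the best few.
    reranked = sorted(
        candidates, key=lambda d: rerank_score(query, corpus[d]), reverse=True
    )
    return reranked[:top_k]


def term_overlap(query, text):
    # Toy first stage: number of distinct query terms present in the text.
    return len(set(query.lower().split()) & set(text.lower().split()))


def toy_pair_score(query, text):
    # Toy stand-in for a cross-encoder: it sees query and text together,
    # so it can reward the exact phrase, which term_overlap cannot.
    phrase_bonus = 2 if query.lower() in text.lower() else 0
    return term_overlap(query, text) + phrase_bonus


corpus = {
    "d1": "reset email for password not arriving",
    "d2": "password reset steps from the login page",
    "d3": "password policy and rotation schedule",
    "d4": "billing and invoices overview",
}
top = two_stage_search("password reset", corpus, term_overlap, toy_pair_score,
                       n_candidates=3, top_k=2)
print(top)  # ['d2', 'd1']
```

The first stage ties d1 and d2 because both contain the two query terms. The second stage breaks the tie by looking at the query and document jointly, and it never scores d4, which is where the latency saving comes from.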

Latency overhead for reranking is typically 50 to 200ms. Leading reranker models include Cohere Rerank v3.5, Voyage rerank-2.5, and open-weight models like bge-reranker-v2-m3.

In August 2025, Voyage rerank-2.5 introduced instruction-following reranking: you can prepend a natural-language instruction to steer relevance judgment (for example, "prefer results with regulatory compliance citations"), adding task-specific precision without retraining.

Learned Sparse Models

A middle ground between BM25 and dense retrieval. Models like SPLADE (Sparse Lexical and Expansion model) use a transformer to assign weights to vocabulary terms, including terms that do not appear in the text, which enables semantic expansion while keeping the representation sparse. SPLADE can match "car" to "automobile" via learned term expansion while still using an inverted index for fast lookup.

ColBERT is a related but distinct approach: a late-interaction dense model rather than a learned sparse one. It stores one vector per token and scores with MaxSim: for each query token, take the maximum similarity over all document tokens, then sum those maxima. This fine-grained matching outperforms standard bi-encoders on some benchmarks. ColBERT is used in production by several enterprise search systems.
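MaxSim, as used in ColBERT-style late interaction, takes for each query token the best-matching document token and sums those maxima over the query tokens. A minimal sketch follows, assuming token vectors are already computed and normalized; the two-dimensional vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_token_vecs, doc_token_vecs):
    """Sum over query tokens of the highest similarity to any document token."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


# Hypothetical per-token embeddings: one vector per token.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # covers both query tokens
doc_b = [[1.0, 0.0], [0.9, 0.1]]              # covers only the first

print(maxsim_score(query_tokens, doc_a))  # 2.0
print(maxsim_score(query_tokens, doc_b))
```

Because every query token must find its own match, doc_b is penalized for missing the second token even though it matches the first one perfectly. A single pooled vector per document can blur that distinction.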

Query Expansion and Rewriting

Pre-retrieval transformations applied to the user query before it reaches the retrieval layer.

Query rewriting: An LLM rephrases the query to be more precise, resolve ambiguity, or use domain-specific terminology that better matches indexed document language.

Query expansion: The original query is supplemented with synonyms, related terms, or alternate phrasings. In multi-query retrieval (LangChain / LlamaIndex), the LLM generates 3 to 5 different query formulations, each is run through retrieval independently, and results are merged by union or RRF.
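A sketch of multi-query retrieval with an RRF merge is shown here. The LLM call and the retriever are replaced by hypothetical stand-ins (`fake_variants`, `fake_retrieve`, and the canned results), so only the control flow is meant literally; this is not the LangChain or LlamaIndex API.

```python
from collections import defaultdict


def multi_query_retrieve(query, generate_variants, retrieve, top_k=3, rrf_k=60):
    """Run the original query plus its variants, then fuse results with RRF."""
    queries = [query] + generate_variants(query)
    scores = defaultdict(float)
    for q in queries:
        for rank, doc_id in enumerate(retrieve(q), start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:top_k]


def fake_variants(query):
    # Stand-in for an LLM that rephrases the query.
    return ["how do I change my login credentials", "password reset steps"]


canned_results = {
    "forgot password": ["d1", "d2"],
    "how do I change my login credentials": ["d3", "d1"],
    "password reset steps": ["d1", "d3", "d4"],
}


def fake_retrieve(query):
    # Stand-in for BM25, vector, or hybrid retrieval.
    return canned_results.get(query, [])


merged = multi_query_retrieve("forgot password", fake_variants, fake_retrieve)
print(merged)  # ['d1', 'd3', 'd2']
```

Document d3 is never returned for the original query, yet it ranks second after fusion because two of the reformulations surface it. That is the recall gain multi-query retrieval aims for.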

A 2025 ACM paper on multi-query rewriting found that generating multiple expressions covering different potential interpretations of the query expanded search scope and improved comprehensiveness significantly over single-query retrieval.

Hypothetical Document Embeddings (HyDE): The LLM generates a hypothetical answer to the query. That answer is embedded and used for retrieval in place of the query itself. Even when the hypothetical answer is factually wrong, its wording often matches indexed document language better than the bare question does.

05. Evaluating Retrieval Quality

Treating retrieval as a black box is one of the most common mistakes in RAG development. Track these metrics:

  • Recall@k: What fraction of the relevant documents appear in the top-k results? The primary metric for retrieval coverage.
  • Precision@k: Of the top-k results, what fraction are actually relevant? Measures noise.
  • MRR (Mean Reciprocal Rank): The average, over all queries, of 1 / rank of the first relevant document (0 if none is retrieved). Measures how quickly the system surfaces the right answer.
  • NDCG (Normalized Discounted Cumulative Gain): Accounts for graded relevance, not just binary relevant/irrelevant.
  • Faithfulness and answer relevance: Downstream metrics measuring whether the LLM's final answer is grounded in the retrieved context and answers the original question.
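The ranking metrics above are simple to compute once you have labeled relevant documents per query. This is a minimal sketch with hypothetical document ids and relevance grades; the NDCG here uses the linear-gain form, which is one of several common variants.

```python
import math


def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / k


def reciprocal_rank(retrieved, relevant):
    # 1 / rank of the first relevant document, or 0 if none is retrieved.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, gains, k):
    """gains maps doc_id -> graded relevance (0 for unlisted documents)."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


retrieved = ["d7", "d2", "d9", "d4", "d1"]   # system output for one query
relevant = {"d2", "d4", "d5"}                # labeled relevant documents
gains = {"d2": 3, "d5": 2, "d4": 1}          # graded relevance labels

print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(reciprocal_rank(retrieved, relevant))    # 0.5
print(ndcg_at_k(retrieved, gains, 5))

# MRR is the mean of reciprocal ranks over all evaluation queries.
runs = [(retrieved, relevant), (["d5", "d8"], {"d5"})]
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(mrr)  # 0.75
```

Running `recall_at_k` for several values of k on the same labeled set is a cheap way to choose k from data.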

RAGAS is a widely used open-source framework for evaluating RAG pipeline quality across these dimensions.

If you do not have labeled query-document pairs for evaluation, LLMs can be used to generate synthetic evaluation sets from your corpus, bootstrapping an evaluation harness before any real user queries are available.

06. Common Pitfalls

Vector-only retrieval on technical corpora:
Product names, error codes, version numbers, and function names require exact lexical matching. BM25 must be in the stack.

No reranking:
First-stage retrieval optimizes for recall. Reranking is required for precision. Without it, the top-k chunks sent to the LLM include significant noise.

Fixed top-k without evaluation:
Top-5 may be right for one query distribution and wrong for another. Measure Recall@k on your specific corpus before committing to a k value.

No query transformation:
User queries are often short, ambiguous, and poorly matched to document language. Query rewriting or multi-query expansion often improves recall. Budget for the extra LLM call, which adds latency before retrieval can start.

Treating the embedding model as a constant:
General-purpose embedding models underperform on domain-specific language. Fine-tuning or selecting a domain-adapted embedding model (such as a legal or biomedical embedding model) can substantially improve retrieval quality. When labeled data is unavailable, synthetic query-document pairs generated by an LLM can bootstrap embedding fine-tuning.