Retrieval-Augmented Generation (RAG)

Making AI Useful 8 min read

In Short

RAG is a framework that connects a language model to an external knowledge source at inference time, letting it retrieve relevant documents before generating a response. It was introduced by Lewis et al. (2020) to solve three problems that all LLMs share: a frozen knowledge cutoff, a tendency to hallucinate confident-sounding falsehoods, and no access to private or proprietary data.

RAG prepares a knowledge corpus once through an ingestion pipeline, then answers each user question through a query pipeline that retrieves relevant chunks and grounds the language model's response in them.

01. What It Is

Retrieval-Augmented Generation combines two components that operate at query time rather than at training time. The first is a retrieval system, typically a vector database, that finds relevant passages from a corpus. The second is a language model that uses those passages as grounding context before producing an answer.

The original paper, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Patrick Lewis, Ethan Perez, et al. (Facebook AI Research, 2020), demonstrated that pairing a Dense Passage Retrieval (DPR) encoder with a BART generator achieved state-of-the-art results on open-domain question answering benchmarks including Natural Questions, TriviaQA, and WebQuestions, with improvements of 2.8 to 7.9 percentage points over prior methods.

The key insight was framing LLMs as having two kinds of memory: parametric memory (the knowledge baked into model weights during training) and non-parametric memory (dynamically retrieved external data). RAG makes the non-parametric memory swappable and updatable without retraining.

02. Why It Matters

Three specific failure modes motivate RAG.

Knowledge cutoff:
Every LLM is trained on a snapshot of the web or curated data up to a fixed date. Ask it about events after that date and it either admits ignorance or, more dangerously, fabricates a plausible-sounding answer. RAG sidesteps this by retrieving live or recently updated content at query time. The model's weights stay frozen; the knowledge base is updated separately.

Hallucination:
LLMs generate text probabilistically. When the model lacks the information it needs, it still produces the most statistically likely continuation, which may be factually wrong. Grounding the generation in retrieved passages gives the model a "cheat sheet" of source material to draw from. Reported gains vary with the task and the evaluation method; some studies cite reductions in hallucination rates of up to 50% for advanced RAG systems compared to baseline LLM responses.

Private and proprietary data:
Enterprise knowledge, internal documentation, customer records, and research findings never appear in public training data. RAG allows organizations to index this data into a private vector store and connect it to an LLM without the documents being used to train a shared model. The corpus itself stays inside their infrastructure; only the chunks retrieved for a given query are placed in the prompt, and with a self-hosted model even those never leave. This addresses both the knowledge gap and much of the data-privacy concern in a single architecture.

03. How It Works

The Ingestion Pipeline

Before any query is served, the knowledge corpus is prepared:

  1. Load. Documents are collected from sources: PDFs, wikis, databases, emails, web pages, code repositories.
  2. Clean. Formatting artifacts, duplicate content, and low-signal text are removed.
  3. Chunk. Documents are split into smaller segments. The chunking strategy (fixed-size, recursive, semantic, or structure-based) directly determines retrieval quality. Typical target sizes range from 256 to 1024 tokens depending on use case.
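  4. Embed. Each chunk is passed through an embedding model (such as OpenAI text-embedding-3-large, Cohere Embed v3, or an open-weight model like bge-m3) that converts it to a dense vector, typically of several hundred to a few thousand dimensions (for example, 1024 for bge-m3 and 3072 by default for text-embedding-3-large).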
  5. Store. Vectors are written to a vector database (Pinecone, Weaviate, Chroma, Milvus, Qdrant, or pgvector) along with the original text and metadata fields such as source URL, document title, section heading, and page number.

This pipeline runs once at ingestion and again whenever the corpus is updated. It does not involve the generative LLM at all.
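
The five ingestion steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes words into a small fixed-size vector), a plain list stands in for the vector database, chunking is one chunk per paragraph, and the sample document is invented.

```python
import math
import re
import zlib

def clean(text: str) -> str:
    """Normalise whitespace inside paragraphs but keep blank-line paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(re.sub(r"\s+", " ", p).strip() for p in paragraphs if p.strip())

def chunk(text: str) -> list[str]:
    """Structure-based chunking: one chunk per paragraph."""
    return text.split("\n\n") if text else []

def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical embedding: hash each word into a bucket, then L2-normalise."""
    vec = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def ingest(documents: dict[str, str]) -> list[dict]:
    """Load -> clean -> chunk -> embed -> store (here: append to a list)."""
    store = []
    for source, raw in documents.items():
        for n, piece in enumerate(chunk(clean(raw))):
            store.append({
                "source": source,      # metadata kept for citations
                "chunk_id": n,
                "text": piece,
                "vector": toy_embed(piece),
            })
    return store

# Invented sample document with messy whitespace.
store = ingest({
    "refund-policy.txt": "Refunds are  issued within 30 days.\n\nContact support to start a claim."
})
```

Note that nothing here calls a generative model: ingestion only needs the embedding model and the store.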

The Query Pipeline

Every user query triggers this sequence:

  1. Receive query. The user's question or instruction arrives.
  2. Pre-retrieval (optional in advanced RAG). The query may be rewritten for clarity, expanded with synonyms, or decomposed into sub-questions before retrieval.
  3. Embed the query. The same embedding model used during ingestion converts the query into a vector.
  4. Retrieve. The vector database performs approximate nearest-neighbor (ANN) search, returning the top-k most semantically similar chunks. In hybrid RAG, a BM25 keyword search runs in parallel, and the two result lists are merged, commonly with Reciprocal Rank Fusion (RRF).
  5. Rerank (optional). A cross-encoder model scores each retrieved chunk against the query for precise relevance, promoting the best candidates and discarding noise.
  6. Augment. Retrieved chunks are assembled into a context block and inserted into the LLM prompt using a template such as: "Answer the question using only the information in the CONTEXT below. CONTEXT: {chunks}. QUESTION: {query}."
  7. Generate. The LLM produces a response grounded in the provided context rather than relying solely on its training weights.
  8. Return. The response, optionally with source citations, is returned to the user.
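
The core of this sequence (embed, retrieve, augment) can be sketched as follows. It is a simplified illustration: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, retrieval is an exact scan rather than ANN search, the prompt template is adapted from step 6, the corpus is invented, and the final call to the LLM is left out.

```python
import math
import re
from collections import Counter

PROMPT_TEMPLATE = (
    "Answer the question using only the information in the CONTEXT below.\n"
    "CONTEXT: {chunks}\n"
    "QUESTION: {query}"
)

def toy_embed(text: str) -> Counter:
    """Hypothetical embedding: a sparse word-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (exact scan, not ANN)."""
    q = toy_embed(query)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Augment step: place the retrieved chunks into the prompt template."""
    context = "\n---\n".join(retrieve(query, chunks, k))
    return PROMPT_TEMPLATE.format(chunks=context, query=query)

# Invented corpus for illustration.
corpus = [
    "A refund is issued within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "To request a refund, contact support with your order number.",
]
prompt = build_prompt("How do I get a refund?", corpus)
```

The resulting `prompt` string is what would be sent to the language model in step 7; the unrelated chunk about office hours never reaches it.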

04. Key Terms and Variants

Naive RAG

The baseline implementation: one retrieval pass using vector similarity, top-k chunks fed directly to the LLM. Fast to build, sufficient for simple factual queries, but degrades on ambiguous questions, domain-specific terminology, and queries requiring multi-step reasoning.

Advanced RAG

Adds multiple refinement layers around the core pipeline:

  • Query rewriting to resolve ambiguity and improve domain-specific matching
  • Hybrid retrieval combining dense vectors with sparse BM25 keyword search
  • Reranking using a cross-encoder for post-retrieval precision
  • Context compression to remove redundant or low-value passages before generation
  • Feedback loops that allow chunks to be scored and improved over time

Advanced RAG is the recommended production default for most applications. For the majority of queries it offers a better balance of cost and quality than either naive RAG or agentic RAG.
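
Hybrid retrieval needs a way to merge a dense ranking and a sparse ranking whose scores are not comparable. Reciprocal Rank Fusion does this using ranks alone: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The sketch below assumes the commonly used constant k = 60; the chunk IDs are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense = ["c3", "c1", "c7"]    # hypothetical vector-search order
sparse = ["c1", "c9", "c3"]   # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
# fused == ["c1", "c3", "c9", "c7"]
```

Chunks that appear in both lists (c1, c3) rise to the top, which is the behaviour hybrid retrieval is after.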

Agentic RAG

Replaces the fixed one-pass pipeline with an autonomous agent loop. The agent:

  • Decides whether the retrieved context is sufficient
  • Decomposes complex questions into sub-queries
  • Performs multi-hop retrieval (each result informs the next query)
  • Validates retrieved content for contradictions and relevance
  • Routes queries to different tools: vector stores, SQL databases, web search, APIs, code execution

Frameworks supporting agentic RAG include LangGraph (most mature for production), LlamaIndex Agents, Microsoft AutoGen, and CrewAI.

Agentic RAG costs 3 to 10 times more in tokens than advanced RAG. It is only justified for hard multi-step reasoning questions or cross-source synthesis tasks where standard retrieval fails.
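
The control flow behind that cost can be sketched as a loop with a step budget. Everything here is a simplified assumption: in a real system `decompose` and `is_sufficient` would be LLM calls (which is where the extra tokens go), each sub-query doubles as the name of the source to consult, and the data is invented.

```python
from typing import Callable

def agentic_answer(
    question: str,
    decompose: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    max_steps: int = 4,
) -> dict:
    """Retrieve per sub-query until the evidence is judged sufficient or the budget runs out."""
    evidence: list[str] = []
    pending = decompose(question)
    steps = 0
    while pending and steps < max_steps:
        sub_query = pending.pop(0)
        evidence.extend(retrieve(sub_query))
        steps += 1
        if is_sufficient(question, evidence):
            break
    return {"evidence": evidence, "steps": steps}

# Hypothetical stubs standing in for LLM calls and real data sources.
SOURCES = {
    "renewals": ["Acme renewed in Q4", "Globex renewed in Q4"],
    "tickets": ["Acme had 2 open tickets at renewal"],
}
result = agentic_answer(
    "Which customers renewed in Q4 with open tickets?",
    decompose=lambda q: ["renewals", "tickets"],
    retrieve=lambda sub: SOURCES.get(sub, []),
    is_sufficient=lambda q, ev: len(ev) >= 3,
)
```

Each pass through the loop is at least one retrieval plus one model judgment, so the `max_steps` budget is the main lever on cost.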

Graph RAG

A variant that stores knowledge in a graph structure (nodes are entities, edges are relationships) rather than a flat vector index. Useful for queries that require traversing relationships, such as "what companies are connected to this person through board membership." Microsoft's GraphRAG research (2024) demonstrated gains on community-level summarization and multi-hop reasoning tasks.
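
The board-membership question is a graph traversal rather than a similarity search. A minimal sketch, assuming knowledge is stored as (subject, relation, object) triples and using invented people and companies, is a breadth-first search that records how many hops away each company is:

```python
from collections import deque

# Hypothetical triples: (person, relation, company).
EDGES = [
    ("Alice", "board_member_of", "Acme"),
    ("Bob", "board_member_of", "Acme"),
    ("Bob", "board_member_of", "Globex"),
    ("Carol", "board_member_of", "Globex"),
    ("Carol", "board_member_of", "Initech"),
]

def connected_companies(person: str, edges: list[tuple[str, str, str]], max_hops: int = 3) -> dict[str, int]:
    """Breadth-first search from a person; returns {company: hop distance}."""
    adjacency: dict[str, set[str]] = {}
    for subject, _relation, obj in edges:
        adjacency.setdefault(subject, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subject)
    companies = {obj for _, _, obj in edges}

    found: dict[str, int] = {}
    seen = {person}
    queue = deque([(person, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                if neighbour in companies:
                    found[neighbour] = dist + 1
                queue.append((neighbour, dist + 1))
    return found

links = connected_companies("Alice", EDGES, max_hops=3)
# links == {"Acme": 1, "Globex": 3}
```

Alice sits on Acme's board directly and reaches Globex only through Bob; a flat vector index has no way to express that chain.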

Modalities

RAG is not limited to text. Multimodal RAG retrieves images, audio transcripts, tables, or structured data alongside text chunks, allowing the generative model to reason across modalities.

05. Examples as Pipeline Diagrams

Simple factual query: User asks "What is our refund policy?" The query is embedded, the policy document chunks are retrieved from the company vector store, those chunks are inserted into the prompt, and the LLM produces an answer citing the exact policy text.

Multi-hop query (agentic): User asks "Which of our enterprise customers renewed in Q4 and had open support tickets at renewal time?" The agent decomposes this into two sub-queries, retrieves from the CRM database and the support ticket system separately, synthesizes the overlap, and returns a reasoned answer.

Knowledge cutoff bypass: User asks about a regulatory change from last month. A web search tool is called at retrieval time, fresh content is retrieved, and the LLM answers with current information the training data never contained.

06. Long Context vs RAG

As context windows grew into the millions of tokens (Gemini 1.5 Pro reached 1M in early 2024, and Llama 4 was announced at 10M), a reasonable question arose. Why retrieve anything if you can paste an entire knowledge base straight into the prompt? For a single static document that fits the window, this often works well, and it can beat retrieval on questions that need the whole picture at once.

RAG still wins in most production settings for three reasons.

Cost and latency:
Sending millions of tokens on every query is slow and expensive, while retrieval passes along only the handful of passages that matter. Google researchers (Li et al., 2024) found that long-context models edge out RAG on average accuracy when fully resourced, yet RAG stays dramatically cheaper.

Freshness and scale:
A corpus that changes daily, or one far larger than any window, cannot be re-pasted on every call. The same knowledge-cutoff and private-data problems described above still apply.

Citations:
Retrieval returns named sources, so an answer can be traced back to a document.

There is also a quality catch. Chroma's 2025 "Context Rot" report tested 18 leading models and found that accuracy degrades steadily as the input grows, even on simple tasks. This echoes the earlier lost-in-the-middle finding, now at much larger scale: more tokens do not guarantee more reliable understanding.

The practical answer is both. Hybrid systems route straightforward questions to focused retrieval and reserve the full window for the cases that genuinely need it. The "Self-Route" method from Li et al. does exactly that, reaching accuracy comparable to long context at a fraction of the cost.
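
A routing layer of this kind can be sketched as below. It loosely follows the Self-Route idea (try the cheap retrieval prompt first, let the model decline, and fall back to the full context only then), but the prompt wording, the `fake_llm` stub, and the sample text are all assumptions made for illustration.

```python
from typing import Callable

UNANSWERABLE = "unanswerable"

def self_route(query: str, retrieved_chunks: list[str], full_context: str,
               llm: Callable[[str], str]) -> dict:
    """Try RAG first; fall back to the long-context prompt if the model declines."""
    rag_prompt = (
        "Answer the query using only the text below. "
        f"Write '{UNANSWERABLE}' if the text does not contain the answer.\n"
        f"TEXT: {' '.join(retrieved_chunks)}\nQUERY: {query}"
    )
    first_try = llm(rag_prompt)
    if UNANSWERABLE not in first_try.lower():
        return {"route": "rag", "answer": first_try}
    long_prompt = f"TEXT: {full_context}\nQUERY: {query}"
    return {"route": "long_context", "answer": llm(long_prompt)}

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: answers only if the fact is in the prompt."""
    return "30 days" if "within 30 days" in prompt else UNANSWERABLE

full_context = "Shipping takes 5 days. Refunds are issued within 30 days. Support is open weekdays."
hit = self_route("What is the refund window?", ["Refunds are issued within 30 days."], full_context, fake_llm)
miss = self_route("What is the refund window?", ["Shipping takes 5 days."], full_context, fake_llm)
```

Only the `miss` case pays for the long prompt, which is where the cost saving comes from when most queries are answerable from retrieved chunks.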

07. Common Pitfalls

Chunk size mismatch:
Chunks too small lose semantic context; chunks too large dilute the vector representation. Start with 512 tokens and 10 to 20 percent overlap, then benchmark against your specific query distribution.
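
As a starting point, fixed-size chunking with overlap looks like the sketch below. It assumes the text has already been tokenized into a list (a real pipeline would use the embedding model's own tokenizer) and uses a 15 percent overlap, the midpoint of the range above.

```python
def chunk_with_overlap(tokens: list[str], size: int = 512, overlap_ratio: float = 0.15) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap_ratio` of their length."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    overlap = int(size * overlap_ratio)
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = [f"t{i}" for i in range(1200)]   # placeholder tokens
chunks = chunk_with_overlap(tokens)
# 3 chunks of 512, 512 and 328 tokens; neighbours share 76 tokens
```

Keeping `size` and `overlap_ratio` as parameters makes it straightforward to sweep several settings against an evaluation set instead of committing to one.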

Wrong embedding model:
General-purpose embeddings underperform on domain-specific language. Fine-tuned or domain-adapted embedding models substantially improve recall on technical, legal, or medical corpora.

Vector-only retrieval missing exact terms:
Product names, error codes, and rare proper nouns require exact keyword matching. Sparse retrieval (BM25) handles these cases. Using vector search alone means losing queries with precise terminology.
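
To see why sparse retrieval catches exact terms, here is a compact BM25 scorer. It is a sketch using the common Okapi formulation with a non-negative IDF and typical defaults (k1 = 1.5, b = 0.75); the tokenizer, the error codes, and the documents are invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and keep hyphenated codes such as 'e-4012' as single tokens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "Error E-4012 means the payment gateway timed out.",
    "Error E-4021 means the card was declined.",
    "Restart the service if a generic error appears.",
]
scores = bm25_scores("E-4012 error", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

The rare token "e-4012" carries a much higher IDF than the ubiquitous "error", so the first document wins clearly, whereas the two near-identical error codes may look almost the same to an embedding model.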

Garbage in, garbage out:
Document cleaning and metadata preservation matter as much as retrieval strategy. Chunks stripped of headings and page numbers lose the signals that help both retrieval ranking and citation quality.

No evaluation:
Building a RAG pipeline without a retrieval evaluation harness (measuring Recall@k, MRR, faithfulness, and answer relevance) makes systematic improvement impossible. Treat retrieval quality as a measurable metric from day one.
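
The two retrieval metrics named above are simple enough to compute by hand. The sketch below assumes a labelled evaluation set in which each query has a known set of relevant chunk IDs; the IDs are invented. Faithfulness and answer relevance judge the generated text and usually need a human or model grader, so they are not covered here.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 3) -> dict[str, float]:
    """Average Recall@k and MRR over (retrieved, relevant) pairs."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }

runs = [
    (["c1", "c4", "c2"], {"c1", "c2"}),   # both relevant chunks found, first hit at rank 1
    (["c9", "c7", "c3"], {"c3", "c5"}),   # one of two found, first hit at rank 3
]
metrics = evaluate(runs)
# recall@3 = (1.0 + 0.5) / 2 = 0.75, mrr = (1 + 1/3) / 2
```

Re-running the same evaluation after each change to chunking, embeddings, or reranking turns tuning into a comparison of numbers rather than impressions.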

Skipping reranking:
The embedding model's top-k results are ordered by semantic similarity, not by answer quality. Running a cross-encoder reranker over the top 20 results before they are sent to the LLM significantly improves final answer quality for modest additional latency (50 to 200 ms).
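
Structurally, reranking is a second scoring pass over a short candidate list. In the sketch below the scorer is pluggable: `toy_score` is a hypothetical stand-in that counts shared words, where a real system would call a cross-encoder model on each (query, passage) pair. The passages are invented.

```python
import re
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Re-score each candidate against the query and keep the best top_n."""
    scored = [(score_pair(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]

def toy_score(query: str, passage: str) -> float:
    """Hypothetical scorer: fraction of query words that occur in the passage."""
    q = set(re.findall(r"\w+", query.lower()))
    p = set(re.findall(r"\w+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0

# Candidates in the order the first-stage retriever returned them.
candidates = [
    "Password policy requires 12 characters.",
    "Admins can reset a password from the console.",
    "Reset the admin password under Settings.",
]
best = rerank("reset admin password", candidates, toy_score, top_n=2)
```

Because the scorer sees the query and the passage together, it can promote the third candidate over the first even though the first-stage retriever ranked them the other way round.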

Over-engineering early:
Agentic RAG is not the right starting point. The recommended progression is: naive RAG to validate the concept, advanced RAG with hybrid retrieval and reranking as the production default, and agentic RAG only when multi-step reasoning is genuinely required and the cost is justified.