Vector Databases

In Short

A vector database stores high-dimensional numeric vectors (embeddings) and retrieves the ones most similar to a query vector, which is the retrieval step at the heart of most RAG pipelines. Conventional databases compare rows by exact match or range. Vector databases compare points by geometric proximity, at scale, in milliseconds.

01. What It Is

An embedding is a dense numeric array that represents meaning. An embedding model converts a sentence, image, or document chunk into a vector of a few hundred to a few thousand floating-point numbers (768 and 1536 are common sizes, depending on the model), where semantically related inputs land close together in that high-dimensional space.
See the Embeddings file for the full treatment.

A vector database is a storage and retrieval system built specifically to serve queries of the form: "given this query vector, return the k stored vectors most similar to it." That operation is called k-nearest neighbor (k-NN) search. It underpins semantic search, RAG retrieval, recommendation systems, deduplication, anomaly detection, and agent memory.

A relational database like Postgres stores rows and retrieves them by matching column values exactly or via range predicates. It has no concept of "close to." A vector database's entire indexing layer is built to answer proximity queries efficiently. Some relational databases have added vector extensions (notably Postgres via pgvector), but a purpose-built vector database optimizes the storage format, indexing structure, and query path entirely around this one operation.

02. Why It Matters

Without an index, semantic retrieval requires comparing a query embedding against every stored embedding one by one. At 10 million float32 vectors of 1536 dimensions each, that is around 60 GB of data to scan per query. Brute-force search at that scale takes seconds, not milliseconds, which is too slow for production RAG systems. Vector databases solve this with approximate nearest neighbor (ANN) index structures that reduce search time to single-digit milliseconds at the cost of a small, controllable loss in accuracy.
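The 60 GB figure is plain arithmetic: vectors times dimensions times bytes per value. A minimal sketch, assuming float32 storage (4 bytes per value) and ignoring index and metadata overhead; the function name is illustrative only.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone (no index, no metadata)."""
    return num_vectors * dims * bytes_per_value


size = raw_vector_bytes(10_000_000, 1536)
print(f"{size / 1e9:.1f} GB")  # 61.4 GB
```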

They also add the surrounding infrastructure: persistence, metadata storage, filtering, updates, replication, and API layers that raw search libraries lack.

03. How It Works

Exact vs. approximate nearest neighbor

Exact k-NN computes the true nearest neighbors by comparing the query to every vector. Accuracy is 100%. Cost grows linearly with the number of stored vectors, so latency does not hold up at scale.
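For reference, exact k-NN is only a few lines. The sketch below assumes cosine similarity as the metric and a small in-memory dict as the store; the names (exact_knn, store, the doc ids) are made up for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine_similarity(query, vec), vid) for vid, vec in vectors.items())
    return [(vid, score) for score, vid in heapq.nlargest(k, scored)]


store = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
    "doc-d": [0.0, 0.0, 1.0],
}
for vid, score in exact_knn([1.0, 0.05, 0.0], store, k=2):
    print(vid, round(score, 4))
```

Every query touches every vector, which is exactly the linear cost that ANN indexes are designed to avoid.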

Approximate nearest neighbor (ANN) builds an index structure that enables the search to skip most of the vector space and examine only a promising subset. Recall (what fraction of the true top-k are returned) is typically 95-99%, which is sufficient for retrieval tasks. The two dominant ANN index types are HNSW and IVF.
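Recall as defined above can be measured by comparing an index's answer with the exact top-k for the same query. A minimal sketch; the function name and the sample ids are illustrative.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


# The ANN index found two of the three true nearest neighbors.
print(recall_at_k(["a", "b", "x"], ["a", "b", "c"]))
```

In practice this is averaged over a sample of representative queries, with the exact answers computed once by brute force.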

HNSW (Hierarchical Navigable Small World)

HNSW builds a multi-layer graph during ingestion. At the bottom layer every vector is a node, connected to its nearest neighbors. At each higher layer the graph becomes sparser, containing only a random subset of vectors. The structure resembles a highway system: top layers are express roads for fast long-distance navigation, lower layers are local roads for fine-grained search.

At query time, the search starts at the top layer, greedy-walks toward the query's neighborhood, drops to the next layer, repeats, and terminates at the bottom layer with a candidate set. This typically achieves 98%+ recall with query latency around 2-15ms at 10 million vectors.

HNSW uses more memory than IVF (the graph connections must be stored) and build time is slower, but query performance is consistently strong across diverse query distributions.
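The layered search can be illustrated with a toy version. This is a sketch under loud assumptions, not how any production engine builds its index: layers are built by brute-force linking each node to its m nearest neighbors (real HNSW inserts incrementally and prunes neighbors heuristically), distance is Euclidean, and all names (build_layers, search_layer, hnsw_search, ef) are chosen for illustration.

```python
import heapq
import math
import random


def build_layers(vectors, m=8, seed=0):
    """Toy build: give each vector a random top level, then link it to its
    m nearest neighbors among the nodes present on each layer."""
    if m < 2:
        raise ValueError("m must be at least 2")
    rng = random.Random(seed)
    mult = 1.0 / math.log(m)  # level multiplier; higher levels get exponentially rarer
    levels = [int(-math.log(1.0 - rng.random()) * mult) for _ in vectors]
    layers = []
    for layer in range(max(levels) + 1):
        members = [i for i, lvl in enumerate(levels) if lvl >= layer]
        graph = {i: set() for i in members}
        for i in members:
            others = [j for j in members if j != i]
            others.sort(key=lambda j: math.dist(vectors[i], vectors[j]))
            for j in others[:m]:  # links are stored in both directions
                graph[i].add(j)
                graph[j].add(i)
        layers.append(graph)
    return layers


def search_layer(graph, vectors, query, entry, ef):
    """Best-first search on one layer, keeping the ef closest nodes seen."""
    start = math.dist(vectors[entry], query)
    visited = {entry}
    candidates = [(start, entry)]  # min-heap of nodes still to expand
    results = [(-start, entry)]    # max-heap (negated) of the best ef so far
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -results[0][0]:
            break  # nothing left that could improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = math.dist(vectors[nb], query)
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-neg, node) for neg, node in results)


def hnsw_search(layers, vectors, query, k, ef=32):
    entry = next(iter(layers[-1]))      # any node on the sparsest layer
    for graph in reversed(layers[1:]):  # greedy walk (ef=1) on the upper layers
        entry = search_layer(graph, vectors, query, entry, ef=1)[0][1]
    found = search_layer(layers[0], vectors, query, entry, ef=max(ef, k))
    return [node for _, node in found[:k]]


rng = random.Random(42)
points = [[rng.random() for _ in range(4)] for _ in range(300)]
layers = build_layers(points, m=8)
print(hnsw_search(layers, points, [0.5, 0.5, 0.5, 0.5], k=5))
```

The ef parameter controls the trade-off: a larger candidate set on the bottom layer raises recall and costs more distance computations.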

IVF (Inverted File)

IVF clusters the vector space using k-means during index construction, producing a set of centroids. Each stored vector is assigned to its nearest centroid. At query time, the system computes distances from the query to all centroids, selects the top nprobe nearest clusters, and does exact search only within those clusters.

IVF uses less memory than HNSW, builds faster, and performs better under heavy metadata filtering (because candidate sets are well-defined). Recall degrades if the data distribution shifts after index build, requiring periodic re-clustering.
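The build and query steps for IVF can be sketched as follows. This is a toy version that assumes Euclidean distance and a plain Lloyd-style k-means with a fixed number of iterations; the function names and parameters other than nprobe are illustrative.

```python
import math
import random


def nearest_centroid(vector, centroids):
    return min(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))


def kmeans(vectors, n_clusters, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for v in vectors:
            buckets[nearest_centroid(v, centroids)].append(v)
        for c, bucket in enumerate(buckets):
            if bucket:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def build_ivf(vectors, n_clusters, seed=0):
    """Cluster the data, then record which vector ids belong to each centroid."""
    centroids = kmeans(vectors, n_clusters, seed=seed)
    lists = [[] for _ in centroids]
    for vid, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(vid)
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Probe the nprobe closest clusters and search exactly inside them."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in lists[c]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k]


rng = random.Random(7)
data = [[rng.random(), rng.random()] for _ in range(200)]
centroids, lists = build_ivf(data, n_clusters=8)
print(ivf_search([0.3, 0.7], data, centroids, lists, k=5, nprobe=2))
```

Raising nprobe scans more clusters and recovers neighbors that sit just across a cluster boundary; setting it to the number of clusters degenerates into exact search.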

Product Quantization (PQ)

Product Quantization is a compression technique applied on top of either index type. It splits each high-dimensional vector into sub-vectors and represents each sub-vector by a short code pointing to an entry in a learned codebook. A 1536-dimension float32 vector (about 6 KB) can be compressed to 96 bytes, a 64:1 ratio. This lets billions of vectors fit in RAM that would otherwise have to live on disk. Search runs on the compressed codes, with an optional re-scoring step on the original vectors for the top candidates.
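The encode and decode steps can be sketched as below. This is a toy illustration, assuming one k-means codebook per sub-vector position with at most 256 entries so that each code fits in one byte; all function names are hypothetical and distance computation on the codes is left out. With 96 sub-vectors of 16 dimensions each, a 1536-dimension float32 vector shrinks from 6144 bytes to 96 bytes, the 64:1 ratio mentioned above.

```python
import math
import random


def split(vector, n_sub):
    if len(vector) % n_sub:
        raise ValueError("dimension must be divisible by n_sub")
    step = len(vector) // n_sub
    return [vector[i * step:(i + 1) * step] for i in range(n_sub)]


def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, n_codes, iters, rng):
    centroids = [list(p) for p in rng.sample(points, n_codes)]
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def train_codebooks(vectors, n_sub, n_codes, iters=8, seed=0):
    """Learn one codebook per sub-vector position (n_codes <= 256)."""
    rng = random.Random(seed)
    by_position = zip(*(split(v, n_sub) for v in vectors))
    return [kmeans(list(subs), n_codes, iters, rng) for subs in by_position]


def encode(vector, codebooks):
    """Replace each sub-vector by the index of its nearest codebook entry."""
    subs = split(vector, len(codebooks))
    return bytes(nearest(sub, book) for sub, book in zip(subs, codebooks))


def decode(codes, codebooks):
    """Approximate reconstruction: concatenate the referenced entries."""
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out


rng = random.Random(0)
train = [[rng.random() for _ in range(8)] for _ in range(200)]
books = train_codebooks(train, n_sub=4, n_codes=16)
codes = encode(train[0], books)
print(len(codes), "bytes instead of", 8 * 4)
print(round(math.dist(train[0], decode(codes, books)), 3), "reconstruction error")
```

The reconstruction error is the accuracy cost of the compression, which is why engines often re-score the top candidates against the original vectors.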

Distance metrics

Cosine similarity measures the angle between two vectors, ignoring magnitude. It is the standard for text embeddings because embedding norms often carry length information that is not meaningful for semantic similarity. Cosine similarity = dot product of unit-normalized vectors.

Dot product (inner product) measures both angle and magnitude. Suitable when magnitude is meaningful, such as in recommendation systems where magnitude might encode popularity or confidence.

Euclidean distance (L2) measures straight-line distance in vector space. More sensitive to magnitude differences. Common in image and vision embeddings.

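Most vector databases support all three. For text-based RAG with OpenAI or Cohere embeddings, cosine similarity is the default and usually the right choice. When the vectors are already unit-normalized, cosine similarity and dot product produce the same ranking.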
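The three metrics are short enough to write out. The sketch below uses plain Python lists and illustrative function names; it also checks the identity stated above, that cosine similarity equals the dot product of unit-normalized vectors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def normalize(a):
    length = norm(a)
    return [x / length for x in a]


def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the magnitude

print(cosine_similarity(a, b))   # direction only: the vectors look identical
print(dot(a, b))                 # grows with magnitude
print(euclidean_distance(a, b))  # nonzero, because the magnitudes differ
print(math.isclose(cosine_similarity(a, b), dot(normalize(a), normalize(b))))
```

The same pair of vectors is a perfect match under cosine similarity and a distance of 5 apart under L2, which is why the metric has to match what the embedding model was trained for.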

04. Key Options in 2026

Pinecone

Fully managed, offering both a usage-based serverless tier and pod-based options. No infrastructure to operate. The simplest path from zero to production vector search. Supports hybrid search (vector + keyword), metadata filtering, and namespaces for data isolation. Latency typically in the tens of milliseconds at 10M vectors. Costs compound at scale: roughly $70/month at 10M vectors, $700+/month at 100M vectors. The right default for teams that want managed infrastructure and are not yet cost-constrained.

Weaviate

Open-source, self-hosted or managed cloud. The strongest native hybrid search story in the field: vector similarity, BM25 keyword scoring, and metadata filtering compose natively via GraphQL. HNSW index. Module system supports built-in vectorizers (OpenAI, Cohere, Hugging Face) so you can ingest raw text and let Weaviate call the embedding API. Good choice for production RAG where hybrid search quality matters and you can operate infrastructure.

Qdrant

Open-source, written in Rust. Fastest open-source option: p50 latency around 4ms, 10-25% faster than Weaviate or Milvus on common benchmarks. Strong metadata filtering, quantization support, and hybrid search (vector + sparse). gRPC API available for lower overhead. The go-to open-source choice when query speed is the dominant concern.

Chroma

Open-source, developer-first. Runs embedded in-process (no separate service) for prototyping and light production. Also offers a hosted cloud tier. The lowest setup friction of any vector database: pip install, three lines of code, done. Metadata filtering and basic hybrid search included. Appropriate for applications under a few million vectors and for rapid prototyping before committing to a production stack.

Milvus

Open-source, distributed architecture. Designed for billion-scale deployments. Separates compute and storage so each layer scales independently. Supports HNSW, IVF, and their quantized variants. Managed cloud version is Zilliz Cloud. The right choice for large-scale enterprise workloads that exceed what Qdrant or Weaviate handle comfortably.

pgvector (Postgres)

An extension that adds vector storage and ANN indexing to Postgres. Supports both HNSW (added in pgvector 0.5.0) and IVFFlat indexes. Metadata filtering uses standard SQL, which is more expressive than any purpose-built vector DB filter syntax. Latency is higher than in purpose-built databases, roughly 25-40ms depending on configuration, and operational friction tends to appear somewhere above 10-50M vectors. The canonical recommendation from the field in 2026 is to default to pgvector unless scale or latency requirements demand more. Teams already on Postgres avoid running a separate service, along with its operational overhead and data synchronization complexity.

FAISS

FAISS (Facebook AI Similarity Search, now maintained by Meta) is a C++ library with Python bindings for in-memory ANN search. It is not a database. There is no managed persistence (an index can be serialized to a file, but durability is up to you), no API server, no metadata storage, and no client-server separation. It is the kind of search engine that some vector databases build on internally. Use FAISS directly when you need maximum performance control in a research or batch processing context and do not need the surrounding database features.

General databases with vector support

Several general-purpose databases have added vector search: Redis (via the RediSearch module), MongoDB Atlas Vector Search, Elasticsearch (HNSW-based vector search since 8.0), OpenSearch (via its k-NN plugin), and Google AlloyDB and Cloud SQL. These are worth considering when the application already depends on one of these systems and you want to avoid adding another infrastructure component, with the understanding that vector performance may lag behind purpose-built options at scale.

05. Managed vs. Self-Hosted

Managed (Pinecone, Weaviate Cloud, Zilliz Cloud): no servers to run, automatic scaling, SLA-backed uptime. Higher per-vector cost. Data leaves your infrastructure (relevant for compliance-sensitive workloads).

Self-hosted (Qdrant, Milvus, Weaviate open-source, pgvector): full control, lower marginal cost at scale, data stays in your environment. You own availability, upgrades, and backups.

The break-even point where self-hosted costs less than managed typically sits somewhere between 20M and 100M vectors, depending on query volume and team operational capacity.

06. Metadata Filtering

Most retrieval tasks need to narrow the vector search by structured attributes: filter to documents from a specific date range, belonging to a specific user, tagged with a specific category. Metadata filtering lets you combine a vector similarity query with predicate filters.

Implementation varies significantly across databases. pgvector uses SQL WHERE clauses. Qdrant uses a purpose-built filter DSL. Weaviate uses GraphQL filters. The key design question is when the filter is applied. Pre-filtering (apply the metadata filter first, then search only the matching vectors) guarantees that every result satisfies the filter, but a highly selective filter can leave a graph index with too few reachable candidates, which hurts recall or forces a fallback to a brute-force scan. Post-filtering (run ANN search, then apply filters) wastes compute on vectors that are later discarded and can return fewer than k results when most of the top candidates fail the filter. Most modern databases handle this with hybrid strategies, and IVF indexes generally handle filtered search more stably than HNSW under high filtering ratios.
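The difference between the two orderings is easiest to see on a tiny example. In this sketch a brute-force ranking stands in for the ANN index, and the record layout, the predicate, and the fetch parameter are assumptions made for illustration rather than any database's API.

```python
import math


def pre_filter_search(query, records, predicate, k):
    """Filter first, then rank only the records that match."""
    matching = [r for r in records if predicate(r["meta"])]
    matching.sort(key=lambda r: math.dist(query, r["vector"]))
    return [r["id"] for r in matching[:k]]


def post_filter_search(query, records, predicate, k, fetch=None):
    """Rank everything, keep the top `fetch`, then drop non-matching records.
    The result can contain fewer than k ids."""
    fetch = fetch or k
    ranked = sorted(records, key=lambda r: math.dist(query, r["vector"]))[:fetch]
    return [r["id"] for r in ranked if predicate(r["meta"])][:k]


# Six points on a line; the three closest to the query belong to another user.
records = [
    {"id": i, "vector": [float(i), 0.0], "meta": {"user": "b" if i < 3 else "a"}}
    for i in range(6)
]


def owned_by_a(meta):
    return meta["user"] == "a"


query = [0.0, 0.0]
print(pre_filter_search(query, records, owned_by_a, k=2))            # [3, 4]
print(post_filter_search(query, records, owned_by_a, k=2))           # []
print(post_filter_search(query, records, owned_by_a, k=2, fetch=5))  # [3, 4]
```

Over-fetching, as in the last call, is the usual workaround for post-filtering, at the cost of scoring candidates that are thrown away.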

07. Hybrid Search

Hybrid search combines dense vector similarity with sparse keyword scoring (BM25) to address each method's failure mode. Vector search finds semantically related content but misses exact terminology matches. BM25 matches exact terms but misses paraphrased meaning. Production RAG systems should use hybrid search by default.

Results are typically merged using Reciprocal Rank Fusion (RRF), which combines ranked lists without requiring score normalization. Weaviate, Qdrant, and Milvus support native hybrid search. Pinecone supports hybrid search via a sparse-dense index. pgvector supports hybrid search when combined with the pg_textsearch or ParadeDB extensions for BM25.
See Retrieval Methods for a full treatment of hybrid search.
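RRF itself is a small formula: each document scores the sum of 1 / (k + rank) over the lists it appears in. A minimal sketch, assuming the commonly used smoothing constant k = 60 and made-up document ids; ties are broken by id so the output is deterministic.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists; only ranks are used, never raw scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["d1", "d2", "d3"]  # ranked by embedding similarity
bm25_hits = ["d3", "d1", "d4"]    # ranked by keyword score

print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists (d1 and d3) rise above those found by only one method, without any need to put cosine scores and BM25 scores on a common scale.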

08. How to Choose

Start with this decision tree:

Already on Postgres and under 10M vectors: use pgvector. Add infrastructure only when scale demands it.
Need fully managed with no infra work: Pinecone for general use, Weaviate Cloud if hybrid search quality is critical.
Need self-hosted, speed is paramount: Qdrant.
Need self-hosted, hybrid search composition is paramount: Weaviate.
Need billion-scale self-hosted: Milvus.
Prototyping or building a demo: Chroma.

The most common mistake is over-engineering the vector database choice early. Chroma or pgvector can carry most applications to production. Re-evaluate at scale.

09. Common Pitfalls and Misconceptions

"The vector database is the hard part of RAG."
Retrieval quality is more sensitive to chunking strategy, embedding model choice, and hybrid search configuration than to the vector database brand.

"More dimensions means better search."
Higher-dimensional embeddings capture more nuance but are more expensive to store and search. The embedding model's quality matters more than dimensionality.

"Cosine similarity is always correct."
It depends on the embedding model's training objective. Some models (particularly those trained with dot-product loss) perform better with raw dot product. Check the model's documentation.

"HNSW is always faster than IVF."
HNSW is usually faster for unfiltered search. Under heavy metadata filtering, IVF can outperform HNSW because the cluster structure aligns well with the filtered subset.