Recommendation Systems

In Short

A recommender system filters a large catalog to surface the items most relevant to a specific user, using patterns in behavior and content. The field ranges from classical methods such as collaborative filtering and matrix factorization to two-tower neural networks and, most recently, LLM-based and generative recommendation.

01. What It Is

A recommender system is an information filtering system that predicts a user's preference for items and surfaces a ranked subset of them. The term covers any automated pipeline that answers "what should this user see next?" at scale.

Large-scale systems usually split the problem into two stages:

Retrieval (candidate generation):
Narrow millions of items down to hundreds of plausible candidates quickly.
Ranking:
Score and order those candidates with a more expensive model using richer features.

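Netflix, Spotify, Amazon, and YouTube all run multi-stage pipelines along these lines. The retrieval stage prioritizes recall, because an item dropped there can never be ranked. The ranking stage prioritizes precision at the top of the list, because only a handful of items are actually shown.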
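
The funnel can be sketched in a few lines of Python. This is a minimal illustration, not any platform's actual pipeline: the function recommend, its parameters, and the two toy score tables are all hypothetical stand-ins for a cheap retrieval model and a costly ranking model.

```python
from typing import Callable, Iterable


def recommend(
    catalog: Iterable[str],
    cheap_score: Callable[[str], float],
    expensive_score: Callable[[str], float],
    n_candidates: int = 100,
    k: int = 10,
) -> list[str]:
    """Two-stage funnel: cheap retrieval over the catalog, costly ranking over the survivors."""
    # Stage 1 (retrieval): keep the n_candidates items with the best cheap score.
    candidates = sorted(catalog, key=cheap_score, reverse=True)[:n_candidates]
    # Stage 2 (ranking): re-order only those candidates with the expensive model.
    return sorted(candidates, key=expensive_score, reverse=True)[:k]


# Toy data (hypothetical): a cheap popularity score and a costly personalized score.
popularity = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
personal_fit = {"a": 0.2, "b": 0.95, "c": 0.99, "d": 0.5}

top = recommend(popularity, popularity.get, personal_fit.get, n_candidates=3, k=2)
print(top)  # ['b', 'd']
```

Item "c" has the best personalized score but never reaches the ranker because retrieval dropped it, which is why the first stage is tuned for recall.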

02. Why It Matters

According to the Netflix tech blog, about 75% of what members watch comes from some form of recommendation. Amazon's item-to-item collaborative filtering ("customers who bought X also bought Y") is estimated to drive a significant share of revenue. Without recommendation systems, users face the paradox of choice across catalogs of millions of items and are more likely to abandon the platform before finding something relevant.

The problem is also computationally interesting: a system must personalize in real time for hundreds of millions of users across catalogs that change daily, while satisfying latency budgets measured in milliseconds.

03. How It Works

Collaborative filtering

Collaborative filtering (CF) recommends items based on the behavior of similar users or items, without needing to understand what an item is.

User-based CF:
Find users with similar rating or interaction histories, then recommend items they liked that the target user has not seen. Similarity is typically computed with cosine similarity or Pearson correlation. The main drawback is scalability: the number of user pairs grows quadratically, so computing pairwise similarity across millions of users is expensive.
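
A minimal sketch of user-based CF with cosine similarity follows. The helper names (cosine, recommend_user_based), the n_neighbors parameter, and the toy ratings are assumptions made for illustration; real systems add rating normalization and neighbor pruning.

```python
import math


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse rating vectors keyed by item id."""
    dot = sum(u[i] * v[i] for i in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend_user_based(target: str, ratings: dict, n_neighbors: int = 2) -> list[str]:
    """Score items the target has not seen by similarity-weighted neighbor ratings."""
    neighbors = sorted(
        ((cosine(ratings[target], r), user) for user, r in ratings.items() if user != target),
        reverse=True,
    )[:n_neighbors]
    scores: dict[str, float] = {}
    for sim, user in neighbors:
        for item, rating in ratings[user].items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


# Toy explicit ratings (hypothetical users and titles).
ratings = {
    "ann": {"matrix": 5, "inception": 4},
    "bob": {"matrix": 5, "inception": 5, "tenet": 4},
    "cat": {"notebook": 5, "titanic": 4, "inception": 1},
}
result = recommend_user_based("ann", ratings)
print(result)  # ['tenet', 'notebook', 'titanic']
```

"bob" is far more similar to "ann" than "cat" is, so his unseen item dominates the list.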

Item-based CF:
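Compute similarity between items based on how users rated them, then recommend items similar to those the user already liked. Amazon popularized this approach. It is more stable over time because item-item relationships change more slowly than individual user profiles, so the similarities can be precomputed offline and refreshed periodically.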
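
As a rough sketch, item-item similarity over binary purchase data can be computed as below. The cosine-over-purchase-sets formulation, the function names, and the toy baskets are assumptions for illustration, not a description of Amazon's production algorithm.

```python
import math
from collections import defaultdict


def item_similarities(purchases: dict[str, set[str]]) -> dict[str, dict[str, float]]:
    """Cosine similarity between items, treating each item as a binary vector over users."""
    buyers: dict[str, set[str]] = defaultdict(set)  # item -> users who bought it
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)

    sims: dict[str, dict[str, float]] = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            if a == b:
                continue
            overlap = len(buyers[a] & buyers[b])
            if overlap:
                sims[a][b] = overlap / math.sqrt(len(buyers[a]) * len(buyers[b]))
    return sims


def also_bought(item: str, sims: dict[str, dict[str, float]], k: int = 2) -> list[str]:
    """Top-k items most similar to `item`."""
    neighbors = sims.get(item, {})
    return sorted(neighbors, key=neighbors.get, reverse=True)[:k]


# Toy purchase baskets (hypothetical).
purchases = {
    "u1": {"tent", "stove", "lantern"},
    "u2": {"tent", "stove"},
    "u3": {"tent", "novel"},
    "u4": {"novel"},
}
sims = item_similarities(purchases)
print(also_bought("tent", sims))  # ['stove', 'lantern']
```

The similarity table depends only on items, so it can be built in a batch job and served as a lookup.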

Matrix factorization:
Matrix factorization has been the dominant CF method since the Netflix Prize (2006-2009). The user-item interaction matrix is approximated by the product of two lower-dimensional matrices: one containing user latent factors and one containing item latent factors. The dot product of a user's and an item's latent vectors predicts the interaction. Simon Funk's implementation of this approach, later called Funk MF, learned the factors by stochastic gradient descent on the observed ratings only and achieved a highly competitive RMSE on the Netflix Prize dataset. SVD++ extends Funk MF by incorporating implicit feedback (for example, which items a user rated, clicked, or viewed) alongside explicit ratings.
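
The factorization can be sketched as stochastic gradient descent over observed ratings. This is a simplified illustration in the spirit of Funk MF, not his original code: it updates all factors at once, omits bias terms, and the hyperparameters (k, lr, reg, epochs, seed) and toy ratings are assumptions.

```python
import math
import random


def train_mf(ratings, k=2, lr=0.02, reg=0.02, epochs=300, seed=0):
    """Learn user factors P and item factors Q so that dot(P[u], Q[i]) approximates r."""
    rng = random.Random(seed)
    P = {u: [rng.uniform(0, 0.1) for _ in range(k)] for u, _, _ in ratings}
    Q = {i: [rng.uniform(0, 0.1) for _ in range(k)] for _, i, _ in ratings}
    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            err = r - sum(p * q for p, q in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating: dot product of the user's and the item's latent vectors."""
    return sum(p * q for p, q in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    """Root mean squared error over the given (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings))


# Toy observed ratings (hypothetical); unobserved pairs are simply absent.
ratings = [
    ("ann", "matrix", 5), ("ann", "inception", 4),
    ("bob", "matrix", 4), ("bob", "inception", 4), ("bob", "tenet", 5),
    ("cat", "inception", 2), ("cat", "tenet", 1),
]
P, Q = train_mf(ratings)
print("training RMSE:", round(rmse(ratings, P, Q), 3))
print("predicted ann/tenet:", round(predict(P, Q, "ann", "tenet"), 2))
```

The last line scores a pair that never appeared in training, which is the point of the method: the learned factors generalize to unobserved cells of the matrix.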

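The main weaknesses of CF are cold start, sparsity, and scalability. A brand-new user or item has no interaction history, so the model has nothing from which to learn a useful embedding. Even established users interact with only a tiny fraction of the catalog, which leaves the interaction matrix extremely sparse.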

Content-based filtering

Content-based methods recommend items similar to items the user has interacted with, using item features. Pandora's Music Genome Project tagged each song with up to roughly 450 attributes and built stations by finding songs whose attribute vectors were similar to those of songs a user liked. The approach handles cold start on items well (a new song can be tagged immediately) but tends toward over-specialization: it rarely recommends genuinely surprising items.
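
A content-based recommender can be sketched as a user profile averaged from liked items and compared against every unseen item. The three attribute dimensions, the song names, and the function recommend_by_content are hypothetical; this is not Pandora's actual scoring method.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_by_content(liked: list[str], features: dict[str, list[float]], k: int = 1) -> list[str]:
    """Rank unseen items by similarity to the mean attribute vector of liked items."""
    dims = len(next(iter(features.values())))
    profile = [sum(features[s][d] for s in liked) / len(liked) for d in range(dims)]
    unseen = [s for s in features if s not in liked]
    return sorted(unseen, key=lambda s: cosine(profile, features[s]), reverse=True)[:k]


# Hypothetical attribute vectors: [tempo, acousticness, vocal prominence].
features = {
    "song_a": [0.9, 0.1, 0.8],
    "song_b": [0.8, 0.2, 0.9],
    "song_c": [0.1, 0.9, 0.2],
    "new_song": [0.85, 0.15, 0.7],  # just released: no listening history at all
}
picks = recommend_by_content(["song_a", "song_b"], features)
print(picks)  # ['new_song']
```

The new song is recommended purely from its tags, which illustrates the item cold-start advantage. The same mechanism explains over-specialization: anything far from the profile scores low.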

Hybrid systems

Most production systems combine both. Netflix uses collaborative signals (what similar users watched) alongside content signals (genre, director, cast) and contextual signals (device, time of day). Hybrid approaches generally outperform pure methods on both accuracy and cold-start resilience, per Ricci et al.'s Recommender Systems Handbook (2022).

The two-tower neural architecture

The two-tower model is the dominant architecture for large-scale retrieval. Two neural networks encode users and items independently into a shared vector space. Because the item tower does not depend on the user, all item embeddings can be pre-computed offline and stored in an approximate nearest neighbor (ANN) index (e.g., ScaNN or FAISS). At serving time only the user embedding is computed, and retrieval reduces to a nearest neighbor search that stays fast even over catalogs of potentially billions of items.
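
The retrieval step can be illustrated with an exact top-k search by dot product. This is a stand-in only: a production system would query an ANN index rather than scan every item, and the embeddings, item names, and the retrieve function here are made up for the example.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_emb: list[float], item_index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Exact top-k items by dot product with the user embedding (brute force, not ANN)."""
    return heapq.nlargest(k, item_index, key=lambda item: dot(user_emb, item_index[item]))


# Pre-computed item embeddings (hypothetical output of an item tower).
item_index = {
    "video_1": [0.9, 0.1],
    "video_2": [0.2, 0.8],
    "video_3": [0.7, 0.3],
}
# User embedding computed at request time (hypothetical output of a user tower).
user_emb = [1.0, 0.0]

candidates = retrieve(user_emb, item_index)
print(candidates)  # ['video_1', 'video_3']
```

The brute-force scan costs time linear in catalog size per request. ANN indexes trade a small amount of recall for sublinear lookup, which is what makes the approach viable at scale.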

The user tower typically takes interaction history, demographics, and session context as input. The item tower takes metadata, content embeddings, and popularity signals. The towers are trained jointly to maximize similarity between a user's embedding and embeddings of items they interacted with, and to minimize similarity with negative samples.
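
One common way to express this objective is a softmax loss with in-batch negatives: for each user in a batch, the item they interacted with is the positive and the other users' items act as negatives. The sketch below assumes that setup and takes tower outputs as plain lists. It omits refinements such as correcting for how often popular items appear as negatives, and the function name is hypothetical.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_softmax_loss(user_embs: list[list[float]], item_embs: list[list[float]]) -> float:
    """Mean cross-entropy where row i's positive is item i and all other items are negatives."""
    total = 0.0
    for i, user in enumerate(user_embs):
        logits = [dot(user, item) for item in item_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / len(user_embs)


# Toy batch (hypothetical tower outputs): user i interacted with item i.
users = [[2.0, 0.0], [0.0, 2.0]]
aligned_items = [[2.0, 0.0], [0.0, 2.0]]     # positives point the same way as their users
misaligned_items = [[0.0, 2.0], [2.0, 0.0]]  # positives point toward the wrong user

print(in_batch_softmax_loss(users, aligned_items))
print(in_batch_softmax_loss(users, misaligned_items))
```

The loss is small when each user's embedding is closest to their own positive item and large otherwise, so minimizing it pulls matching pairs together and pushes in-batch negatives apart.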

Yi et al. (2019), "Sampling-bias-corrected neural modeling for large corpus item recommendations" (Google, RecSys 2019), formalized the two-tower retrieval architecture. An earlier precursor is Covington et al. (2016), "Deep neural networks for YouTube recommendations" (RecSys 2016), which introduced the deep candidate-generation-and-ranking pipeline that two-tower retrieval later refined. The architecture is now standard at Google, Meta, Twitter, and most large-scale platforms.

LLM-based and generative recommendation

The newest frontier replaces or augments embedding-based systems with large language models. Two directions:

LLMs as feature encoders:
Use a pre-trained LLM to generate rich item embeddings from text descriptions, replacing hand-crafted content features. This improves zero-shot performance on new items.
Generative recommendation:
Frame recommendation as a sequence generation task. Meta's HSTU (Hierarchical Sequential Transduction Units) architecture treats all user actions as tokens in a generative model, enabling training at trillion-parameter scale. Wikipedia's entry on recommender systems notes that generative recommenders "improve recommendation quality in test simulations and in real-world tests, while being faster than previous Transformer-based systems when handling long lists of user actions."

As of 2025-2026, generative recommenders are in production at Meta and are being evaluated at other large platforms.

04. Key Terms and Methods

Term	Definition
Collaborative filtering	Recommendations based on patterns across users, not item content
Content-based filtering	Recommendations based on item features matched to user preferences
Matrix factorization	Decomposing the user-item matrix into latent factor vectors
Two-tower model	Dual neural network encoding users and items into a shared space
Cold-start problem	Inability to recommend for users or items with no interaction history
Retrieval stage	Fast candidate generation, typically via approximate nearest-neighbor search
Ranking stage	Expensive, feature-rich scoring of retrieved candidates
Precision@k	Fraction of the top-k recommendations that are relevant
NDCG	Normalized Discounted Cumulative Gain. Measures ranking quality, weighting higher positions more heavily
Implicit feedback	Behavioral signals (clicks, views, dwell time) rather than explicit ratings
Generative recommendation	Framing recommendation as token-sequence generation
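
The two ranking metrics in the table can be computed as follows. This sketch assumes the linear-gain form of DCG with a log2 position discount (some references use 2^rel - 1 as the gain instead). The function names and toy data are illustrative.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the top-k list divided by the DCG of the ideal ordering."""
    def dcg(values: list[float]) -> float:
        # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(values))

    actual = dcg([gains.get(item, 0.0) for item in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


# Toy example: graded relevance for three items; "e" is relevant but was not recommended.
ranked = ["a", "b", "c", "d"]
gains = {"a": 3.0, "c": 2.0, "e": 1.0}

p3 = precision_at_k(ranked, set(gains), k=3)
n3 = ndcg_at_k(ranked, gains, k=3)
print(round(p3, 3), round(n3, 3))  # 0.667 0.84
```

Precision@k ignores order within the top k, while NDCG penalizes the list for placing the relevant item "c" below the irrelevant item "b".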

05. Examples

Netflix:
Multi-stage pipeline with offline matrix factorization and online ranking. The homepage is entirely personalized: genre rows, thumbnail selection, and item ordering all vary per user. A 2013 Netflix tech blog post on the system architecture describes a split between offline batch jobs, nearline computation, and real-time online components.

Spotify:
Spotify uses collaborative filtering to power playlist recommendations, combining listening history across millions of users. Natural language processing on playlist titles and song metadata adds content signals for new tracks.

Amazon:
Item-to-item collaborative filtering powers "customers who bought X also bought Y." The system computes item-item similarity offline and serves results online with very low latency.

YouTube:
YouTube shifted its ranking signals over time from click-through rate toward watch time and dwell time. Published descriptions of the system (Covington et al., 2016; Yi et al., 2019) pair a neural retrieval stage, later in two-tower form, with a deep ranking network. Session-based signals (what has been watched in the current session) are particularly important.

06. Common Pitfalls and Misconceptions

Higher offline accuracy does not equal better user experience:
Researchers have documented an "accuracy barrier" in recommender systems: improvements in RMSE or precision@k beyond a point do not translate into better member satisfaction in A/B tests. Netflix found this directly when evaluating the Grand Prize ensemble from the Netflix Prize competition: the accuracy gains did not justify production deployment.

Collaborative filtering does not understand items:
It identifies patterns in behavior. If users who like chess documentaries also tend to watch cooking shows, the system will make that connection, but it has no understanding of why.

Cold start is not solved:
Hybrid systems mitigate it but do not eliminate it. A genuinely new item or user with no interaction history will always receive degraded recommendations relative to established entities.

Popularity bias:
Systems trained on interaction data tend to amplify already-popular items, reducing exposure for long-tail content. Diversity-aware ranking is an active research area.

Offline evaluation is unreliable:
Evaluations on held-out historical data often correlate poorly with online A/B test results, because the logged data was itself shaped by the previous recommendation system: users could only interact with what they were shown. Results from offline evaluations should be treated as a filter for candidate models, not as a substitute for online testing.