NLP Fundamentals

Foundations 9 min read Updated 23 Jun 2026

In Short

Natural language processing went through a series of distinct technological generations before the transformer era. Understanding the pre-transformer stack (tokenization, bag-of-words, TF-IDF, n-grams, word embeddings, RNNs) shows why attention mechanisms were such a breakthrough and what problems they actually solved.

01. What It Is

Natural language processing is the subfield of AI concerned with enabling computers to understand, generate, and manipulate human language. The core challenge is that language is discrete, ambiguous, context-dependent, and reflects world knowledge that no formal grammar can fully specify.

The pre-transformer NLP pipeline, dominant from roughly the 1990s to 2017, was a sequence of components that progressively converted raw text into structured numerical representations. Each stage addressed a different aspect of the translation from human language to machine-processable features.

02. Why It Matters

Every modern language model builds on the concepts that emerged in the pre-transformer era. Tokenization still happens before every transformer forward pass. Word embeddings were the direct conceptual predecessor of the contextual embeddings that transformers produce. The failures of RNNs and LSTMs are precisely what the attention mechanism was designed to address. Knowing the history clarifies why current architectures make the choices they do.

The pre-transformer approaches are also still in production. Sentiment analysis pipelines at many companies use TF-IDF with logistic regression or gradient boosting. Search engines use TF-IDF variants at scale. Lightweight, non-transformer named entity recognizers still ship in spaCy, NLTK, and the Stanford NLP tools for tasks where full transformer inference would be too slow or expensive.

03. How It Works

Tokenization

Before any processing, raw text is split into tokens. A token is the unit of processing: in classical NLP it is typically a word or a punctuation mark. In modern transformer systems it is a subword unit (see the tokens-and-tokenization file), but the principle is the same: convert a string into a sequence of discrete symbols that a model can process.

Word tokenization must handle contractions ("don't" = one token or two?), hyphenation, languages that do not use spaces (Chinese, Japanese), and special characters. Tokenization decisions affect everything downstream.
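A minimal sketch of a rule-based word tokenizer may make these decisions concrete. The pattern below is an illustrative assumption, not a standard: it keeps contractions as one token, keeps decimal numbers together, emits each punctuation mark separately, and only recognizes ASCII letters.

```python
import re

# Hypothetical pattern: a word with an optional internal apostrophe,
# or a number with an optional decimal part, or one punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[^\sA-Za-z\d]")

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't panic, it's 3.5 miles!"))
```

Choosing to split "don't" into "do" and "n't" instead would change the vocabulary and every count derived from it, which is why the tokenizer has to be fixed before anything downstream is built.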

Stemming and lemmatization

Both reduce words to a common base form to collapse morphological variants. "Running," "runs," and "ran" should all map to the same concept.

Stemming is a heuristic process that chops word endings using rules. The Porter stemmer (1980) reduces "running" to "run" and "studies" to "studi." It is fast but produces non-words.

Lemmatization uses a morphological dictionary and grammatical analysis to return the actual dictionary form (the lemma). "Studies" maps to "study," "better" maps to "good." Lemmatization is more accurate but slower and language-specific.
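The contrast between the two can be sketched in a few lines. This is a toy illustration, not the Porter algorithm and not a real lemmatizer: the suffix rules and the tiny lemma dictionary are both hypothetical and cover only the examples in this section.

```python
# Hypothetical suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Chop a known suffix if at least three characters remain."""
    w = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            stem = w[: len(w) - len(suffix)] + replacement
            # Undouble a final consonant ("runn" -> "run"), except l, s, z.
            if suffix in ("ing", "ed") and stem[-1] == stem[-2] and stem[-1] not in "lsz":
                stem = stem[:-1]
            return stem
    return w

# Hypothetical stand-in for a morphological dictionary.
TOY_LEMMAS = {"studies": "study", "better": "good", "ran": "run",
              "running": "run", "runs": "run"}

def toy_lemmatize(word):
    """Look the word up; fall back to the lowercased word itself."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

for word in ["running", "studies", "ran", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer leaves the irregular form "ran" untouched and turns "studies" into the non-word "studi", while the dictionary lookup handles both but only for words it already knows.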

Bag-of-words (BoW)

The bag-of-words model represents a document as a vector of word counts, ignoring order and grammar entirely. A vocabulary of 50,000 words produces a 50,000-dimensional sparse vector for each document, with each dimension holding the count of that word. Despite discarding all sequential information, bag-of-words is a surprisingly effective baseline for document classification and topic detection.

The limitations are significant: two sentences with identical words in different orders produce identical BoW vectors. "The dog bit the man" and "The man bit the dog" are the same document under BoW.
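The order-blindness is easy to check directly. The sketch below assumes whitespace tokenization and lowercasing, which is a simplification, and uses a four-word vocabulary chosen for the example.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())

def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed vocabulary order (0 for absent words)."""
    return [counts[word] for word in vocabulary]

vocabulary = ["bit", "dog", "man", "the"]
a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")

print(to_vector(a, vocabulary))
print(to_vector(b, vocabulary))
print(a == b)  # word order is lost, so the two counts are identical
```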

TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) addresses a key weakness of raw word counts: common words like "the," "is," and "a" appear in nearly every document and carry almost no discriminative signal. TF-IDF multiplies each word's count in a document (term frequency) by its inverse document frequency, typically the logarithm of the total number of documents divided by the number of documents that contain the word. Words that are frequent in a specific document but rare across the corpus get high TF-IDF scores, which marks them as the most informative terms.
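One common formulation can be sketched as follows. It assumes whitespace tokenization, length-normalized term frequency, and an unsmoothed logarithmic IDF; production libraries usually add smoothing and vector normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: score} dict per document in the corpus."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

corpus = ["the cat sat", "the dog barked", "the cat purred"]
for doc_scores in tf_idf(corpus):
    print(doc_scores)
```

In this toy corpus "the" occurs in every document, so its IDF is log(1) = 0 and it drops out, while "dog", which occurs in only one document, scores higher than "cat", which occurs in two.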

TF-IDF and its descendants, notably BM25, are still the default ranking functions in many production search engines. They are fast to compute, interpretable, and effective for keyword-based retrieval.

N-grams

An n-gram is a contiguous sequence of n tokens. Unigrams are individual words. Bigrams are two-word sequences. Trigrams are three-word sequences. Adding bigrams and trigrams to a bag-of-words representation captures some local phrase structure. "New York" as a bigram carries different meaning than "new" and "York" as separate unigrams.

Language models in the pre-neural era were n-gram language models: they estimated the probability of the next word given the previous n-1 words using count-based statistics from a corpus. Google's web-scale n-gram dataset (published in 2006) was a landmark resource that enabled statistical machine translation and spell correction at scale.
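A count-based bigram model can be sketched in a few lines. This version uses the plain maximum-likelihood estimate with no smoothing, which is an assumption made for brevity; practical n-gram models smooth the counts so that unseen sequences do not get probability zero.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_probability(tokens, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    bigram_counts = Counter(ngrams(tokens, 2))
    # Only tokens that have a successor can act as a context.
    prev_counts = Counter(tokens[:-1])
    if prev_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / prev_counts[prev]

tokens = "new york is a city and new york is big".split()
print(ngrams(tokens, 2)[:3])
print(bigram_probability(tokens, "new", "york"))
print(bigram_probability(tokens, "is", "a"))
```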

Word embeddings: Word2Vec and GloVe

Classical BoW and TF-IDF vectors are sparse (mostly zeros) and carry no information about the relationship between words: "cat" and "kitten" are as different as "cat" and "justice" in a BoW representation.

Word2Vec (Mikolov et al., Google, 2013) trains a shallow neural network to predict a word from its context (the CBOW variant) or a context from a word (the skip-gram variant). The hidden layer weights become dense word vectors where semantically similar words cluster together. The famous property: vector("king") - vector("man") + vector("woman") approximates vector("queen"). Word2Vec vectors are typically 100-300 dimensions, trained once on a large corpus, and reused across tasks.
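The analogy arithmetic can be illustrated with hand-made vectors. The three-dimensional values below are invented for the example and are not real Word2Vec output; the sketch only shows the mechanics of vector offset plus cosine similarity, with the query words excluded from the candidates as is conventional.

```python
import math

# Hypothetical toy vectors; real embeddings have 100-300 learned dimensions.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", TOY_VECTORS))
```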

GloVe (Pennington et al., Stanford, 2014) takes a different approach: it trains on global word-word co-occurrence statistics across the entire corpus, combining the strengths of count-based methods (like LSA) with the predictive methods of Word2Vec. Both produce qualitatively similar vector spaces. The GloVe paper reported slightly better results on word analogy tasks, though later comparisons found that the gap depends heavily on hyperparameters and training data.

The critical limitation of both: each word has a single vector regardless of context. "Bank" in "river bank" and "bank" in "bank account" share a single representation that conflates both meanings. This is exactly what contextual embeddings from transformers fix.

Named entity recognition (NER)

NER identifies and classifies named entities in text: people (PER), organizations (ORG), locations (LOC), dates, monetary values, and other categories. "Apple acquired Beats for $3 billion" should produce [Apple]ORG, [Beats]ORG, [$3 billion]MONEY.

Pre-transformer NER used sequence labeling models (CRF, BiLSTM-CRF) that assigned a label to each token based on its features and the labels of neighboring tokens. Transformer-based NER (using BERT or similar) dramatically improved accuracy by providing rich contextual representations.
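Per-token labels still have to be merged back into entity spans. The sketch below assumes the common BIO tagging scheme (B- begins an entity, I- continues it, O is outside any entity), which this section does not name explicitly, and a tokenization that splits "$3" into two tokens. The tags are written by hand here; in a real system they would come from the sequence labeling model.

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (entity text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None

    def flush():
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": close any open entity
            flush()
            current_tokens, current_label = [], None
    flush()
    return spans

tokens = ["Apple", "acquired", "Beats", "for", "$", "3", "billion"]
tags = ["B-ORG", "O", "B-ORG", "O", "B-MONEY", "I-MONEY", "I-MONEY"]
print(bio_to_spans(tokens, tags))
```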

Sentiment analysis

Sentiment analysis classifies the emotional polarity of text: positive, negative, or neutral. Rule-based approaches use word lists (VADER, LIWC). ML approaches train a classifier on labeled reviews or social media posts. Sentiment analysis is one of the most commercially deployed NLP tasks, used in customer feedback analysis, brand monitoring, and financial news processing.

The challenge is irony, negation, and domain specificity. "Great, just what I needed, another bug" is negative. "Not bad" is positive. Domain shift is severe: a word like "sick" is negative in medical records and positive in skateboard culture.
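A toy lexicon scorer shows both what simple rules can handle and where they break. The word lists are hypothetical and far smaller than VADER or LIWC, and the negation rule, which flips only the token immediately after a negator, is a deliberate simplification.

```python
import re

# Hypothetical miniature lexicons.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}
NEGATORS = {"not", "never", "no"}

def lexicon_sentiment(text):
    """Sum word polarities; a negator flips the next token's polarity."""
    score = 0
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATORS:
            negate = True
            continue
        polarity = 0
        if token in POSITIVE:
            polarity = 1
        elif token in NEGATIVE:
            polarity = -1
        score += -polarity if negate else polarity
        negate = False
    return score

print(lexicon_sentiment("Not bad"))
print(lexicon_sentiment("Great, just what I needed, another bug"))
```

The negation rule scores "Not bad" as positive, but the sarcastic bug report also comes out positive because "great" is the only word the lexicon recognizes, which is the failure mode described above.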

Part-of-speech (POS) tagging

POS tagging labels each word in a sentence with its grammatical role: noun, verb, adjective, adverb, preposition, determiner, etc. "The quick brown fox jumps over the lazy dog" maps to DET ADJ ADJ NOUN VERB PREP DET ADJ NOUN. POS tags are used as features in downstream tasks including NER, dependency parsing, and machine translation. Pre-transformer taggers used Hidden Markov Models and Conditional Random Fields, and the best of them already reached roughly 97% token accuracy on the Penn Treebank. Transformer-based taggers exceed 97% on standard benchmarks, a modest rather than dramatic gain.
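The simplest statistical tagger is a most-frequent-tag baseline: give each word the tag it received most often in training data. The sketch below is that baseline, not an HMM or CRF, and the one-sentence training set and the NOUN fallback for unknown words are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Map each lowercased word to its most frequent training tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag(words, model, default="NOUN"):
    """Tag each word independently; unknown words get the default tag."""
    return [model.get(word.lower(), default) for word in words]

training = [[("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"),
             ("fox", "NOUN"), ("jumps", "VERB"), ("over", "PREP"),
             ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]]
model = train_most_frequent_tag(training)
print(tag("the lazy fox jumps".split(), model))
```

Because each word is tagged in isolation, this baseline cannot resolve words whose tag depends on context, which is the gap that HMMs and CRFs fill by also modeling neighboring tags.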

From RNNs and LSTMs to transformers

The dominant NLP architectures from roughly 2015 to 2017 were the Recurrent Neural Network (RNN) and its gated variant, Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber, 1997). An RNN processes tokens sequentially, maintaining a hidden state that is updated at each step. LSTMs added gating mechanisms (input, forget, and output gates) to control what information flows through the hidden state, enabling the model to remember relevant information over longer sequences.

RNNs had two critical limitations. First, the hidden state is a fixed-size vector that must compress all prior context, making it hard to retain information from far back in a long sequence. Second, sequential processing means each timestep must wait for the previous one, making parallelization across sequence positions impossible and training on long texts slow.
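Both limitations are visible in the forward pass of a plain (Elman-style) RNN. The sketch below uses pure Python lists and hand-picked weights, both assumptions for readability; it omits the output layer, training, and the LSTM gates.

```python
import math

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a plain RNN over a sequence and return the final hidden state.

    inputs: list of input vectors, one per timestep
    w_xh:   hidden_size x input_size weight matrix
    w_hh:   hidden_size x hidden_size weight matrix
    b_h:    hidden_size bias vector
    """
    hidden_size = len(b_h)
    h = [0.0] * hidden_size
    for x in inputs:  # each step needs the previous h, so no parallelism
        h = [
            math.tanh(
                sum(w_xh[i][j] * x[j] for j in range(len(x)))
                + sum(w_hh[i][k] * h[k] for k in range(hidden_size))
                + b_h[i]
            )
            for i in range(hidden_size)
        ]
    return h

w_xh = [[0.5], [-0.3]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
short = rnn_forward([[1.0], [0.5]], w_xh, w_hh, b_h)
long = rnn_forward([[1.0]] * 50, w_xh, w_hh, b_h)
print(len(short), len(long))  # the state has the same size for both
```

Whether the sequence has 2 tokens or 50, everything the model knows about it has to fit in the same two numbers, and the loop cannot start step t before step t-1 has finished.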

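The transformer (Vaswani et al., "Attention Is All You Need," 2017) replaced recurrence with self-attention: every token can directly attend to every other token in the sequence in a single layer, with no fixed-size bottleneck between distant positions and no sequential dependency across timesteps. This allowed full parallelization across sequence positions during training and greatly eased the long-range dependency problem, at the price of attention cost that grows quadratically with sequence length. On most benchmarks, the NLP stack described above quickly became a baseline rather than the state of the art, although much of it remains in production where speed and cost matter more than peak accuracy.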
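The core computation can be sketched as scaled dot-product self-attention. To stay short, this version assumes the queries, keys, and values are the input vectors themselves; a real transformer layer applies learned projection matrices first and uses multiple heads, both omitted here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Return (outputs, weights) for a list of token vectors x."""
    d = len(x[0])
    outputs, all_weights = [], []
    for q in x:  # each row is independent, so rows can run in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted average of all token vectors, near or far alike.
        outputs.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1 and covers every position in the sequence, so the first token reaches the last one in a single step instead of through a chain of hidden-state updates.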

04. Key Terms

Token:
The basic unit of text processing. A word, subword, or character depending on the system.

Stemming:
Heuristic reduction of words to a root form by removing suffixes.

Lemmatization:
Reduction of words to their dictionary form using morphological analysis.

Bag-of-words:
A document represented as a vector of word counts, ignoring order.

TF-IDF:
Term frequency weighted by inverse document frequency, emphasizing rare and distinctive words.

N-gram:
A contiguous sequence of n tokens. Used to capture local phrase context.

Word2Vec:
A shallow neural network trained to predict words from context, producing dense word embeddings.

GloVe:
Word embeddings from Stanford, trained on global co-occurrence statistics.

NER:
Named entity recognition. Identifying and classifying named entities in text.

POS tagging:
Part-of-speech tagging. Labeling each token with its grammatical role.

LSTM:
Long short-term memory. A gated recurrent architecture that addressed the vanishing gradient problem in plain RNNs.

Sentiment analysis:
Classification of text by emotional polarity (positive, negative, neutral).

Contextual embedding:
A word representation that changes based on surrounding context, as produced by BERT and later transformers.

05. Examples

A search engine indexes 10 billion documents using TF-IDF. When a user queries "machine learning tutorial," TF-IDF retrieves documents where "machine learning" and "tutorial" are distinctive terms, not documents that merely mention "machine" in an unrelated context.

A pre-transformer chatbot for customer support uses a bag-of-words classifier to route incoming messages to the right department. "I can't log in" maps to authentication, "my order hasn't arrived" maps to shipping. The model is interpretable: the highest-weighted terms for each class are visible and auditable.

Word2Vec embeddings trained on a financial news corpus reveal that "bull" and "bear" sit close together because they appear in the same market-direction contexts, even though they are opposites; that "Fed" and "Federal Reserve" are nearly identical in that space; and that company names cluster by sector.

06. Common Pitfalls and Misconceptions

"Word2Vec understands language."
Word2Vec learns statistical co-occurrence patterns. It does not understand meaning. The "king - man + woman = queen" arithmetic works because of distributional regularities in training text, not semantic reasoning.

"Transformers made the old stack obsolete for everything."
Transformers are resource-intensive. For low-latency, high-throughput production systems processing simple classification tasks, TF-IDF + logistic regression is still widely deployed because it is often orders of magnitude faster at inference and uses a fraction of the memory.

"Tokenization is trivial."
Tokenization decisions have significant downstream effects on model performance, especially for multilingual systems, code, and rare-word handling. The choice of tokenizer and vocabulary size affects how well a model handles morphologically rich languages.

"Sentiment analysis works out of the box."
Sentiment systems trained on movie reviews perform poorly on medical records, legal documents, or social media slang. Domain adaptation is almost always required.