03. How It Works
Tokenization
Before any processing, raw text is split into tokens. A token is the unit of processing: in classical NLP it is typically a word or a punctuation mark. In modern transformer systems it is a subword unit (see the tokens-and-tokenization file), but the principle is the same: convert a string into a sequence of discrete symbols that a model can process.
Word tokenization must handle contractions ("don't" = one token or two?), hyphenation, languages that do not use spaces (Chinese, Japanese), and special characters. Tokenization decisions affect everything downstream.
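As a minimal sketch of these decisions, the regular-expression tokenizer below keeps contractions and hyphenated words as single tokens and splits punctuation off. The pattern and the name `tokenize` are illustrative assumptions, not a standard; a different policy (for example, splitting "don't" into "do" and "n't") is equally defensible.

```python
import re

# Assumed policy: a token is a run of word characters, optionally joined by
# internal apostrophes or hyphens, or else a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't panic: it's a state-of-the-art tokenizer!"))
# ["Don't", 'panic', ':', "it's", 'a', 'state-of-the-art', 'tokenizer', '!']
```

A pattern like this relies on whitespace and punctuation, so it would return an unbroken run of Chinese or Japanese text as one token; those languages need a dedicated segmenter.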
Stemming and lemmatization
Both reduce words to a common base form to collapse morphological variants. "Running," "runs," and "ran" should all map to the same concept.
Stemming is a heuristic process that chops word endings using rules. The Porter stemmer (1980) reduces "running" to "run" and "studies" to "studi." It is fast, but it often produces non-words and cannot connect irregular forms: "ran" stays "ran."
Lemmatization uses a morphological dictionary and grammatical analysis to return the actual dictionary form (the lemma). "Studies" maps to "study," "better" maps to "good." Lemmatization is more accurate but slower and language-specific.
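The contrast can be illustrated with a toy example. The suffix rules below are a hypothetical three-rule stemmer, not the Porter algorithm, and `TOY_LEMMAS` is a hand-written stand-in for a real morphological dictionary; both are assumptions chosen to reproduce the examples in the text.

```python
# Hypothetical suffix rules, tried in order. NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

def toy_stem(word):
    """Strip the first matching suffix, then undouble a final consonant."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)] + replacement
            # "runn" -> "run"; doubled l, s, z and vowels are left alone.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

# Hand-written stand-in for a morphological dictionary, keyed by (word, POS).
TOY_LEMMAS = {
    ("running", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("studies", "NOUN"): "study",
    ("better", "ADJ"): "good",
}

def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercased word."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())

for w in ["running", "runs", "ran", "studies"]:
    print(w, "->", toy_stem(w))
# running -> run, runs -> run, ran -> ran, studies -> studi
print(toy_lemmatize("ran", "VERB"), toy_lemmatize("better", "ADJ"))
# run good
```

The lemmatizer needs the part of speech as input, which is one reason lemmatization is slower: the word has to be analyzed grammatically before it can be looked up.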
Bag-of-words (BoW)
The bag-of-words model represents a document as a vector of word counts, ignoring order and grammar entirely. A vocabulary of 50,000 words produces a 50,000-dimensional sparse vector for each document, with each dimension holding the count of that word. Despite discarding all sequential information, bag-of-words is a surprisingly effective baseline for document classification and topic detection.
The limitations are significant: two sentences with identical words in different orders produce identical BoW vectors. "The dog bit the man" and "The man bit the dog" are the same document under BoW.
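A short sketch makes the order-blindness concrete. It assumes lowercasing and whitespace splitting as the tokenizer; the function names are illustrative.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, discarding order (assumes whitespace tokens)."""
    return Counter(text.lower().split())

def to_vector(bag, vocabulary):
    """Lay the counts out along a fixed vocabulary order."""
    return [bag[word] for word in vocabulary]

a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")
vocabulary = sorted(set(a) | set(b))

print(vocabulary)                # ['bit', 'dog', 'man', 'the']
print(to_vector(a, vocabulary))  # [1, 1, 1, 2]
print(a == b)                    # True
```

With a realistic vocabulary, almost every entry of the vector would be zero, which is why BoW vectors are normally stored in a sparse format.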
TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) addresses a key weakness of raw word counts: common words like "the," "is," and "a" appear in nearly every document and carry almost no discriminative signal. TF-IDF weights each word's count (term frequency) by its inverse document frequency, usually the logarithm of the total number of documents divided by the number of documents that contain the word. Words that are frequent in a specific document but rare across the corpus get high TF-IDF scores, which marks them as the most informative terms.
TF-IDF-style weighting, often in the form of its close relative BM25, is still the default ranking signal in many production search engines. It is fast to compute, interpretable, and effective for keyword-based retrieval.
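The weighting can be sketched in a few lines. This version assumes one common textbook form, tf = count / document length and idf = log(N / document frequency), with whitespace tokenization; libraries typically use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document (unsmoothed textbook form)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
scores = tf_idf(docs)
print(scores[0]["the"])                      # 0.0, since "the" is in every document
print(scores[0]["mat"] > scores[0]["cat"])   # True, since "mat" is rarer than "cat"
```

A word that occurs in every document gets idf = log(1) = 0, so it drops out of the representation no matter how often it appears.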
N-grams
An n-gram is a contiguous sequence of n tokens. Unigrams are individual words. Bigrams are two-word sequences. Trigrams are three-word sequences. Adding bigrams and trigrams to a bag-of-words representation captures some local phrase structure. "New York" as a bigram carries different meaning than "new" and "York" as separate unigrams.
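Extracting n-grams is a sliding window over the token list. The sketch below assumes the text has already been tokenized; `ngrams` is an illustrative name.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i love new york".split()
print(ngrams(tokens, 1))  # [('i',), ('love',), ('new',), ('york',)]
print(ngrams(tokens, 2))  # [('i', 'love'), ('love', 'new'), ('new', 'york')]
print(ngrams(tokens, 3))  # [('i', 'love', 'new'), ('love', 'new', 'york')]
```

Treating each tuple as a vocabulary entry lets a bag-of-words model count ("new", "york") as its own feature, at the cost of a much larger vocabulary.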
Language models in the pre-neural era were n-gram language models: they estimated the probability of the next word given the previous n-1 words using count-based statistics from a corpus. Google's web-scale n-gram dataset (published in 2006) was a landmark resource that enabled statistical machine translation and spell correction at scale.
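For n = 2, the count-based estimate is P(word | previous) = count(previous, word) / count(previous). The sketch below is a maximum-likelihood bigram model on a three-sentence toy corpus; the `<s>` and `</s>` boundary markers and the absence of smoothing are simplifying assumptions, and real systems smoothed these counts to handle unseen word pairs.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probability(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

corpus = ["the dog barked", "the dog slept", "the cat slept"]
model = train_bigram_model(corpus)

print(next_word_probability(model, "the", "dog"))    # 2 of 3 continuations of "the"
print(next_word_probability(model, "dog", "slept"))  # 0.5
print(next_word_probability(model, "cat", "barked")) # 0.0 (never observed)
```

The zero for "cat barked" is the data-sparsity problem in miniature: a perfectly plausible pair gets probability zero simply because it never occurred in the corpus.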
Word embeddings: Word2Vec and GloVe
Classical BoW and TF-IDF vectors are sparse (mostly zeros) and carry no information about the relationship between words: "cat" and "kitten" are as different as "cat" and "justice" in a BoW representation.
Word2Vec (Mikolov et al., Google, 2013) trains a shallow neural network to predict a word from its context (the CBOW variant) or the context from a word (the skip-gram variant). The hidden layer weights become dense word vectors in which semantically similar words cluster together. The famous property: vector("king") - vector("man") + vector("woman") approximates vector("queen"). Word2Vec vectors typically have 100 to 300 dimensions, are trained once on a large corpus, and are reused across tasks.
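The analogy arithmetic can be demonstrated with hand-made vectors. The three-dimensional values in `TOY_VECTORS` are invented for illustration and are not trained Word2Vec output; real embeddings show the same effect only approximately and in far more dimensions.

```python
import math

# Invented toy vectors; the dimensions loosely mean (royalty, male, female).
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Following common practice, the three query words are excluded.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", TOY_VECTORS))  # queen
```

Cosine similarity is used instead of Euclidean distance because it compares direction only, which is the usual choice for word vectors.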
GloVe (Pennington et al., Stanford, 2014) takes a different approach: it trains on global word-word co-occurrence statistics across the entire corpus, combining the strengths of count-based methods (like LSA) with the predictive methods of Word2Vec. Both produce qualitatively similar vector spaces; the GloVe authors reported slightly better results on word analogy tasks, though the gap depends heavily on hyperparameters and training data.
The critical limitation of both: each word has a single vector regardless of context. "Bank" in "river bank" and "bank" in "bank account" share a single representation that conflates both meanings. This is exactly what contextual embeddings from transformers fix.
Named entity recognition (NER)
NER identifies and classifies named entities in text: people (PER), organizations (ORG), locations (LOC), dates, monetary values, and other categories. "Apple acquired Beats for $3 billion" should produce [Apple]ORG, [Beats]ORG, [$3 billion]MONEY.
Pre-transformer NER used sequence labeling models (CRF, BiLSTM-CRF) that assigned a label to each token based on its features and the labels of neighboring tokens. Transformer-based NER (using BERT or similar) dramatically improved accuracy by providing rich contextual representations.
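Sequence labelers usually emit one tag per token, and a decoding step turns those tags into entity spans. The sketch below assumes the common BIO convention (B- begins an entity, I- continues it, O is outside any entity), which the text above does not name; the tags are written by hand here rather than predicted by a model.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": close any open entity
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans

tokens = ["Apple", "acquired", "Beats", "for", "$", "3", "billion"]
tags = ["B-ORG", "O", "B-ORG", "O", "B-MONEY", "I-MONEY", "I-MONEY"]
print(bio_to_spans(tokens, tags))
# [('Apple', 'ORG'), ('Beats', 'ORG'), ('$ 3 billion', 'MONEY')]
```

The spans are rejoined with spaces, so "$3 billion" comes back as "$ 3 billion"; a production decoder would track character offsets to recover the original text exactly.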
Sentiment analysis
Sentiment analysis classifies the emotional polarity of text: positive, negative, or neutral. Rule-based approaches use word lists (VADER, LIWC). ML approaches train a classifier on labeled reviews or social media posts. Sentiment analysis is one of the most commercially deployed NLP tasks, used in customer feedback analysis, brand monitoring, and financial news processing.
The main challenges are irony, negation, and domain specificity. "Great, just what I needed, another bug" is negative. "Not bad" is positive. Domain shift is severe: a word like "sick" is negative in medical records and positive in skateboard culture.
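A toy word-list scorer shows both what simple negation handling buys and where it stops. The lexicon, the negator list, and the flip-the-next-word rule are all illustrative assumptions; this is not how VADER or LIWC are implemented.

```python
import re

# Hypothetical miniature lexicon: word -> polarity.
TOY_LEXICON = {"great": 1.0, "good": 1.0, "love": 1.0,
               "bad": -1.0, "terrible": -1.0, "hate": -1.0}
NEGATORS = {"not", "never", "no"}

def lexicon_sentiment(text):
    """Sum word polarities, flipping the word that follows a negator."""
    score = 0.0
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATORS:
            negate = True
            continue
        value = TOY_LEXICON.get(token, 0.0)
        score += -value if negate else value
        negate = False
    return score

print(lexicon_sentiment("Not bad"))                                 # 1.0
print(lexicon_sentiment("Great, just what I needed, another bug"))  # 1.0
```

The negation rule gets "Not bad" right, but the ironic sentence also scores as positive: nothing in a word list signals that "great" is being used sarcastically.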
Part-of-speech (POS) tagging
POS tagging labels each word in a sentence with its grammatical role: noun, verb, adjective, adverb, preposition, determiner, etc. "The quick brown fox jumps over the lazy dog" maps to DET ADJ ADJ NOUN VERB PREP DET ADJ NOUN. POS tags are used as features in downstream tasks including NER, dependency parsing, and machine translation. Pre-transformer taggers used Hidden Markov Models and Conditional Random Fields. Transformer-based taggers achieve over 97% accuracy on standard benchmarks.
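As a point of reference, the sketch below is a most-frequent-tag baseline, which is deliberately simpler than the HMM and CRF taggers mentioned above: it assigns each word the tag it carried most often in training and ignores context entirely. The tiny training set and the NOUN fallback for unknown words are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sentences):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, tokens, default_tag="NOUN"):
    """Tag each token independently; unknown words get default_tag."""
    return [(token, model.get(token.lower(), default_tag)) for token in tokens]

training = [
    [("The", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB")],
    [("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")],
]
model = train_baseline_tagger(training)
print(tag(model, ["The", "lazy", "fox", "jumps"]))
# [('The', 'DET'), ('lazy', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB')]
```

Because every word is tagged in isolation, this baseline cannot tell the noun "jumps" from the verb "jumps"; sequence models such as HMMs and CRFs exist to use the neighboring tags for exactly that decision.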
From RNNs and LSTMs to transformers
The dominant NLP architecture from roughly 2015 to 2017 was the Recurrent Neural Network (RNN) and its gated variant, Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber, 1997). An RNN processes tokens sequentially, maintaining a hidden state that is updated at each step. LSTMs add gating mechanisms (input and output gates in the original design, with the now-standard forget gate introduced in later work) to control what information flows through the hidden state, enabling the model to remember relevant information over longer sequences.
RNNs had two critical limitations. First, the hidden state is a fixed-size vector that must compress all prior context, making it hard to retain information from far back in a long sequence. Second, sequential processing means each timestep must wait for the previous one, making parallelization across sequence positions impossible and training on long texts slow.
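Both limitations are visible in a bare-bones recurrent update, h_t = tanh(W_x x_t + W_h h_(t-1) + b). The sketch below is a plain (ungated) RNN cell in pure Python with made-up weights; it is meant to show the data flow, not to be trained.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent update: h = tanh(W_x x + W_h h_prev + b)."""
    return [
        math.tanh(
            sum(W_x[i][j] * x[j] for j in range(len(x)))
            + sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
            + b[i]
        )
        for i in range(len(b))
    ]

def run_rnn(inputs, W_x, W_h, b):
    """Process a sequence one step at a time, carrying the hidden state."""
    h = [0.0] * len(b)
    for x in inputs:            # each step needs the previous step's h
        h = rnn_step(x, h, W_x, W_h, b)
    return h                    # always len(b) numbers, however long the input

# Made-up weights: input size 3, hidden size 2.
W_x = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_h = [[0.1, 0.4], [-0.3, 0.2]]
b = [0.0, 0.0]

sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
final_state = run_rnn(sequence, W_x, W_h, b)
print(len(final_state))  # 2
```

The loop in `run_rnn` cannot be parallelized across positions because each iteration consumes the previous `h`, and the returned state has the same two entries whether the sequence has three tokens or three thousand.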
The transformer (Vaswani et al., "Attention Is All You Need," 2017) replaced recurrence with self-attention: every token can directly attend to every other token in the sequence in a single layer, with no fixed-size compression bottleneck and no sequential dependency between positions. This allowed full parallelization across sequence positions during training and made long-range dependencies far easier to model. Within a few years, most of the NLP stack described above had become a historical baseline rather than the state of the art, although some components, such as TF-IDF-style retrieval, remain in production use.
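The core operation can be sketched as scaled dot-product attention, softmax(Q K^T / sqrt(d)) V. To stay short, the version below is a simplification: it uses the input vectors directly as queries, keys, and values (no learned projection matrices), a single head, and no positional information.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (simplified)."""
    d = len(X[0])
    outputs, weights = [], []
    for query in X:  # each position is independent of the others
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in X
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append(
            [sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)]
        )
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X)
for row in weights:
    print([round(value, 3) for value in row])
```

Each row of `weights` sums to 1 and spans every position in the sequence, so the first token can draw on the last one directly. No iteration of the outer loop depends on another, which is what makes the computation parallelizable.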