03. How It Works
It states false things with full confidence (hallucination)
A model can invent facts, quotes, and citations and present them as fact. As of 2026, hallucination rates for frontier models on mixed tasks run roughly 3 to 20 percent. Newer models hallucinate less, but none reach zero, because the behavior is inherent to next-token prediction. OpenAI's research (Kalai et al., 2025) argues these errors persist because training and evaluation reward a confident guess over an admission of uncertainty. A model also cannot reliably correct its own reasoning without outside help (Huang et al., ICLR 2024), so a confident tone is not evidence of a correct answer.
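The incentive argument above can be made concrete with a little arithmetic. The sketch below is an illustration only: the scoring rule (1 point for a right answer, a configurable penalty for a wrong one, 0 for abstaining) and the function names are assumptions made here, not the scheme of any particular benchmark.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering when the answer is right with probability p_correct.

    Assumed rule: a right answer earns 1 point, a wrong answer loses
    wrong_penalty points, and abstaining ("I don't know") scores 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float) -> bool:
    """Guessing beats abstaining whenever its expected score is above 0."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# Accuracy-only grading: wrong answers cost nothing, so even a 10% shot pays.
print(should_guess(0.10, wrong_penalty=0.0))  # True
# Penalized grading: a wrong answer costs 1 point, so a 10% shot does not pay.
print(should_guess(0.10, wrong_penalty=1.0))  # False
```

Under the first rule a guess never scores worse than abstaining, which is the pressure toward confident answers that the paragraph describes.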
See Hallucination, Grounding, and Guardrails.
It matches patterns, it does not understand
The model has no grounded world model behind the words. Apple's GSM-Symbolic study (Mirzadeh et al., ICLR 2025) rebuilt grade-school math problems from templates. Changing only the numbers lowered accuracy, and adding one irrelevant clause dropped it by up to 65 percent, so the authors conclude that the models replicate training patterns rather than reason. In the Reversal Curse (Berglund et al., ICLR 2024), a model taught "A is B" often cannot answer "B is A" unless the fact is in the prompt. Even models built to reason collapse past a complexity threshold (Shojaee et al., 2025), though a rebuttal blames part of that on output limits and unsolvable test cases, so the fair reading is that reasoning is brittle, not absent. The same gap appears on novelty tests like ARC-AGI-2, which ordinary people find easy but on which frontier models score far below them.
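To show what "rebuilt from templates" means in practice, here is a minimal sketch of a templated word problem with an optional irrelevant clause. The template text, names, and number ranges are invented for illustration and are not taken from the study.

```python
import random

# Hypothetical template pieces, invented for this example.
NAMES = ["Ava", "Ben", "Chen", "Dara"]
STATEMENT = "{name} has {a} apples and buys {b} more."
DISTRACTOR = " Five of the apples are slightly smaller than average."
QUESTION = " How many apples does {name} have now?"


def make_variant(rng: random.Random, add_distractor: bool = False) -> tuple[str, int]:
    """Fill the template with fresh values and return (problem text, correct answer)."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    text = STATEMENT.format(name=name, a=a, b=b)
    if add_distractor:
        # The extra clause mentions a number but does not change the answer.
        text += DISTRACTOR
    text += QUESTION.format(name=name)
    return text, a + b


rng = random.Random(0)
for _ in range(2):
    print(make_variant(rng))
print(make_variant(rng, add_distractor=True))
```

Every variant has the same underlying arithmetic, so a drop in accuracy across variants points to sensitivity to surface details rather than to a harder problem.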
See Reasoning Models and Test-Time Compute.
Small prompt changes swing the output (brittleness)
Cosmetic changes a person would treat as identical can swing results widely. Sclar et al. (ICLR 2024) tested meaning-preserving formatting, such as different separators, spacing, and casing, and saw accuracy spreads up to 76 points on LLaMA-2-13B and 56 points on GPT-3.5.
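A rough sketch of how such a test can be set up: render the same field in several meaning-preserving formats, score each format separately, and report the gap between the best and worst. The separator and casing choices and the helper names below are assumptions for illustration, and the accuracy figures in the usage lines are placeholders, not measurements.

```python
from itertools import product

# Hypothetical formatting choices; each leaves the meaning unchanged.
SEPARATORS = [": ", " - ", ":\n"]
CASINGS = [str.lower, str.upper, str.title]


def format_variants(field: str, value: str) -> list[str]:
    """Render one field/value pair in every separator and casing combination."""
    return [case(field) + sep + value for sep, case in product(SEPARATORS, CASINGS)]


def accuracy_spread(accuracy_by_format: dict[str, float]) -> float:
    """Spread = best accuracy minus worst accuracy across formats, in points."""
    scores = accuracy_by_format.values()
    return max(scores) - min(scores)


variants = format_variants("Question", "What is 2 + 2?")
print(len(variants))  # 9

# Placeholder scores, one per format, only to show the calculation.
print(accuracy_spread({"fmt_a": 40.0, "fmt_b": 75.0, "fmt_c": 52.0}))  # 35.0
```

Reporting the spread alongside the average makes a single lucky or unlucky prompt format harder to mistake for the model's real ability.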
See Prompting Basics.
It is weak at exact math, counting, and letters
Models read text as sub-word tokens, not individual digits and letters, so exact counting and spelling are blind spots. The familiar case is "how many R's in strawberry," which trips models because they see the word as a few tokens, not ten separate characters they can count. A calculator or code tool fixes it.
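For example, the counting question becomes trivial once it is handed to code, which works on characters rather than tokens. The helper name below is made up for this sketch.

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a single letter in a word, ignoring case."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "R"))  # 3
```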
See Tokens and Tokenization.
It has no live knowledge past a fixed cutoff
A model's weights freeze when training ends, so it has a knowledge cutoff after which it knows nothing unless a tool fetches it. Anthropic, for example, states that Claude "may not be aware of events" after its training cutoff. Ask about yesterday's news with no search tool and it will guess rather than report fact.
See Training vs. Inference and Retrieval-Augmented Generation (RAG).
Capability is uneven and the boundary is invisible (the jagged frontier)
The jagged frontier is the headline idea. A model can ace a very hard task and flunk an easy-looking one beside it, and the boundary is invisible from the inside. In a Harvard and BCG field experiment (Dell'Acqua et al., 2023), 758 consultants using GPT-4 on in-frontier tasks finished 12.2 percent more work, about 25 percent faster, with roughly 40 percent higher quality. On a task just outside the frontier, the same people were correct 84 percent of the time alone but only 60 to 70 percent with AI, pulled off course by a confident wrong answer.
See How to Use an LLM.
It reflects and can amplify bias in its training data
Trained on human text and images, models absorb and can amplify the social patterns in that data. A 2024 UNESCO study reported that GPT-2, GPT-3.5, and Llama 2 tied female names to words like "home" and "family" and male names to "career" and "executive," and placed women in domestic story roles about four times as often as men.
See AI Ethics, Bias, and Fairness.
It has no memory by default
Each request is processed on its own, and the model does not learn from your conversations. What feels like memory is the app re-injecting earlier turns or saved notes as tokens into each request. Remove that and the model is blank again, with no recall of you between sessions.
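The re-injection described above can be sketched in a few lines. Everything here is a toy: `stateless_model` is a hypothetical stand-in for a real model call, and `ChatApp` is an assumed, simplified version of what a chat application does.

```python
def stateless_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call: it sees only the text passed in."""
    if "my name is Sam" in prompt:
        return "Your name is Sam."
    return "I don't know your name."


class ChatApp:
    """The app, not the model, keeps the transcript and resends it every turn."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def send(self, user_message: str) -> str:
        self.history.append(f"User: {user_message}")
        prompt = "\n".join(self.history)  # earlier turns re-injected as plain text
        reply = stateless_model(prompt)
        self.history.append(f"Assistant: {reply}")
        return reply


app = ChatApp()
app.send("Hi, my name is Sam.")
print(app.send("What is my name?"))               # Your name is Sam.
print(stateless_model("User: What is my name?"))  # I don't know your name.
```

The second call goes to the same function without the saved transcript, and the "memory" is gone.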
See Context Window.
It can read only so much, and misses the middle (context limits)
A model can take in only so much text at once (its context window), and it does not attend evenly across a long input. "Lost in the Middle" (Liu et al., TACL 2024) found that accuracy is best when the needed fact sits at the start or end of the input and drops when it is buried in the middle, even for long-context models.
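One workaround sometimes applied to this finding is to place the most relevant passages at the two ends of the prompt and the least relevant in the middle. The sketch below assumes the passages arrive already ranked from most to least relevant; the function name is made up, and whether the reordering helps a given model has to be measured.

```python
def edges_first(passages_by_relevance: list[str]) -> list[str]:
    """Reorder ranked passages so the most relevant sit at the two ends.

    Ranks 1, 3, 5, ... fill the front in order; ranks 2, 4, 6, ... fill the
    back in reverse, which leaves the least relevant passages in the middle.
    """
    front = passages_by_relevance[0::2]
    back = passages_by_relevance[1::2]
    return front + back[::-1]


print(edges_first(["p1", "p2", "p3", "p4", "p5"]))
# ['p1', 'p3', 'p5', 'p4', 'p2']
```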
See Context Window.
The same input can give different output (nondeterminism)
Identical settings do not guarantee identical answers. Thinking Machines Lab reported that the same prompt sent 1,000 times at temperature 0, the setting meant to remove randomness, produced about 80 distinct outputs, an effect of how requests are batched on the hardware. Reproducibility must be engineered; it is not free.
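A simple way to check this on your own setup is to send one prompt many times and tally the distinct outputs. In the sketch below, `generate` is a placeholder for whatever function calls your model; the fake version exists only so the example runs on its own and does not model real hardware behavior.

```python
from collections import Counter
from typing import Callable


def count_distinct_outputs(generate: Callable[[str], str], prompt: str, runs: int) -> Counter:
    """Call generate(prompt) `runs` times and tally each distinct output."""
    return Counter(generate(prompt) for _ in range(runs))


def make_fake_generate() -> Callable[[str], str]:
    """Hypothetical stand-in for a model call that occasionally drifts."""
    calls = 0

    def generate(prompt: str) -> str:
        nonlocal calls
        calls += 1
        # Every fourth call returns a slightly different completion.
        return "answer B" if calls % 4 == 0 else "answer A"

    return generate


tally = count_distinct_outputs(make_fake_generate(), "same prompt", runs=8)
print(len(tally), tally.most_common(1))  # 2 [('answer A', 6)]
```

A tally with more than one key means the pipeline is not reproducible as configured, whatever the temperature setting says.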
See Temperature and Sampling.