Limitations and Failure Modes of AI

In Short

AI language models are powerful but unevenly reliable. Because they predict likely text rather than looking up truth, they can state falsehoods confidently, miss exact math and counting, break on small prompt changes, and fail on problems unlike their training data. By default they have no memory and no knowledge past a fixed cutoff, and cannot reliably tell when they are wrong. Treat AI as a fast, fallible assistant whose work you verify, not an oracle.

01. What It Is

This file maps what today's AI, especially the large language models most people use, cannot reliably do and how it fails. These limits come from how current models work, so they age slowly even as benchmark scores move. It is not a takedown. The same systems also deliver large, measured gains inside their competence.
For how an LLM differs from other AI, see AI, ML, Deep Learning, LLMs, and Algorithms: The Differences.

02. Why It Matters

The failures are not random. They follow from how the technology works, so they are predictable. Most share one root cause. A language model predicts the next likely word-piece from patterns, it does not look facts up or run a proof, so fluent output reflects style, not accuracy. Knowing the edges captures the gains while avoiding harms like a fabricated citation. Verify checkable claims against an outside source, and for math or current facts give the model a tool.
For the mechanism, see What Is a Large Language Model?.

03. How It Works

It states false things with full confidence (hallucination)

A model can invent facts, quotes, and citations and present them as fact. As of 2026, hallucination rates for frontier models on mixed tasks run roughly 3 to 20 percent, and newer models hallucinate less but none reach zero, because the behavior is inherent to next-token prediction. OpenAI's research (Kalai et al., 2025) argues these errors persist because training and evaluation reward a confident guess over uncertainty. A model also cannot reliably correct its own reasoning without outside help (Huang et al., ICLR 2024), so confidence and correctness are unrelated.
See Hallucination, Grounding, and Guardrails.

It matches patterns, it does not understand

The model has no grounded world model behind the words. Apple's GSM-Symbolic study (Mirzadeh et al., ICLR 2025) rebuilt grade-school math problems from templates. Changing only the numbers lowered accuracy, and adding one irrelevant clause dropped it by up to 65 percent, so the authors conclude the models replicate training patterns rather than reasoning. In the Reversal Curse (Berglund et al., ICLR 2024), a model taught "A is B" often cannot answer "B is A" unless the fact is in the prompt. Even models built to reason collapse past a complexity threshold (Shojaee et al., 2025), though a rebuttal blames part of that on output limits and unsolvable test cases, so the fair reading is brittle reasoning, not none. The same edge appears on novelty tests like ARC-AGI-2, easy for ordinary people yet far above frontier scores.
See Reasoning Models and Test-Time Compute.

Small prompt changes swing the output (brittleness)

Cosmetic changes a person would treat as identical can swing results widely. Sclar et al. (ICLR 2024) tested meaning-preserving formatting, such as different separators, spacing, and casing, and saw accuracy spreads up to 76 points on LLaMA-2-13B and 56 points on GPT-3.5.
See Prompting Basics.

It is weak at exact math, counting, and letters

Models read text as sub-word tokens, not individual digits and letters, so exact counting and spelling are blind spots. The familiar case is "how many R's in strawberry," which trips models because the word is a few tokens, not nine separate characters they can count. A calculator or code tool fixes it.
See Tokens and Tokenization.

It has no live knowledge past a fixed cutoff

A model's weights freeze when training ends, so it has a knowledge cutoff after which it knows nothing unless a tool fetches it. Anthropic, for example, states that Claude "may not be aware of events" after its training cutoff. Ask about yesterday's news with no search tool and it will guess rather than report fact.
See Training vs. Inference and Retrieval-Augmented Generation (RAG).

Capability is uneven and the boundary is invisible (the jagged frontier)

The jagged frontier is the headline idea. A model can ace a very hard task and flunk an easy-looking one beside it, and the boundary is invisible from the inside. In a Harvard and BCG field experiment (Dell'Acqua et al., 2023), 758 consultants using GPT-4 on in-frontier tasks finished 12.2 percent more work, about 25 percent faster, with roughly 40 percent higher quality. On a task just outside the frontier, the same people were correct 84 percent of the time alone but only 60 to 70 percent with AI, pulled off course by a confident wrong answer.
See How to Use an LLM.

It reflects and can amplify bias in its training data

Trained on human text and images, models absorb and can amplify the social patterns in that data. A 2024 UNESCO study reported that GPT-2, GPT-3.5, and Llama 2 tied female names to words like "home" and "family" and male names to "career" and "executive," and placed women in domestic story roles about four times as often as men.
See AI Ethics, Bias, and Fairness.

It has no memory by default

Each request is processed on its own, and the model does not learn from your conversations. What feels like memory is the app re-injecting earlier turns or saved notes as tokens into each request. Remove that and the model is blank again, with no recall of you between sessions.
See Context Window.

It can read only so much, and misses the middle (context limits)

A model can take in only so much text at once, its context window, and it does not attend evenly across a long input. "Lost in the Middle" (Liu et al., TACL 2024) found accuracy is best when the needed fact sits at the start or end and drops when it is buried in the middle, even for long-context models.
See Context Window.

The same input can give different output (nondeterminism)

Identical settings do not guarantee identical answers. Thinking Machines Lab reported that the same prompt sent 1,000 times at temperature 0, the setting meant to remove randomness, produced about 80 distinct outputs, an effect of how requests are batched on the hardware. Reproducibility must be engineered, it is not free.
See Temperature and Sampling.

04. Key Terms

Term	Plain meaning
Hallucination	Stating something false as if it were fact, including invented quotes and citations.
Next-token prediction	Generating text one likely word-piece at a time from patterns rather than retrieving facts.
Jagged frontier	The uneven, invisible boundary of AI skill. Success on one task does not predict the next.
Knowledge cutoff	The date after which the model has no built-in knowledge, so anything newer is a guess.
Stateless / no memory	The model holds nothing between requests. Apparent memory is the app re-sending earlier text.
Out-of-distribution	A problem unlike anything in training data. Models are strong on the familiar, weak on the new.
Nondeterminism	Different outputs from the same input and settings, so reproducibility must be engineered.

05. Examples

Counting letters:
Asked how many R's are in "strawberry," models often miss, seeing the word as a few tokens, not nine separate letters.

Confident and wrong:
Asked for one researcher's dissertation title, a chatbot gave three different answers, all wrong, and three wrong birthdays (Kalai et al., 2025).

Same input, different output:
At temperature 0, the same prompt run 1,000 times gave about 80 distinct answers (Thinking Machines Lab, 2025).

06. Common Misconceptions

"Hallucination is a bug the next model will fix."
It is inherent to next-token prediction, and the rate never reaches zero because training still rewards confident guessing over saying "I don't know."

"It can do my taxes, so it can count the letters in a word."
Capability is jagged, and a model can handle a sophisticated task yet miss a trivial one because it reads sub-word tokens, not letters.

"Just ask it if it's sure, and it will catch the mistake."
Self-checking without outside information often fails or makes the answer worse, so real verification means an independent source or tool.

"It remembers our past chats and learns from me."
By default it does not. The weights are frozen, and memory features merely store and replay your text.

"It knows what is happening in the world right now."
Only up to its cutoff and only if a search tool is attached, otherwise it returns a plausible guess rather than current fact.