Hardware and Performance for Local LLMs

In Short

Whether a model runs on your computer comes down to one thing: memory. On a Windows or Linux PC that means the graphics card's VRAM. On an Apple Silicon Mac it means the shared "unified memory." The whole model has to fit. As a rough guide at the common 4-bit quantization, an 8-billion-parameter model needs about 4 to 6 GB, a 14B needs about 8 to 10 GB, a 34B about 20 GB, and a 70B about 40 GB or more. Speed is measured in tokens (word-pieces) per second. A small model on a decent graphics card or a modern Mac runs faster than you can read. The same model on a laptop with no graphics card works but is slow.

Snapshot caveat:
Specific graphics cards, Mac chips, and memory figures reflect mid-2026 hardware. The memory rules of thumb are stable, the example hardware will date. Treat all numbers as approximate ballparks.

01. What It Is

This file answers the practical question behind running-llms-locally: what computer do I need, and how fast will it be? Two ideas cover almost everything. First, memory decides whether a model can run at all. Second, tokens per second measures how fast it answers. Get the memory right and a model runs. Get enough speed and it feels responsive instead of sluggish.

02. Why It Matters

People waste money in both directions. Some assume they need an expensive workstation to run any local AI, when a mid-range gaming PC or a 16 GB Mac runs a capable model comfortably. Others try to load a 70-billion-parameter model onto a laptop that cannot hold it and conclude local AI "does not work," when the real problem was a mismatch between model size and memory. Knowing the memory rule and the speed expectation lets you pick a model your machine can actually run, and tells you what to buy if you want to run something bigger.

03. How It Works

Memory is the main constraint

A model is a list of numbers called parameters, counted in billions (a "7B" model has about 7 billion). To run, the whole model must load into fast memory. On a PC that fast memory is the VRAM on the graphics card. On an Apple Silicon Mac there is no separate graphics card, the processor and graphics share one unified memory pool, and a large share of it can go to the model. This is why a 64 GB Mac can run models that no single consumer graphics card can hold.

How much memory a model needs is its parameter count times how many bits each parameter uses. Full quality is about 16 bits (2 bytes) per parameter. Quantization to 4 bits (about half a byte) cuts that by roughly three-quarters with only a small quality loss, which is why local models are almost always run quantized.
See running-llms-locally and quantization.

A quick mental shortcut at 4-bit: the memory in gigabytes is roughly the model's billions-of-parameters number divided by about two, then add a few gigabytes of headroom for the program, the operating system, and the conversation history.

Rule of thumb: memory by model size (4-bit)

These are the loaded-model ballparks. Leave a few gigabytes of headroom on top.

Model size	Approx. memory at 4-bit	Comfortable on
7B to 8B	about 4 to 6 GB	an 8 GB graphics card, or a 16 GB Mac
13B to 14B	about 8 to 10 GB	a 12 GB graphics card
30B to 34B	about 18 to 22 GB	a 24 GB graphics card, or a 32 GB+ Mac
70B	about 35 to 40+ GB	a 64 GB Mac, or a multi-GPU setup

For comparison, a 70B model at full 16-bit quality would need about 140 GB, versus about 38 GB at 4-bit. That gap is exactly why quantization matters.

What typical hardware can run

An ordinary laptop with no dedicated graphics card:
It works, using the main processor and system RAM, but it is slow. Small models (7B to 8B at 4-bit) are usable for tasks where you do not need an instant reply. Expect single-digit speed.

A gaming PC with an NVIDIA graphics card, by VRAM:

8 GB (entry level): runs 7B to 8B models comfortably. The practical starting point.
12 GB:
all 7B models at higher quality, and most 13B models.
16 GB:
comfortable for 13B to 14B, reaching into the low 20-billions.
24 GB (high end): runs 30B to 34B models with room for context. A 70B only by spilling into system RAM, which slows it down sharply.

Apple Silicon Macs, by unified memory:

16 GB:
handles 7B to 8B models. The practical minimum.
32 GB:
comfortable for 14B to 32B models.
64 GB and up:
can run 70B models, which a single consumer graphics card cannot. Only part of total unified memory is usable for the model (about 70 to 75 percent on 64 GB and larger Macs, closer to two-thirds on smaller ones), the rest goes to the system.

Performance: tokens per second

Speed is measured in tokens per second. A token is a chunk of text, roughly three-quarters of a word, so tokens per second is close to a slightly inflated words per second. For reference, comfortable human reading is about 250 words a minute, which is only around 6 tokens per second. Anything above roughly 10 tokens per second already outpaces reading.

Small model on a modern graphics card:
very fast. An 8B model at 4-bit can run roughly 95 to 120 tokens per second on a top consumer card, and entry and mid-range cards still comfortably exceed reading speed, often 30 to 80+ tokens per second.
CPU only (no graphics card):
much slower, typically a few to around 15 tokens per second for a 7B model, often around ten times slower than a good graphics card.
Bigger models are slower on the same hardware, because every single token generated requires reading more parameters from memory.

Apple Silicon speed is governed by memory bandwidth rather than raw graphics power, which is why Macs generate text quickly for small models despite having no separate graphics card.

04. Key Terms

Term	Plain meaning
VRAM	The fast memory on a graphics card. The key number for running models on a PC.
Unified memory	On Apple Silicon Macs, one memory pool shared by the processor and graphics. A large share can run a model.
System RAM	The computer's main memory. Usable for models on the CPU, but much slower than VRAM.
Parameter count (the "B" number)	The model's size in billions of parameters. Bigger means more memory and slower.
Quantization	Compressing the model to fewer bits per parameter so it needs less memory. 4-bit is the common default.
Tokens per second	The speed of generation. Above ~10 already feels faster than reading.
Time to first token	The pause before the first word appears, while the model reads your prompt.
Headroom	Spare memory for the program, the operating system, and the conversation, on top of the model itself.

05. Examples

A 16 GB MacBook:
Runs an 8B model comfortably and faster than you read. A solid, common starting setup.
A gaming PC with a 12 GB graphics card:
Runs 7B models at high quality and most 13B models. A good value sweet spot.
A 24 GB graphics card:
Runs 30B to 34B models well. Near the ceiling for a single consumer card.
A 64 GB Mac:
Can run a 70B model, something no single consumer graphics card can do, thanks to unified memory.
An old laptop with no graphics card:
Can still run a small model on the processor, slowly. Fine for patient, occasional use.

06. Common Pitfalls / Misconceptions

"VRAM and system RAM are the same."
They are not. VRAM is the fast memory on the graphics card and is what matters for fast local AI. System RAM is much slower for this. A PC with 64 GB of system RAM but an 8 GB graphics card is still an "8 GB" machine for GPU inference. On a Mac, unified memory plays both roles.

"More parameters is simply better, so buy the biggest model."
A model too big for your memory will not run, or will run painfully slowly by spilling into slower memory. Match the model to the hardware first.

"Running on the processor is just a slightly slower option."
It is often around ten times slower. Fine for occasional, non-urgent use, frustrating for back-and-forth chat. A graphics card or Apple Silicon makes the real difference.

"It will be slow like the old days."
A correctly sized small model on modern hardware runs faster than you can read. Slowness almost always means the model was too large for the memory, or it is running on the processor alone.

"I can run the same giant models the big companies use."
You cannot. Frontier models have hundreds of billions of parameters and need hundreds of gigabytes of memory. Local models top out in the tens of billions for most people, and around 70B on a high-end Mac.

"Laptops handle this for free."
Sustained local AI pushes the chip hard, which drains battery quickly and produces heat and fan noise. It is gentler on a desktop or a plugged-in machine.