03. How It Works
The basic idea
- Install a runner program (Ollama or LM Studio).
- Download a model. Models come as quantized files (compressed to run on normal hardware, explained below). A small model is a few gigabytes.
- The program loads the model into your computer's memory and gives you a chat box.
- You type, the model answers, all on your machine.
The first time, downloading and loading takes a while. After that, the model is ready whenever you open the program.
Ollama
Ollama is a free, open-source tool for downloading and running open models. It works on Windows, macOS, and Linux. On Windows you download an installer and run it like any normal program. On macOS you drag the app into your Applications folder.
Since mid-2025 Ollama ships with a proper desktop app with a chat window, so you no longer need the terminal. You pick a model, it downloads, and you chat. You can drag in a file or PDF to ask questions about it, and send images to vision-capable models. People who do use the command line type ollama run llama3.2 to download and start a model in one step.
Ollama also runs a small local server that speaks the same "language" as the OpenAI API. In plain terms, other AI apps and tools can be pointed at your computer instead of at a company, which lets you use a nicer chat interface or other software on top of your local model.
LM Studio
LM Studio is a free desktop app built around a graphical interface, which makes it especially friendly for non-coders. It runs on Windows, macOS (Apple Silicon), and Linux. There is no terminal step at all.
Inside the app, a "Discover" tab is connected to Hugging Face (the main library of open models). You search for a model by name, click Download, wait, then load it from the sidebar and chat. LM Studio also has a developer mode that starts a local server, so, like Ollama, it can sit underneath other tools.
What "quantized" and GGUF mean (in plain terms)
A model is a huge list of numbers. Stored at full quality, those numbers are large and need a lot of memory. Quantization shrinks each number to fewer digits of precision. The result is a much smaller file that needs far less memory, with only a small loss of quality. This is what makes it possible to run a useful model on an ordinary computer instead of a data centre.
You will see labels like Q4, Q8, and Q4_K_M:
- Q8 keeps about 8 bits per number. Larger file, closest to the original quality.
- Q4 keeps about 4 bits per number. Roughly half the size of Q8, with a small quality drop.
- Q4_K_M is the most recommended everyday choice. It is a 4-bit version that keeps the most important parts of the model at higher quality. A typical 7-to-8-billion-parameter model at Q4_K_M is around 4 to 5 gigabytes and keeps most of its full-quality ability.
GGUF is simply the file format these quantized models come in, a single file holding the model and everything needed to run it. When downloading a model for Ollama or LM Studio, you are usually downloading a GGUF file at a chosen quantization.
For the deeper version of this topic, see quantization.
Which models can a normal person run
Open models come in sizes measured in billions of parameters (the "B" number). As a rough guide for a normal PC or Mac:
- Small (about 1B to 8B): easy. Runs on typical consumer hardware. This is where most people live. Examples: Llama 3.x 8B, Gemma 2/3, Qwen 8B, Phi, and the smaller DeepSeek-R1 distilled models.
- Mid-size (about 13B to 34B): needs a strong machine with plenty of memory.
- Large (70B and up): hard. Generally needs 48 gigabytes of memory or more, which means a high-end Mac or a multi-GPU setup, not a normal laptop.
OpenAI's open gpt-oss models exist too: gpt-oss-20b is designed for capable local machines (about 16 GB of memory), while gpt-oss-120b targets a single high-end 80 GB GPU.
How big a model your specific computer can handle is covered in hardware-and-performance.