Running LLMs Locally

In Short

You can run a capable AI model on your own computer, for free and fully offline, using a program like Ollama or LM Studio. You install the program, download an open model (a file of several gigabytes), and start chatting. Nothing you type leaves your machine. The catch is that local models are smaller and weaker than the frontier models behind ChatGPT, and they need a reasonably capable computer to run at a comfortable speed. For privacy, offline use, and tinkering, it is a genuine option that has become much easier in the last year.

01. What It Is

Running an LLM locally means the model lives on your hard drive and runs on your own processor or graphics card, with no company server involved. Once the model is downloaded, it works with the internet switched off. This is only possible with open-weight models (models whose trained internals are released publicly so anyone can download them), such as Meta's Llama, Google's Gemma, Alibaba's Qwen, Mistral, Microsoft's Phi, DeepSeek, and OpenAI's open gpt-oss models. The frontier models behind the big chat apps are not available this way.

Two free programs do almost all the work for an ordinary user: Ollama and LM Studio. Both handle downloading models, loading them into memory, and giving you a chat window.

02. Why It Matters

Local models trade capability for control. What you get in return for accepting a smaller model is real: your prompts and documents never leave your computer, there is no monthly fee, it works offline on a plane or in a secure facility, and no company can change or remove the model from under you. For anyone handling sensitive data, anyone who wants to learn how these systems work hands-on, and anyone who simply does not want a subscription, local is the path. It used to require comfort with a command line. As of 2026 it does not.

03. How It Works

The basic idea

Install a runner program (Ollama or LM Studio).
Download a model. Models come as quantized files (compressed to run on normal hardware, explained below). A small model is a few gigabytes.
The program loads the model into your computer's memory and gives you a chat box.
You type, the model answers, all on your machine.

The first time, downloading and loading takes a while. After that, the model is ready whenever you open the program.

Ollama

Ollama is a free, open-source tool for downloading and running open models. It works on Windows, macOS, and Linux. On Windows you download an installer and run it like any normal program. On macOS you drag the app into your Applications folder.

Since mid-2025 Ollama ships with a proper desktop app with a chat window, so for everyday chatting you no longer have to use the terminal (the command-line version is still available). You pick a model, it downloads, and you chat. You can drag in a file or PDF to ask questions about it, and send images to vision-capable models. People who do use the command line type ollama run llama3.2 to download and start a model in one step.

Ollama also runs a small local server that speaks the same "language" as the OpenAI API. In plain terms, other AI apps and tools can be pointed at your computer instead of at a company, which lets you use a nicer chat interface or other software on top of your local model.

LM Studio

LM Studio is a free desktop app built around a graphical interface, which makes it especially friendly for non-coders. It runs on Windows, macOS (Apple Silicon), and Linux. There is no terminal step at all.

Inside the app, a "Discover" tab is connected to Hugging Face (the main library of open models). You search for a model by name, click Download, wait, then load it from the sidebar and chat. LM Studio also has a developer mode that starts a local server, so, like Ollama, it can sit underneath other tools.

What "quantized" and GGUF mean (in plain terms)

A model is a huge list of numbers. Stored at full quality, those numbers are large and need a lot of memory. Quantization shrinks each number to fewer digits of precision. The result is a much smaller file that needs far less memory, with only a small loss of quality. This is what makes it possible to run a useful model on an ordinary computer instead of a data centre.

You will see labels like Q4, Q8, and Q4_K_M:

Q8 keeps about 8 bits per number. Larger file, closest to the original quality.
Q4 keeps about 4 bits per number. Roughly half the size of Q8, with a small quality drop.
Q4_K_M is the most recommended everyday choice. It is a 4-bit version that keeps the most important parts of the model at higher quality. A typical 7-to-8-billion-parameter model at Q4_K_M is around 4 to 5 gigabytes and keeps most of its full-quality ability.

GGUF is simply the file format these quantized models come in, a single file holding the model and everything needed to run it. When downloading a model for Ollama or LM Studio, you are usually downloading a GGUF file at a chosen quantization.
For the deeper version of this topic, see quantization.

Which models can a normal person run

Open models come in sizes measured in billions of parameters (the "B" number). As a rough guide for a normal PC or Mac:

Small (about 1B to 8B):
easy. Runs on typical consumer hardware. This is where most people live. Examples: Llama 3.x 8B, Gemma 2/3, Qwen 8B, Phi, and the smaller DeepSeek-R1 distilled models.
Mid-size (about 13B to 34B):
needs a strong machine with plenty of memory.
Large (70B and up):
hard. Generally needs 48 gigabytes of memory or more, which means a high-end Mac or a multi-GPU setup, not a normal laptop.

OpenAI's open gpt-oss models exist too: gpt-oss-20b is designed for capable local machines (about 16 GB of memory), while gpt-oss-120b targets a single high-end 80 GB GPU.
How big a model your specific computer can handle is covered in hardware-and-performance.

04. Key Terms

Term	Plain meaning
Ollama	A free tool to download and run open models locally, now with a desktop chat app.
LM Studio	A free, graphical desktop app for downloading and chatting with local models. No terminal needed.
Open-weight model	A model whose internals are public, so you can download and run it yourself.
Quantization	Shrinking a model's numbers to fewer bits so it needs less memory, with a small quality cost.
GGUF	The single-file format that local models come in, used by Ollama and LM Studio.
Q4_K_M	A 4-bit quantization that balances size and quality. The common default for everyday local use.
Hugging Face	The main online library of open models, which LM Studio searches directly.
Local server	A feature that lets other apps connect to your local model, so you can use a better interface on top of it.

05. Examples

The simplest start:
Install Ollama, open the desktop app, pick a small model like Llama 3.2 or Gemma, let it download, and chat. No terminal, no account, no fee.
A friendlier, fully graphical route:
Install LM Studio, search "Qwen 8B Q4_K_M" in the Discover tab, download it, load it, and chat.
Asking a model about your own document offline:
Drag a PDF into the Ollama app and ask questions about it, with the model running entirely on your machine.
Using a local model inside a nicer interface:
Run Ollama's local server and point a separate chat front end at it.

06. Common Pitfalls / Misconceptions

"It will be as good as ChatGPT."
It will not. Local models are smaller open models. A good 8B model is genuinely useful for drafting, summarising, and answering questions, but it is not the frontier. Set expectations to "capable assistant," not "the strongest model in the world."

"I need to be a programmer."
Not anymore. Both Ollama and LM Studio now give you a normal chat window. The command line is optional.

"Bigger model number is always better, so I should download the largest."
Only if your computer can hold it. A model too large for your memory either refuses to run or crawls. Match the model size to your hardware first.
See hardware-and-performance.

"The first answer is slow, so it is broken."
The first run downloads several gigabytes and then loads the model into memory before answering. That one-time wait is normal. After loading, replies are much faster.

"A local model can look things up online."
By default it cannot. It answers only from what it learned during training and has no live web access unless you add extra tools. It also has no knowledge of events after its training cutoff.

"Lower quantization is basically free quality."
There is a floor. Very aggressive quantization (Q2, Q3) saves memory but the quality drop becomes noticeable. Q4_K_M is the usual sweet spot, with Q5, Q6, or Q8 if you have the memory to spare.