03. How It Works
Pricing mechanics
LLM APIs charge per token. Input tokens (your prompt) and output tokens (the model's response) are priced separately. Output tokens cost more because they require a full autoregressive forward pass per token, while input tokens can be processed in parallel.
Token pricing tiers (approximate, as of June 2026):
| Tier |
Examples |
Input ($/1M) |
Output ($/1M) |
| Budget |
GPT-5.4 mini ($0.75), Gemini 3 Flash, Mistral Small 4 |
$0.10-0.80 |
$0.30-4.50 |
| Mid / Fast |
Claude Haiku 4.5 ($1.00), Gemini 3.5 Flash |
$1.00-1.50 |
$4.50-5.00 |
| Production |
Claude Sonnet 4.6 ($3.00), GPT-5.4 ($2.50) |
$2.50-3.00 |
$15.00 |
| Frontier |
Claude Opus 4.8 ($5.00), GPT-5.5 ($5.00) |
$5.00 |
$25.00-30.00 |
| Open-weight via API |
DeepSeek V4 Pro ($0.435), DeepSeek V4 Flash ($0.14) |
$0.14-0.44 |
$0.28-0.87 |
Prices verified at anthropic.com/pricing and openai.com/api/pricing/ as of June 2026. Verify before quoting as these change.
Open-weight via API: DeepSeek V4 Pro at $0.435/$0.87 is in a class of its own for frontier-quality capability at budget pricing.
One practical note: the "input" to a real API call is rarely just the user's message. A typical RAG application has a 50-token user query embedded in a 4,000-token context with retrieved documents and a system prompt. The full payload, not the user message, is what you are paying for.
Prompt caching
When you send the same prefix (system prompt, few-shot examples, retrieved context) across many requests, you are paying to process it every time. Prompt caching stores the KV-cache (the internal state computed from the prefix) and reuses it for subsequent calls with the same prefix.
Anthropic and OpenAI both offer prompt caching at roughly 50% off for cached input tokens (Anthropic offers up to 90% off on their largest cache tier). For applications where a long static system prompt or document is reused across many calls, caching can reduce effective input costs by 40-90%.
Implementation: structure your prompt so the static prefix comes first and the variable content (user message, dynamic context) comes last. Cache invalidation is handled automatically; you pay full price when the prefix changes.
Batch processing
For non-real-time workloads (nightly document processing, bulk evaluation, data extraction pipelines), both OpenAI and Anthropic offer a 50% discount via their batch APIs. You submit a file of requests, they are processed within 24 hours, and you retrieve the results.
Batch and caching stack. A batch request with a cached system prompt can cost roughly 25% of the standard real-time rate.
Tiered routing strategy
Routing sends different requests to different models based on complexity, reducing average cost without reducing quality on hard tasks.
A simple routing strategy:
- If the request is a simple lookup, FAQ response, or short classification: route to Gemini 3 Flash or Claude Haiku 4.5 ($0.10-1.00 input).
- If the request is moderate complexity (multi-turn reasoning, summarization): route to Claude Sonnet or Gemini 3.5 Flash.
- Only if the request requires frontier reasoning (complex code, research synthesis, agentic task): route to Claude Opus or GPT-5.5.
Teams report 60-80% cost reduction with negligible quality impact using this strategy. Building a classifier to automate routing is itself a small LLM call, but a cheap one.
Context compression
Long contexts cost more. Before sending a 50,000-token document, consider whether you need it all. Chunking and embedding-based retrieval (RAG), extractive summarization of irrelevant sections, and conversation compression (summarizing prior turns instead of including raw history) can all reduce input tokens substantially.
Practical rule: aim to keep prompt payloads under 10K tokens for most production calls. Use retrieval, not full-context, for long documents.
Streaming
Streaming delivers model output token by token as it is generated, rather than waiting for the full response. This does not change the total cost or token count. It dramatically improves perceived latency: the user sees the first word in under a second rather than waiting for a complete response.
Streaming is the default for most consumer-facing applications. It is inappropriate for batch processing or cases where you need the full response before taking action (e.g., function calling where you parse the complete output).
Latency considerations
Time-to-first-token (TTFT) and tokens per second (TPS) are the two latency metrics.
TTFT depends on: model size, provider infrastructure, prompt length (longer prompts take longer to prefill), and whether the response is cached. Frontier models (Opus, GPT-5.5) have higher TTFT than mid-tier models. Groq and Fireworks AI offer optimized inference with notably lower TTFT for supported models.
TPS is relatively stable within a model tier. At current frontier model speeds, generating a 500-token response takes roughly 5-15 seconds depending on the model and provider.
For applications where latency matters: use the smallest model that meets quality requirements. Gemini 3.5 Flash is explicitly optimised for speed at its tier. Claude Haiku 4.5 is the fastest in the Claude family and meaningfully faster than Claude Opus 4.8.
Self-hosting vs. managed API
Managed API (OpenAI, Anthropic, Google):
- No infrastructure overhead
- Pay per token, no fixed cost
- Automatic scaling
- No access to weights; cannot fine-tune beyond their tools
- Privacy: data leaves your infrastructure
- Provider risk: pricing changes, API changes, model deprecations
- 10-20% markup when accessed through AWS Bedrock or Azure OpenAI
Self-hosted open-weight (Llama 4, DeepSeek V4, Qwen3):
- Fixed infrastructure cost: roughly $1-3/hour per GPU on cloud providers
- No per-token cost; economics improve dramatically at high volume
- Full control: fine-tuning, quantization, custom serving configurations
- Data stays on your infrastructure
- Engineering overhead: MLOps, model serving, monitoring, updates
- A100/H100 GPUs required for large models; smaller models run on consumer hardware
Break-even analysis: for a team spending $5,000+/month on API calls, self-hosting a comparable open-weight model on a leased H100 ($3-4/hour) typically pays for itself within weeks. The catch is ML engineering capacity to maintain it.
Aggregators like Together AI, Fireworks AI, and OpenRouter provide open-weight model inference via API without the self-hosting complexity, at prices between managed APIs and raw compute costs.