Cost, Latency, and Deployment

In Short

Running LLMs in production has concrete, measurable costs dominated by token counts, model tier, and how intelligently you route requests. In 2026, the difference between a naive deployment and an optimized one is often a 10x to 100x cost reduction with minimal quality impact.

01. What It Is

Deploying a language model in production means making architectural decisions that determine how much you pay, how fast responses arrive, and whether the system can handle your usage volume. These decisions include: which model to use, whether to use an API or self-host, how to structure prompts, whether to cache repeated inputs, and whether to process requests in real time or in batches.

This is distinct from model capability selection. A team can choose the "best" model and then misconfigure its deployment to spend 50x more than necessary. Cost and latency optimization is engineering, not just model selection.

02. Why It Matters

At low volume, token costs are invisible. At production scale, they are a line item that determines profitability. A chatbot handling 1,000 daily conversations on a $2.50-$5.00/MTok model can cost hundreds of dollars per month depending on prompt and output length. The same chatbot on a budget model (Gemini 3 Flash or GPT-5.4 mini at under $1/MTok) can cost a fraction of that, often at comparable quality for conversational tasks. Model the actual prompt structure to get accurate estimates.

Latency matters for user experience. A 3-second time-to-first-token is acceptable for a research assistant. It is not acceptable for a real-time chat interface. Different applications have different latency budgets, and model choice should reflect that.

03. How It Works

Pricing mechanics

LLM APIs charge per token. Input tokens (your prompt) and output tokens (the model's response) are priced separately. Output tokens cost more because they require a full autoregressive forward pass per token, while input tokens can be processed in parallel.

Token pricing tiers (approximate, as of June 2026):

Tier	Examples	Input ($/1M)	Output ($/1M)
Budget	GPT-5.4 mini ($0.75), Gemini 3 Flash, Mistral Small 4	$0.10-0.80	$0.30-4.50
Mid / Fast	Claude Haiku 4.5 ($1.00), Gemini 3.5 Flash	$1.00-1.50	$4.50-5.00
Production	Claude Sonnet 4.6 ($3.00), GPT-5.4 ($2.50)	$2.50-3.00	$15.00
Frontier	Claude Opus 4.8 ($5.00), GPT-5.5 ($5.00)	$5.00	$25.00-30.00
Open-weight via API	DeepSeek V4 Pro ($0.435), DeepSeek V4 Flash ($0.14)	$0.14-0.44	$0.28-0.87

Prices verified at anthropic.com/pricing and openai.com/api/pricing/ as of June 2026. Verify before quoting as these change.

One practical note: the "input" to a real API call is rarely just the user's message. A typical RAG application has a 50-token user query embedded in a 4,000-token context with retrieved documents and a system prompt. The full payload, not the user message, is what you are paying for.

Prompt caching

When you send the same prefix (system prompt, few-shot examples, retrieved context) across many requests, you are paying to process it every time. Prompt caching stores the KV-cache (the internal state computed from the prefix) and reuses it for subsequent calls with the same prefix.

Anthropic and OpenAI both offer prompt caching at roughly 50% off for cached input tokens (Anthropic offers up to 90% off on their largest cache tier). For applications where a long static system prompt or document is reused across many calls, caching can reduce effective input costs by 40-90%.

Implementation: structure your prompt so the static prefix comes first and the variable content (user message, dynamic context) comes last. Cache invalidation is handled automatically. You pay full price when the prefix changes.

Batch processing

For non-real-time workloads (nightly document processing, bulk evaluation, data extraction pipelines), both OpenAI and Anthropic offer a 50% discount via their batch APIs. You submit a file of requests, they are processed within 24 hours, and you retrieve the results.

Batch and caching stack. A batch request with a cached system prompt can cost roughly 25% of the standard real-time rate.

Tiered routing strategy

Routing sends different requests to different models based on complexity, reducing average cost without reducing quality on hard tasks.

A simple routing strategy:

If the request is a simple lookup, FAQ response, or short classification: route to Gemini 3 Flash or Claude Haiku 4.5 ($0.10-1.00 input).
If the request is moderate complexity (multi-turn reasoning, summarization): route to Claude Sonnet or Gemini 3.5 Flash.
Only if the request requires frontier reasoning (complex code, research synthesis, agentic task): route to Claude Opus or GPT-5.5.

Teams report 60-80% cost reduction with negligible quality impact using this strategy. Building a classifier to automate routing is itself a small LLM call, but a cheap one.

Context compression

Long contexts cost more. Before sending a 50,000-token document, consider whether you need it all. Chunking and embedding-based retrieval (RAG), extractive summarization of irrelevant sections, and conversation compression (summarizing prior turns instead of including raw history) can all reduce input tokens substantially.

Practical rule: aim to keep prompt payloads under 10K tokens for most production calls. Use retrieval, not full-context, for long documents.

Streaming

Streaming delivers model output token by token as it is generated, rather than waiting for the full response. This does not change the total cost or token count. It dramatically improves perceived latency: the user sees the first word in under a second rather than waiting for a complete response.

Streaming is the default for most consumer-facing applications. It is inappropriate for batch processing or cases where you need the full response before taking action (e.g., function calling where you parse the complete output).

Latency considerations

Time-to-first-token (TTFT) and tokens per second (TPS) are the two latency metrics.

TTFT depends on: model size, provider infrastructure, prompt length (longer prompts take longer to prefill), and whether the response is cached. Frontier models (Opus, GPT-5.5) have higher TTFT than mid-tier models. Groq and Fireworks AI offer optimized inference with notably lower TTFT for supported models.

TPS is relatively stable within a model tier. At current frontier model speeds, generating a 500-token response takes roughly 5-15 seconds depending on the model and provider.

For applications where latency matters: use the smallest model that meets quality requirements. Gemini 3.5 Flash is explicitly optimised for speed at its tier. Claude Haiku 4.5 is the fastest in the Claude family and meaningfully faster than Claude Opus 4.8.

Self-hosting vs. managed API

Managed API (OpenAI, Anthropic, Google):

No infrastructure overhead
Pay per token, no fixed cost
Automatic scaling
No access to weights. Cannot fine-tune beyond their tools
Privacy: data leaves your infrastructure
Provider risk: pricing changes, API changes, model deprecations
10-20% markup when accessed through AWS Bedrock or Azure OpenAI

Self-hosted open-weight (Llama 4, DeepSeek V4, Qwen3):

Fixed infrastructure cost: roughly $1-3/hour per GPU on cloud providers
No per-token cost. Economics improve dramatically at high volume
Full control: fine-tuning, quantization, custom serving configurations
Data stays on your infrastructure
Engineering overhead: MLOps, model serving, monitoring, updates
A100/H100 GPUs required for large models. Smaller models run on consumer hardware

Break-even analysis: for a team spending $5,000+/month on API calls, self-hosting a comparable open-weight model on a leased H100 ($3-4/hour) typically pays for itself within weeks. The catch is ML engineering capacity to maintain it.

Aggregators like Together AI, Fireworks AI, and OpenRouter provide open-weight model inference via API without the self-hosting complexity, at prices between managed APIs and raw compute costs.

04. Model Routing

The tiered strategy above can be automated. A model router is a small, fast component that sits in front of several models, reads each incoming request, estimates how hard it is, and forwards it to the cheapest model likely to answer it well.
Simple requests go to a small, inexpensive model (see Small Language Models), and only the hard ones reach a frontier model (see Model Landscape 2026 for the tiers a router selects among).

Several approaches exist. RouteLLM is an open-source research framework from UC Berkeley and Anyscale that trains the router on human preference data.
Commercial products include OpenRouter's Auto Router, Not Diamond (which also powers that Auto Router), and Martian.
A lighter method, semantic routing, matches the request to labeled examples using embeddings, which is faster and cheaper than asking a model to classify it.

Reported savings are large but benchmark-specific. The RouteLLM team reports cost cuts above 85% on one conversational benchmark (MT Bench) while keeping 95% of GPT-4's performance, although the same routers saved closer to 45% and 35% on two harder benchmarks. Vendor figures run higher and should be read as unverified marketing.
Martian, for instance, advertises savings of "up to 98%".

The tradeoffs are real. A misjudged request gets a weak answer, the routing step adds a little latency of its own, and behavior becomes less predictable because the same prompt may reach a different model tomorrow.

05. Key Terms

Token: The unit of pricing. Roughly 4 characters or 0.75 words in English. A typical paragraph is 100-150 tokens.

TTFT (Time to First Token):
How long until the model starts outputting. The most perceptible latency metric for streaming applications.

Prompt caching:
Storing the computed KV-cache for repeated prompt prefixes to avoid reprocessing.

Batch API:
Non-real-time processing at 50% discount. Results delivered within 24 hours.

Model routing:
Directing requests to different models based on complexity to optimize cost.

RAG (Retrieval-Augmented Generation):
Fetching relevant document chunks at query time rather than including entire documents in the context window.

Quantization:
Reducing the numerical precision of model weights (e.g., from float32 to int4) to fit larger models on less VRAM at some quality cost.

06. Examples

High-volume chatbot:
Route simple greetings and FAQ lookups to Gemini 3 Flash at $0.10/1M input. Only escalate to Claude Sonnet when the conversation involves account-specific data. Implement caching on the system prompt. Result: 90%+ cost reduction vs. always using a frontier model.
Document processing pipeline:
Nightly processing of 10,000 contracts. Use batch API (50% off) + prompt caching for the extraction schema system prompt (50% off cached tokens). Effective cost: roughly 25% of real-time rates.
Autonomous coding agent:
Requires frontier model (Claude Opus 4.8 or GPT-5.5). No meaningful cost-cutting without quality loss. Offset by high per-task value. Design the agent to minimize round trips by batching tool calls.
Startup on a budget:
Start on Gemini 3.5 Flash or DeepSeek V4 (via Together AI). Implement caching from day one. Plan migration to self-hosted DeepSeek when monthly spend crosses $3,000.

07. Common Pitfalls and Misconceptions

"Pricing listed on the website is what you'll pay."
Effective cost depends heavily on caching hit rate, output verbosity, and context size. Model your actual prompt structure before estimating costs.

"Self-hosting is always cheaper at scale."
It is cheaper per-token at volume but requires ML engineering talent to maintain. If you do not have that capacity, the operational overhead outweighs the savings.

"Streaming reduces cost."
Streaming does not change token count or cost. It only improves perceived latency.

"Longer context windows are always better."
Larger context costs more and introduces more noise. Retrieval over large documents is often more accurate and cheaper than stuffing them into the context window.

"The cheapest model that passes your eval is the right choice."
True on day one. On day 60, when your use case has expanded and users are submitting harder queries, that eval may no longer capture real failure modes. Monitor quality in production, not just in pre-launch testing.

"Volume discounts come automatically."
They do not. Negotiate at $5,000+/month. Expect custom terms at $100,000+/month. Providers do not proactively offer discounts. You have to ask.