03. How Each Works
Prompt Engineering
The model receives a carefully designed system prompt and user message. Techniques include:
- Zero-shot prompting: plain instructions with no examples
- Few-shot prompting: 2 to 10 examples of input/output pairs included in the prompt
- Chain-of-thought (CoT): asking the model to reason step by step before answering
- Role prompting: assigning an expert persona ("You are a senior tax attorney...")
- Structured output instructions: specifying JSON schema, markdown format, or field constraints
Cost is essentially zero beyond the token count. Iteration speed is fast. The downside is that the model's knowledge is still bounded by its training cutoff and it can still hallucinate on factual queries.
RAG
Documents are chunked, embedded, and stored in a vector database during an offline ingestion phase. At query time:
- The user query is embedded using the same model.
- Semantically similar chunks are retrieved.
- Retrieved chunks are injected into the LLM prompt as context.
- The LLM generates a response grounded in those chunks.
RAG solves the knowledge problem without touching model weights. The knowledge base can be updated continuously. The model can cite sources. Private data stays within your infrastructure.
The infrastructure overhead is real: you need an embedding model, a vector database, retrieval logic, and ideally a reranker. Monthly operational cost for a managed setup runs roughly $70 to $1000 depending on index size and query volume.
Fine-Tuning
A labeled dataset of input-output pairs is prepared. The model is trained further on this dataset, updating its weights through backpropagation. The result is a model that has internalized the patterns in the training data.
Fine-tuning is appropriate for:
- Teaching a specific output format or writing style that prompting alone cannot consistently produce
- Domain adaptation where the base model consistently misuses specialized vocabulary
- Behavioral conditioning (always respond formally, never refuse certain categories of requests) that needs to be reliably consistent
- Reducing prompt length at inference time by baking instructions into weights
The cost is substantial: compute time for training, labeled data preparation, evaluation, and often higher per-token inference cost because you are running a custom model endpoint rather than a shared API. Iteration cycles take days to weeks, not hours. Fine-tuned models are also brittle: they may lose general capability on tasks outside the fine-tuning distribution (catastrophic forgetting).