03. How It Works / the Categories
Orchestration and app frameworks
These are the libraries that structure how your application calls models, routes data, and chains operations.
LangChain is the oldest and most widely recognized framework in this space. It pioneered the abstraction of chains, agents, and memory over a single LLM API call. Its strengths are a massive ecosystem of integrations (vector stores, document loaders, model providers), a large community, and extensive tutorials. Its well-documented criticisms in 2026 are equally large: the core abstractions have grown more complex across three major versions (v0.1, v0.2, v0.3) without proportional gains in developer value, stack traces during debugging span 15-40 frames of internal framework code, and the roadmap is increasingly oriented toward LangSmith (the paid tracing product) rather than the framework itself. Teams report 40-60% code reduction after migrating to raw SDKs. LangChain still makes sense for prototyping, for RAG-heavy applications where the document-loading and retrieval ecosystem pays off, and for teams that need broad multi-provider flexibility fast.
LlamaIndex is the framework of choice for retrieval-augmented generation (RAG) at depth. It provides 150+ data connectors, multiple indexing strategies, and strong support for structured outputs and query planning. It has lower framework overhead than LangChain (~6 ms vs ~8 ms per call in benchmarks) and a more focused API surface. The criticism is that it is overkill for simple document Q&A with a single vector store. Use it when retrieval architecture is the hard problem.
LangGraph is a graph-based orchestration layer built by the LangChain team that models agent workflows as nodes and edges with shared typed state. It is designed for stateful, long-running workflows with built-in checkpointing, resume-from-failure, and human-in-the-loop support. It surpassed CrewAI in GitHub stars in early 2026 driven by enterprise adoption. Because it is built on LangChain, it inherits some of its dependency footprint, and the graph-based mental model has a steeper learning curve than role-based frameworks. Noted separately under agent frameworks below because it spans both categories.
Haystack (by deepset) is the most architecturally principled framework in the group. Every component has typed inputs and outputs. Pipelines are directed acyclic graphs that can be visualized, debugged, and tested node by node. It has the lowest token usage per call of any framework (~1.57k tokens vs ~2.40k for LangChain) and the lowest overhead. Enterprise teams in regulated industries prefer it for auditability and reproducibility. The tradeoff is a smaller ecosystem and a steeper ramp for simple tasks.
DSPy (from Stanford) takes a fundamentally different approach. Instead of writing prompts manually, you define the signature of a reasoning step (input fields, output fields, instructions) and let the framework compile and optimize those prompts automatically using a dataset of examples. It is the lowest-overhead framework (~3.53 ms per call) and represents the direction the field is heading for teams that run systematic evals. The learning curve requires a genuine mindset shift, and the ecosystem is smaller than LangChain or LlamaIndex.
Semantic Kernel (Microsoft) is the enterprise .NET and Java framework for AI. On April 3, 2026, Microsoft shipped Agent Framework 1.0 GA, which unifies Semantic Kernel and AutoGen into a single SDK (Microsoft.Agents.AI) supporting .NET, Java, and Python. If your team works in .NET or Java, this is the default choice. If you work in Python and are not in a Microsoft ecosystem, it is rarely the first pick.
Agent frameworks
Agent frameworks coordinate multiple AI agents working together on a task.
CrewAI uses a role-based model: you define agents with names, roles, goals, and tools, then assign them tasks. It has the lowest barrier to entry of any agent framework and the fastest path from concept to working demo. Its weaknesses are limited support for non-linear workflows and no built-in checkpointing. Best for business process automation where agents have clearly defined roles.
AutoGen (Microsoft) was the original conversational multi-agent framework. As of 2026, Microsoft has shifted primary development to the broader Agent Framework (merging AutoGen and Semantic Kernel into Microsoft.Agents.AI). AutoGen is now in maintenance mode. Existing projects can stay on it, but new projects should prefer the unified Agent Framework or an alternative.
OpenAI Agents SDK (Python, with TypeScript planned) is the model-native harness from OpenAI. It is optimized specifically for how frontier models (GPT-5.4 and above) perform best on long-running, multi-step, multi-tool tasks. It includes built-in sandboxed execution (via partners including E2B, Modal, Vercel, and Cloudflare), a subagents API for nested delegation, and a code mode for write-and-execute workflows. Teams migrating from LangChain to the Agents SDK commonly report 40-50% code reduction and 8-22% latency improvement. The tradeoff is OpenAI lock-in: the SDK is not model-agnostic.
LangGraph spans both orchestration and agent categories. For stateful agent workflows where you need checkpointing, audit trails, rollback, and human review steps, it is currently the most production-battle-tested option in the open-source ecosystem.
Local and self-hosted model runners
Ollama is the default developer tool for running models locally. A single install, ollama pull to fetch a model, ollama run to chat. It wraps llama.cpp on x86 and MLX on Apple Silicon. Peak throughput is around 40 tokens/second. Under concurrent load it struggles, but for single-developer prototyping it requires zero configuration. Exposes an OpenAI-compatible REST API, so tools built against OpenAI can switch to local models by changing a base URL.
LM Studio is Ollama's GUI counterpart. It provides a graphical model browser, chat UI, and built-in search of the Hugging Face hub without a terminal. Since version 0.4.0 it supports headless deployment with continuous batching, making it viable for small team servers. Best for non-technical team members and evaluation workflows.
llama.cpp is the C/C++ reference implementation that powers most of the above tools internally. Fastest raw throughput on CPU (10-20% faster than Ollama on the same hardware). Use it directly for embedded deployments, unusual hardware, and custom quantization workflows. Most users never interact with it directly.
vLLM is the production-grade option. Its PagedAttention and continuous batching algorithms achieve roughly 16-20x Ollama's throughput under concurrent load, reaching 800-12,500 tokens/second depending on GPU. Runs on NVIDIA and AMD data-center GPUs. For any multi-user or customer-facing deployment, vLLM is the correct choice over Ollama. Not designed for developer laptops.
TGI (Text Generation Inference) is Hugging Face's production inference server. It supports tensor parallelism, quantization (GPTQ, AWQ, bitsandbytes), and streaming. In benchmarks it is comparable to vLLM. Its advantage is tight integration with the Hugging Face Hub, deployment on Hugging Face Inference Endpoints, and broad model support including vision-language models. A good default if you are already in the Hugging Face ecosystem.
Model hubs and access
Hugging Face is the reference platform for the open ML ecosystem. The Hub hosts over 2 million models, 500,000+ datasets, and around 1 million interactive Spaces. The transformers library is the standard interface for loading, fine-tuning, and running open-weight models. Hugging Face Inference Endpoints let you deploy any Hub model to a managed GPU. For open-weight model discovery, evaluation, and fine-tuning, it is the starting point for nearly every team.
OpenRouter provides a single API key and unified interface for 300+ models across providers. It routes requests to the cheapest or fastest provider that serves the model. The 5.5% platform fee can compound at scale, but for multi-model prototyping it eliminates credential management across a dozen providers.
Groq and Cerebras are hardware-accelerated inference specialists. Groq uses its own LPU (Language Processing Unit) chips to achieve the lowest time-to-first-token latency on the market. Best for real-time chat, voice agents, and agentic loops where each round-trip delay compounds. Narrow model catalog is the tradeoff.
Together AI offers broad open-weight model selection with built-in fine-tuning pipelines. Strong for research-stage workloads where you want to iterate on model versions without switching providers.
Fireworks AI is optimized for latency-sensitive production deployments: customer-facing chat, real-time code completion, and agentic workflows.
Replicate provides a simple API for running open-source models (and some fine-tuned variants) without managing GPU infrastructure. Useful for occasional or bursty workloads.
Observability and evals tooling
LangSmith is purpose-built for LangChain and LangGraph. It provides node-by-node state diffs, full agent execution graphs, model and tool call breakdowns, and replay against new model versions. If you are in the LangChain/LangGraph ecosystem, it is the natural pairing. The criticism from the community is that it deepens LangChain vendor lock-in, since tracing hooks are enabled by default and increasingly difficult to opt out of.
Langfuse is the self-hosted leader. Open-source, Postgres plus ClickHouse backend, fully OSS-compatible. Acquired by ClickHouse in January 2026, with current capabilities unchanged. Best for teams with self-host requirements or those that want to avoid a second SaaS dependency.
Arize Phoenix is an open-source, OpenTelemetry-native observability project from Arize AI. It captures agent traces, runs LLM-as-judge evaluation, measures RAG metrics, and manages datasets. Arize's commercial platform adds drift detection and enterprise compliance. Best for ML-heritage teams that need evaluation rigor comparable to traditional ML monitoring.
Weights and Biases (W&B Weave) is W&B's LLM observability product, layered on their existing experiment tracking platform. It treats sessions, turns, steps, tools, and sub-agents as first-class concepts. Includes pre-built scorers for safety (toxicity, PII detection, hallucination), quality, and regression prevention. Teams that already use W&B for traditional ML training find it a natural extension for LLM workloads.
Vector databases
Vector databases are the storage layer for embeddings used in RAG.
They are covered in detail in Vector Databases. The main players in 2026 are Pinecone, Weaviate, Qdrant, Chroma, pgvector (PostgreSQL extension), and Milvus.