Skip to content

AI Frameworks and Tooling

The Landscape 12 min read

In Short

Building with AI in 2026 means choosing from a dense ecosystem of overlapping tools. Orchestration frameworks (LangChain, LlamaIndex, LangGraph, Haystack, DSPy, Semantic Kernel) handle application logic, agent frameworks (CrewAI, AutoGen, OpenAI Agents SDK) coordinate multi-agent work, and a separate tier of local runners, model hubs, and observability tools completes the stack. The biggest practical debate is whether to use a heavy framework at all or write direct API calls, and the answer now depends on your stage and use case more than any framework's feature list.

01. What It Is

The AI tooling ecosystem is the collection of libraries, platforms, and services that sit between raw model APIs and a finished AI-powered application. It covers how you structure prompts, connect to data, coordinate agents, serve models locally, access hosted models, and observe what your system is doing in production.

02. Why It Matters

No team ships an LLM application as a single requests.post() call in production. At minimum, they need retrieval, tool use, memory, error handling, cost tracking, and some way to evaluate quality. The frameworks and tools below each solve a slice of that problem. Picking the right ones early saves weeks of rework. Picking the wrong ones, or over-committing to frameworks that add indirection without value, costs months.

03. How It Works / the Categories

Orchestration and app frameworks

These are the libraries that structure how your application calls models, routes data, and chains operations.

LangChain is the oldest and most widely recognized framework in this space. It pioneered the abstraction of chains, agents, and memory over a single LLM API call. Its strengths are a massive ecosystem of integrations (vector stores, document loaders, model providers), a large community, and extensive tutorials. Its well-documented criticisms in 2026 are equally large: the core abstractions have grown more complex across three major versions (v0.1, v0.2, v0.3) without proportional gains in developer value, stack traces during debugging span 15-40 frames of internal framework code, and the roadmap is increasingly oriented toward LangSmith (the paid tracing product) rather than the framework itself. Teams report 40-60% code reduction after migrating to raw SDKs. LangChain still makes sense for prototyping, for RAG-heavy applications where the document-loading and retrieval ecosystem pays off, and for teams that need broad multi-provider flexibility fast.

LlamaIndex is the framework of choice for retrieval-augmented generation (RAG) at depth. It provides 150+ data connectors, multiple indexing strategies, and strong support for structured outputs and query planning. It has lower framework overhead than LangChain (~6 ms vs ~8 ms per call in benchmarks) and a more focused API surface. The criticism is that it is overkill for simple document Q&A with a single vector store. Use it when retrieval architecture is the hard problem.

LangGraph is a graph-based orchestration layer built by the LangChain team that models agent workflows as nodes and edges with shared typed state. It is designed for stateful, long-running workflows with built-in checkpointing, resume-from-failure, and human-in-the-loop support. It surpassed CrewAI in GitHub stars in early 2026 driven by enterprise adoption. Because it is built on LangChain, it inherits some of its dependency footprint, and the graph-based mental model has a steeper learning curve than role-based frameworks. Noted separately under agent frameworks below because it spans both categories.

Haystack (by deepset) is the most architecturally principled framework in the group. Every component has typed inputs and outputs. Pipelines are directed acyclic graphs that can be visualized, debugged, and tested node by node. It has the lowest token usage per call of any framework (~1.57k tokens vs ~2.40k for LangChain) and the lowest overhead. Enterprise teams in regulated industries prefer it for auditability and reproducibility. The tradeoff is a smaller ecosystem and a steeper ramp for simple tasks.

DSPy (from Stanford) takes a fundamentally different approach. Instead of writing prompts manually, you define the signature of a reasoning step (input fields, output fields, instructions) and let the framework compile and optimize those prompts automatically using a dataset of examples. It is the lowest-overhead framework (~3.53 ms per call) and represents the direction the field is heading for teams that run systematic evals. The learning curve requires a genuine mindset shift, and the ecosystem is smaller than LangChain or LlamaIndex.

Semantic Kernel (Microsoft) is the enterprise .NET and Java framework for AI. On April 3, 2026, Microsoft shipped Agent Framework 1.0 GA, which unifies Semantic Kernel and AutoGen into a single SDK (Microsoft.Agents.AI) supporting .NET, Java, and Python. If your team works in .NET or Java, this is the default choice. If you work in Python and are not in a Microsoft ecosystem, it is rarely the first pick.

Agent frameworks

Agent frameworks coordinate multiple AI agents working together on a task.

CrewAI uses a role-based model: you define agents with names, roles, goals, and tools, then assign them tasks. It has the lowest barrier to entry of any agent framework and the fastest path from concept to working demo. Its weaknesses are limited support for non-linear workflows and no built-in checkpointing. Best for business process automation where agents have clearly defined roles.

AutoGen (Microsoft) was the original conversational multi-agent framework. As of 2026, Microsoft has shifted primary development to the broader Agent Framework (merging AutoGen and Semantic Kernel into Microsoft.Agents.AI). AutoGen is now in maintenance mode. Existing projects can stay on it, but new projects should prefer the unified Agent Framework or an alternative.

OpenAI Agents SDK (Python, with TypeScript planned) is the model-native harness from OpenAI. It is optimized specifically for how frontier models (GPT-5.4 and above) perform best on long-running, multi-step, multi-tool tasks. It includes built-in sandboxed execution (via partners including E2B, Modal, Vercel, and Cloudflare), a subagents API for nested delegation, and a code mode for write-and-execute workflows. Teams migrating from LangChain to the Agents SDK commonly report 40-50% code reduction and 8-22% latency improvement. The tradeoff is OpenAI lock-in: the SDK is not model-agnostic.

LangGraph spans both orchestration and agent categories. For stateful agent workflows where you need checkpointing, audit trails, rollback, and human review steps, it is currently the most production-battle-tested option in the open-source ecosystem.

Local and self-hosted model runners

Ollama is the default developer tool for running models locally. A single install, ollama pull to fetch a model, ollama run to chat. It wraps llama.cpp on x86 and MLX on Apple Silicon. Peak throughput is around 40 tokens/second. Under concurrent load it struggles, but for single-developer prototyping it requires zero configuration. Exposes an OpenAI-compatible REST API, so tools built against OpenAI can switch to local models by changing a base URL.

LM Studio is Ollama's GUI counterpart. It provides a graphical model browser, chat UI, and built-in search of the Hugging Face hub without a terminal. Since version 0.4.0 it supports headless deployment with continuous batching, making it viable for small team servers. Best for non-technical team members and evaluation workflows.

llama.cpp is the C/C++ reference implementation that powers most of the above tools internally. Fastest raw throughput on CPU (10-20% faster than Ollama on the same hardware). Use it directly for embedded deployments, unusual hardware, and custom quantization workflows. Most users never interact with it directly.

vLLM is the production-grade option. Its PagedAttention and continuous batching algorithms achieve roughly 16-20x Ollama's throughput under concurrent load, reaching 800-12,500 tokens/second depending on GPU. Runs on NVIDIA and AMD data-center GPUs. For any multi-user or customer-facing deployment, vLLM is the correct choice over Ollama. Not designed for developer laptops.

TGI (Text Generation Inference) is Hugging Face's production inference server. It supports tensor parallelism, quantization (GPTQ, AWQ, bitsandbytes), and streaming. In benchmarks it is comparable to vLLM. Its advantage is tight integration with the Hugging Face Hub, deployment on Hugging Face Inference Endpoints, and broad model support including vision-language models. A good default if you are already in the Hugging Face ecosystem.

Model hubs and access

Hugging Face is the reference platform for the open ML ecosystem. The Hub hosts over 2 million models, 500,000+ datasets, and around 1 million interactive Spaces. The transformers library is the standard interface for loading, fine-tuning, and running open-weight models. Hugging Face Inference Endpoints let you deploy any Hub model to a managed GPU. For open-weight model discovery, evaluation, and fine-tuning, it is the starting point for nearly every team.

OpenRouter provides a single API key and unified interface for 300+ models across providers. It routes requests to the cheapest or fastest provider that serves the model. The 5.5% platform fee can compound at scale, but for multi-model prototyping it eliminates credential management across a dozen providers.

Groq and Cerebras are hardware-accelerated inference specialists. Groq uses its own LPU (Language Processing Unit) chips to achieve the lowest time-to-first-token latency on the market. Best for real-time chat, voice agents, and agentic loops where each round-trip delay compounds. Narrow model catalog is the tradeoff.

Together AI offers broad open-weight model selection with built-in fine-tuning pipelines. Strong for research-stage workloads where you want to iterate on model versions without switching providers.

Fireworks AI is optimized for latency-sensitive production deployments: customer-facing chat, real-time code completion, and agentic workflows.

Replicate provides a simple API for running open-source models (and some fine-tuned variants) without managing GPU infrastructure. Useful for occasional or bursty workloads.

Observability and evals tooling

LangSmith is purpose-built for LangChain and LangGraph. It provides node-by-node state diffs, full agent execution graphs, model and tool call breakdowns, and replay against new model versions. If you are in the LangChain/LangGraph ecosystem, it is the natural pairing. The criticism from the community is that it deepens LangChain vendor lock-in, since tracing hooks are enabled by default and increasingly difficult to opt out of.

Langfuse is the self-hosted leader. Open-source, Postgres plus ClickHouse backend, fully OSS-compatible. Acquired by ClickHouse in January 2026, with current capabilities unchanged. Best for teams with self-host requirements or those that want to avoid a second SaaS dependency.

Arize Phoenix is an open-source, OpenTelemetry-native observability project from Arize AI. It captures agent traces, runs LLM-as-judge evaluation, measures RAG metrics, and manages datasets. Arize's commercial platform adds drift detection and enterprise compliance. Best for ML-heritage teams that need evaluation rigor comparable to traditional ML monitoring.

Weights and Biases (W&B Weave) is W&B's LLM observability product, layered on their existing experiment tracking platform. It treats sessions, turns, steps, tools, and sub-agents as first-class concepts. Includes pre-built scorers for safety (toxicity, PII detection, hallucination), quality, and regression prevention. Teams that already use W&B for traditional ML training find it a natural extension for LLM workloads.

Vector databases

Vector databases are the storage layer for embeddings used in RAG.
They are covered in detail in Vector Databases. The main players in 2026 are Pinecone, Weaviate, Qdrant, Chroma, pgvector (PostgreSQL extension), and Milvus.

04. Key Tools at a Glance

Tool Category What it is for
LangChain Orchestration Rapid prototyping, RAG pipelines, multi-provider apps
LlamaIndex Orchestration Deep RAG, complex retrieval architectures
LangGraph Orchestration + Agents Stateful, checkpointed, graph-based agent workflows
Haystack Orchestration Auditable, typed pipelines for regulated or enterprise use
DSPy Orchestration Programmatic prompt optimization, systematic evals
Semantic Kernel Orchestration + Agents .NET/Java enterprise AI, Microsoft ecosystem
CrewAI Agents Role-based multi-agent teams, fast prototyping
AutoGen Agents (Maintenance mode) Conversational multi-agent
OpenAI Agents SDK Agents OpenAI-native, sandboxed, long-running agent tasks
Ollama Local runner Single-developer local inference, OpenAI-compatible API
LM Studio Local runner GUI-based local inference, team evaluation
llama.cpp Local runner Embedded systems, custom quantization, raw speed
vLLM Local runner Multi-user production serving on NVIDIA/AMD GPU
TGI Local runner Hugging Face ecosystem production serving
Hugging Face Model hub Open-weight models, datasets, fine-tuning, Spaces
OpenRouter Model access Single API key for 300+ models
Groq Model access Lowest latency inference, real-time applications
Together AI Model access Broad open-weight catalog with fine-tuning
Fireworks AI Model access Latency-optimized production inference
Replicate Model access Simple API for bursty or occasional open-source workloads
LangSmith Observability Tracing and evals for LangChain/LangGraph projects
Langfuse Observability Self-hosted OSS observability, any framework
Arize Phoenix Observability OpenTelemetry-native evals and RAG metrics
W&B Weave Observability LLM observability for teams already using W&B

05. How to Choose / the Framework Debate

The honest practitioner debate in 2026 is not which framework is best but whether to use a framework at all.

The case for direct API calls: Model providers (OpenAI, Anthropic, Google) have absorbed the core abstractions that justified LangChain in 2022. Tool use, structured output, streaming, function calling, multi-turn memory, and retrieval augmentation are now first-class features of the provider SDKs. A tightly scoped chatbot or single-tool agent can be implemented in 50-100 lines with the raw SDK, easier to debug (no 40-frame stack traces), no extra dependencies, and 8-22% lower latency. Teams that have migrated from LangChain report 40-60% code reduction.

The case for frameworks: For applications that genuinely span multiple providers, multiple retrieval steps, multiple agents, and complex state management, starting from scratch means rebuilding document loaders, chunking strategies, embedding pipelines, retry logic, and evaluation tooling. LlamaIndex for deep RAG, LangGraph for stateful multi-step agent workflows, and Haystack for auditable enterprise pipelines each save weeks of work for the right use case.

A practical heuristic: Use direct SDK calls for single-model, single-tool, simple-retrieval applications. Reach for LlamaIndex when retrieval is the hard part. Reach for LangGraph or CrewAI when multi-agent coordination is the hard part. Reach for Haystack when auditability and typed contracts are required. Use DSPy when prompt quality and systematic evaluation are the hard part. Avoid frameworks as a first move, and add them only when the boilerplate they replace is the actual bottleneck.

06. Common Pitfalls / Misconceptions

Frameworks as a substitute for understanding models:
Teams that abstract everything behind LangChain often cannot explain why their pipeline fails, because the framework hides model behavior. Start with direct calls to understand the model, then abstract.

vLLM for laptop development:
vLLM requires Linux and a data-center-class GPU. Trying to run it on a developer laptop wastes hours. Use Ollama for local work, vLLM for production GPU servers.

LangSmith as "optional."
LangChain's documentation increasingly assumes LangSmith is in your stack. If you want to avoid the paid dependency, use Langfuse or Arize Phoenix from the start, rather than discovering the lock-in six months in.

AutoGen for new projects:
Microsoft has moved active development to the broader Agent Framework (Semantic Kernel plus AutoGen merged). Starting a new project on legacy AutoGen means targeting a maintenance-only codebase.

Treating OpenRouter as free:
The 5.5% platform fee is charged on top of every provider's rate. At scale, that is a significant line item. Run the numbers before committing to it for production volume.

Framework version churn:
LangChain has gone through three breaking version cycles. Pinning to a specific version and controlling upgrade timing deliberately reduces surprise maintenance costs.

Verified against primary sources

Every claim traces to a cited source below.

Key terms

Orchestration framework
A library that structures how an app calls models, routes data, and chains operations.
Agent framework
Software that coordinates multiple AI agents working together on a task.
RAG
Retrieval-augmented generation, where retrieved data is fed to a model to answer queries.
Local model runner
A tool that serves and runs AI models locally instead of via a hosted API.
Model hub
A platform for discovering, accessing, and deploying open-weight models.

Tags

#ai-frameworks #llm #rag #agents #orchestration #observability

More in Models & Providers