MLOps and LLMOps

In Short

MLOps applies DevOps practices to the full machine learning lifecycle, covering experiment tracking, model versioning, CI/CD, monitoring, and retraining. LLMOps extends this for large language models, adding prompt versioning, eval pipelines, cost and latency monitoring, and guardrails that have no direct analogue in classical ML workflows.

01. What It Is

MLOps (Machine Learning Operations) is the discipline of bringing software engineering rigor to the lifecycle of ML models in production. It covers everything from data ingestion and experiment tracking through model training, evaluation, deployment, and monitoring, and back to retraining when performance degrades.

LLMOps applies the same discipline to systems built on large language models. The core concerns are similar, but the specifics are different enough to warrant a separate treatment. Traditional ML models are trained from scratch or fine-tuned on domain data and then deployed as static artifacts. LLM-based applications often sit on top of a foundation model that the developer did not train, invoking it via API, and the primary engineering surface is the prompt rather than model weights.

Both disciplines are responses to a well-documented failure mode: models that work well in notebooks often fail in production. The reasons include data drift, model drift, silent failures, unreproducible experiments, and the absence of quality gates between development and deployment.

02. Why It Matters

A model that is not monitored, versioned, and maintained degrades silently. Data distributions shift. User behavior changes. The ground truth the model was trained on may no longer reflect reality. Without MLOps infrastructure, these regressions are invisible until they cause business damage.

For LLMs the stakes are compounded by cost and non-determinism. A prompt change that looks harmless can double token consumption or introduce subtle quality regressions. Because outputs are free-form text rather than a scalar metric, defining "correct" requires careful eval design. Sector regulation in finance and healthcare, along with the EU AI Act, adds pressure to demonstrate that deployed models behave as intended and can be audited.

03. How It Works

The ML lifecycle

The classical ML lifecycle has six stages: data preparation, feature engineering, model training, evaluation, deployment, and monitoring. MLOps tools wire these stages together with automation and auditability.

Experiment tracking captures hyperparameters, code version, data snapshot, and metrics for every training run. MLflow is a widely used open-source tool for this. A run records parameters via mlflow.log_param() and metrics via mlflow.log_metric(), and MLflow 3 introduced direct model checkpointing so that individual checkpoints within a run can be ranked and compared. Every run belongs to an experiment, which groups related runs for comparison.
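As a minimal sketch of what a tracked run looks like, the helper below logs parameters and a per-epoch metric through a tracker object. The function name log_training_run and the "val_loss" metric name are hypothetical; the tracker is assumed to expose start_run(), log_param(), and log_metric() in the style of the mlflow module, and is passed in so the logic can be exercised without a tracking server.

```python
from typing import Any, Mapping, Sequence


def log_training_run(
    tracker: Any,
    params: Mapping[str, Any],
    val_losses: Sequence[float],
) -> None:
    """Record one run's parameters and its per-epoch validation loss.

    `tracker` is assumed to behave like the `mlflow` module: start_run()
    returns a context manager, and log_param / log_metric record values
    against the active run.
    """
    with tracker.start_run():
        for name, value in params.items():
            tracker.log_param(name, value)
        for epoch, loss in enumerate(val_losses):
            # Passing `step` lets the metric be plotted over epochs.
            tracker.log_metric("val_loss", loss, step=epoch)
```

With MLflow installed, a call such as log_training_run(mlflow, {"lr": 0.01}, [0.9, 0.7]) would be one way to use it; the parameter values here are made up for illustration.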

Model versioning and registries provide a centralized store for trained artifacts. MLflow's Model Registry assigns version numbers to registered models, supports aliases such as @champion and @challenger for deployment routing, and links each version back to the training run that produced it. Google Cloud's Vertex AI Model Registry and Azure ML serve the same function for cloud-hosted workflows.
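The alias mechanism can be illustrated with a toy in-memory registry. This is not MLflow's API: the class ToyModelRegistry and its methods are hypothetical, and the sketch only shows the idea that versions are immutable and numbered, while aliases are movable pointers that deployment code resolves at load time.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModelRegistry:
    """In-memory stand-in for a model registry (illustrative only)."""

    versions: dict = field(default_factory=dict)  # version number -> source run id
    aliases: dict = field(default_factory=dict)   # alias name -> version number

    def register(self, run_id: str) -> int:
        """Store a new version linked to the run that produced it."""
        version = len(self.versions) + 1
        self.versions[version] = run_id
        return version

    def set_alias(self, alias: str, version: int) -> None:
        """Point an alias at an existing version, replacing any old target."""
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.aliases[alias] = version

    def resolve(self, alias: str) -> tuple:
        """Return (version, run_id) for the version an alias points at."""
        version = self.aliases[alias]
        return version, self.versions[version]
```

Promoting a challenger then amounts to moving the "champion" alias to its version; serving code that resolves the alias picks up the change without a code deploy.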

CI/CD for ML adds automated training pipelines, evaluation gates, and promotion logic to standard CI/CD infrastructure. A commit that changes training code triggers a pipeline that retrains the model, runs evaluation, and promotes to staging if quality thresholds are met.
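An evaluation gate can be as small as a function that compares candidate metrics to minimum thresholds and reports which ones failed. The sketch below assumes every metric is higher-is-better; the name should_promote and the metric names used with it are hypothetical.

```python
def should_promote(candidate_metrics: dict, thresholds: dict) -> tuple:
    """Return (ok, failures) for a candidate model.

    Every metric named in `thresholds` must be present in
    `candidate_metrics` and at least as large as its minimum.
    """
    failures = []
    for name, minimum in thresholds.items():
        value = candidate_metrics.get(name)
        if value is None:
            # A missing metric fails the gate rather than passing silently.
            failures.append(f"{name}: metric missing")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return (not failures), failures
```

In a CI job, a false result would typically translate into a non-zero exit code so that the promotion step never runs.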

Feature stores decouple feature computation from model training, ensuring that the features used at training time match the features served at inference time. Temporal leakage, where future data contaminates training features, is one of the most common bugs in production ML. Feature stores enforce point-in-time correctness.
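Point-in-time correctness comes down to one rule: when building a training row for an event at time t, use only the feature value that was known at or before t. A minimal sketch of that lookup, assuming a feature's history is a list of (timestamp, value) pairs sorted by timestamp (the function name is hypothetical):

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, event_time):
    """Return the latest feature value observed at or before event_time.

    feature_history: list of (timestamp, value), sorted ascending by timestamp.
    Returns None if no value had been recorded yet at event_time.
    """
    timestamps = [ts for ts, _ in feature_history]
    # Index of the first entry strictly after event_time.
    idx = bisect_right(timestamps, event_time)
    return feature_history[idx - 1][1] if idx > 0 else None
```

A naive join that takes the most recent value overall would return the last entry regardless of event_time, which is exactly the temporal leakage described above.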

Data and model drift monitoring compares the statistical distribution of incoming data and model outputs against a training-time baseline. Drift does not necessarily mean the model is wrong, but it is a signal that re-evaluation is needed. Statistical tests such as the Kolmogorov-Smirnov test and Population Stability Index are commonly applied. WhyLabs built an open-source toolkit (whylogs) specifically for this.
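Both tests are simple enough to sketch with the standard library. The version below is illustrative rather than production-grade: the PSI uses equal-width bins taken from the baseline's range and a small epsilon for empty bins (both are assumptions of this sketch), and the KS function returns only the test statistic, not a p-value.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the two ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
```

A PSI above roughly 0.25 is often treated as a significant shift, but that cutoff is a rule of thumb and should be calibrated per feature.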

Retraining closes the loop. When monitoring signals degrade, the pipeline triggers a new training run. The question of when to retrain, how often, and on what data is specific to each use case and often requires a human decision point.

What is different for LLMs

Prompt versioning replaces weight versioning as the primary artifact to track. A prompt is a text string that can change a model's behavior as dramatically as retraining. Prompt versions should be stored in a registry, associated with evaluation results, and subject to the same promotion gates as model versions.
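One lightweight way to version prompts is to key each one by a hash of its text, so that any edit produces a new version id that eval results can be attached to. The registry below is a hypothetical in-memory sketch, not the API of any particular tool.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """In-memory prompt store keyed by name and content hash (illustrative)."""

    entries: dict = field(default_factory=dict)

    def register(self, name: str, text: str) -> str:
        """Store a prompt and return a version id derived from its content."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        version_id = f"{name}@{digest}"
        # Re-registering identical text is a no-op, so ids are stable.
        self.entries.setdefault(version_id, {"text": text, "eval_scores": {}})
        return version_id

    def record_eval(self, version_id: str, metric: str, score: float) -> None:
        """Attach an evaluation result to a specific prompt version."""
        self.entries[version_id]["eval_scores"][metric] = score

    def get(self, version_id: str) -> dict:
        return self.entries[version_id]
```

Because the id changes whenever the text changes, an eval score can never be silently attributed to a prompt that was edited afterward.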

Eval pipelines replace accuracy metrics on held-out test sets. Because LLM outputs are open-ended text, evaluation requires LLM-graded rubrics, human review, or both. Anthropic's evaluation framework asks developers to define success criteria that are specific, measurable, achievable, and relevant, then build automated graders (string match, code-based, or LLM-as-judge) against a representative dataset. OpenAI's Evals API provides infrastructure for running these pipelines at scale: an eval is a named configuration of test data schema and grading criteria, and a run executes a prompt against a dataset and returns per-criteria pass/fail counts.
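The shape of such a pipeline can be sketched without any vendor SDK. In the example below, generate stands in for whatever function calls the model, the dataset keys "input" and "expected" are assumptions, and the two graders are simple code-based checks; an LLM-as-judge grader would plug in as another function with the same signature.

```python
def exact_match(output: str, item: dict) -> bool:
    """Code-based grader: case-insensitive match against the expected answer."""
    return output.strip().lower() == item["expected"].strip().lower()


def within_length(output: str, item: dict, max_chars: int = 200) -> bool:
    """Code-based grader: output stays under a length budget."""
    return len(output) <= max_chars


def run_eval(generate, dataset, graders):
    """Run `generate` on each item and return per-criterion pass/fail counts.

    generate: callable mapping an input string to an output string.
    dataset:  list of dicts with "input" and "expected" keys.
    graders:  dict mapping criterion name -> callable(output, item) -> bool.
    """
    results = {name: {"passed": 0, "failed": 0} for name in graders}
    for item in dataset:
        output = generate(item["input"])
        for name, grader in graders.items():
            key = "passed" if grader(output, item) else "failed"
            results[name][key] += 1
    return results
```

Keeping graders as plain functions makes it easy to validate each one on its own before trusting the aggregate counts.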

Cost and latency monitoring are first-class concerns. Each API call to a foundation model has a direct dollar cost tied to token count. A prompt regression that increases output length by 30% is a cost regression. Latency SLAs require tracking time-to-first-token and total generation time per request.
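The arithmetic behind both metrics is short. The sketch below assumes per-million-token pricing with separate input and output rates and a streaming response exposed as an iterable of text chunks; the function names and any prices passed in are hypothetical, not a provider's actual rates.

```python
import time


def request_cost_usd(input_tokens, output_tokens, usd_per_mtok_in, usd_per_mtok_out):
    """Dollar cost of one request under per-million-token pricing."""
    return (
        input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out
    ) / 1_000_000


def measure_stream(chunks):
    """Consume a streamed response; return (text, ttft_seconds, total_seconds).

    ttft_seconds is the delay until the first chunk arrives, or None if the
    stream produced no chunks at all.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), ttft, total
```

Because output tokens are usually priced higher than input tokens, a change that lengthens responses tends to move cost more than the same number of extra prompt tokens would.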

Guardrails are runtime filters applied to LLM inputs and outputs. They cover toxicity, off-topic content, PII leakage, hallucinated or ungrounded claims, and policy compliance. A triggered guardrail can block the response or raise an alert, and the rate at which guardrails fire doubles as a monitoring signal.
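A minimal output guardrail might look like the sketch below. The regular expressions are deliberately simplistic assumptions (real PII detection needs far more than two patterns), and the function name, labels, and return shape are hypothetical.

```python
import re

# Simplistic illustrative patterns; not adequate for production PII detection.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def apply_output_guardrail(text, blocked_terms=()):
    """Redact PII and block responses that contain a blocked term.

    Returns a dict with the decision, the text to send, and the violations
    found, so callers can both act on and log the result.
    """
    violations = []
    redacted = text
    for label, pattern in PII_PATTERNS.items():
        redacted, count = pattern.subn(f"[REDACTED_{label.upper()}]", redacted)
        if count:
            violations.append(f"pii:{label}")
    lowered = text.lower()
    for term in blocked_terms:
        if term.lower() in lowered:
            violations.append(f"blocked_term:{term}")
    blocked = any(v.startswith("blocked_term:") for v in violations)
    return {
        "allowed": not blocked,
        "text": "" if blocked else redacted,
        "violations": violations,
    }
```

Returning the list of violations, rather than only a boolean, is what lets the same check feed dashboards and alerts.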

Observability for LLMs means tracing the full context window, not just input/output pairs. MLflow's GenAI tracing module and tools like LangSmith capture the chain of reasoning, tool calls, and retrieved documents for each inference, making it possible to diagnose why a particular response was generated.

04. Key Terms and Methods

Experiment: A named group of related training runs in MLflow. The unit of organization for a modeling effort.

Run: A single execution of training or inference code. Records parameters, metrics, and artifacts.

Model registry: A versioned store for trained model artifacts with promotion workflows.

Feature store: A system for storing and serving ML features with point-in-time correctness.

Data drift: A change in the statistical distribution of input data relative to the training distribution.

Concept drift: A change in the relationship between inputs and the correct output, independent of input distribution.

Prompt versioning: Treating prompt text as a versioned artifact, stored and evaluated with the same rigor as model weights.

LLM-as-judge: Using an LLM to grade the output of another LLM against a rubric. Fast and scalable but requires validation that the judge itself is reliable.

Eval pipeline: An automated system that runs a prompt and model against a labeled dataset and reports quality metrics. Central to both Anthropic's and OpenAI's recommended LLMOps practices.

Guardrails: Runtime input and output filters for LLM applications. NVIDIA NeMo Guardrails, itself open source, and similar toolkits implement these as configurable rule sets.

Canary deployment: Rolling out a new model version to a small percentage of traffic before full promotion, to detect regressions with limited user impact.
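Canary routing is commonly implemented by hashing a stable key, such as a user or session id, into a bucket, so that the same caller consistently sees the same model version. A minimal sketch under that assumption (the function name and key format are hypothetical):

```python
import hashlib


def route_to_canary(request_key: str, canary_percent: float) -> bool:
    """Deterministically assign a key to the canary with the given percentage.

    The key is hashed into one of 10,000 buckets, so percentages can be
    set in steps of 0.01.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_percent * 100
```

Hashing rather than random sampling keeps each user's experience consistent across requests and makes a regression reproducible for the affected keys.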

05. Examples

A team at a financial services firm fine-tunes a model for document classification. They use MLflow to log each training run. When a new model version is registered as a candidate for the @production alias, a CI pipeline automatically runs their eval suite against a 500-item labeled dataset. If accuracy falls below 94%, the alias is not moved and promotion is blocked.
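A threshold measured on 500 items carries sampling uncertainty, which is worth quantifying before trusting a 94% gate. The sketch below computes a Wilson score interval for an observed accuracy; using a 95% interval (z = 1.96) is an assumption of this example, not something the scenario above specifies.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width
```

For example, 475 correct out of 500 is an observed accuracy of 95%, yet the lower end of the interval falls below 94%, so a single eval run that narrowly clears the gate is weaker evidence than it looks.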

An LLM-powered customer service bot is deployed with a prompt that routes queries to specialized handlers. The team versions their system prompt in a lightweight registry alongside eval scores. When a prompt change reduces hallucination rate from 3.2% to 1.8% on their eval set, it is merged. Token cost per conversation is tracked as a dashboard metric. A sudden 40% spike triggers an alert, and the team traces it to an accidental prompt change that caused the model to repeat itself.

06. Common Pitfalls and Misconceptions

"MLOps is just DevOps for models."
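The analogy holds at a high level but breaks down in detail. Conventional code can be checked with exact assertions: a given input should always produce the same output. A model's quality is a statistical property of its behavior across a data distribution, and LLM outputs can vary from call to call. Testing a model therefore requires statistical thinking that standard unit tests do not provide.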

"Eval sets age well."
They do not. An eval set assembled in 2023 may not reflect the failure modes of a product that has evolved since. Eval sets need active maintenance.

"Monitoring inputs is enough."
Output quality can degrade even when aggregate input statistics look stable: the relationship between inputs and correct outputs may have changed (concept drift), or the model may be sensitive to subtle shifts on edge cases that summary statistics miss. Both input and output monitoring are required.

"LLMOps is only for fine-tuned models."
Prompt engineering, cost, and quality all require operational discipline even when the underlying model weights are never touched. Most LLMOps work happens at the prompt and eval layer.

Key terms

MLOps: DevOps rigor applied to the full machine learning lifecycle.

LLMOps: MLOps extended for LLMs, adding prompt versioning, evals, and cost monitoring.

Model drift: Silent performance decay as data shifts after deployment.
