03. How It Works
The ML lifecycle
The classical ML lifecycle has six stages: data preparation, feature engineering, model training, evaluation, deployment, and monitoring. MLOps tools wire these stages together with automation and auditability.
Experiment tracking captures the hyperparameters, code version, data snapshot, and metrics for every training run. MLflow is a widely used open-source tool for this. A run records parameters via mlflow.log_param() and metrics via mlflow.log_metric(), and every run belongs to an experiment, which groups related runs together for comparison. MLflow 3 introduced direct model checkpointing, so individual checkpoints within a run can be ranked and compared.
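To make the run and experiment relationship concrete, here is a minimal standard-library sketch of what a tracker stores. The Run and Experiment classes are hypothetical stand-ins written for illustration, not MLflow's API; only the log_param and log_metric method names are borrowed from the calls mentioned above, and the commit hash and snapshot tag are made-up values.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training run: what is needed to reproduce and compare it."""
    run_id: str
    code_version: str      # e.g. a git commit hash (made-up value below)
    data_snapshot: str     # e.g. a dataset version tag (made-up value below)
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)   # name -> list of (step, value)

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value, step=0):
        self.metrics.setdefault(key, []).append((step, value))

    def latest(self, key):
        """Return the value logged at the highest step for a metric."""
        return max(self.metrics[key])[1]


@dataclass
class Experiment:
    """Groups related runs so they can be compared."""
    name: str
    runs: list = field(default_factory=list)

    def start_run(self, code_version, data_snapshot):
        run = Run(f"{self.name}-{len(self.runs) + 1}", code_version, data_snapshot)
        self.runs.append(run)
        return run

    def best_run(self, metric, higher_is_better=True):
        pick = max if higher_is_better else min
        return pick(self.runs, key=lambda r: r.latest(metric))


exp = Experiment("churn-model")
for lr in (0.1, 0.01):
    run = exp.start_run(code_version="abc1234", data_snapshot="2024-06-01")
    run.log_param("learning_rate", lr)
    # Fake two-step loss curve so the runs differ in their final metric.
    for step, loss in enumerate([1.0, 0.5 + lr]):
        run.log_metric("val_loss", loss, step=step)

best = exp.best_run("val_loss", higher_is_better=False)
print(best.run_id, best.params)
```

A real tracker would persist these records and attach artifacts, but the shape of the data is the same: parameters and stepped metrics hang off a run, and runs hang off an experiment.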
Model versioning and registries provide a centralized store for trained artifacts. MLflow's Model Registry assigns version numbers to registered models, supports aliases such as @champion and @challenger for deployment routing, and links each version back to the training run that produced it. Google Cloud's Vertex AI Model Registry and the Azure ML model registry serve the same function for cloud-hosted workflows.
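The versioning and alias mechanics can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not MLflow's client API: ModelRegistry, ModelVersion, the model name, the run IDs, and the artifact URIs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    run_id: str         # link back to the training run that produced it
    artifact_uri: str   # where the trained artifact lives (made-up URIs below)


class ModelRegistry:
    def __init__(self):
        self._versions = {}   # model name -> list of ModelVersion
        self._aliases = {}    # (model name, alias) -> version number

    def register(self, name, run_id, artifact_uri):
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, run_id, artifact_uri)
        versions.append(mv)
        return mv

    def set_alias(self, name, alias, version):
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self._aliases[(name, alias)] = version

    def resolve(self, ref):
        """Resolve a 'name@alias' reference to a concrete ModelVersion."""
        name, alias = ref.split("@", 1)
        return self._versions[name][self._aliases[(name, alias)] - 1]


registry = ModelRegistry()
v1 = registry.register("fraud-detector", run_id="run-017", artifact_uri="s3://models/fraud/1")
v2 = registry.register("fraud-detector", run_id="run-023", artifact_uri="s3://models/fraud/2")
registry.set_alias("fraud-detector", "champion", v1.version)
registry.set_alias("fraud-detector", "challenger", v2.version)

served_before = registry.resolve("fraud-detector@champion")
# Promotion is just moving the alias; serving code keeps asking for @champion.
registry.set_alias("fraud-detector", "champion", v2.version)
served_after = registry.resolve("fraud-detector@champion")
print(served_before.version, served_after.version, served_after.run_id)
```

The design point is that deployment code refers to the alias, not to a version number, so promoting a challenger does not require a code change on the serving side.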
CI/CD for ML adds automated training pipelines, evaluation gates, and promotion logic to standard CI/CD infrastructure. A commit that changes training code triggers a pipeline that retrains the model, runs evaluation, and promotes the new model to staging only if quality thresholds are met.
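An evaluation gate of this kind reduces to a small decision function. The sketch below assumes a single accuracy metric, an absolute floor, and a maximum allowed regression against the current staging model; the function names, thresholds, and the lambda stand-ins for the training and evaluation jobs are all illustrative assumptions.

```python
def promotion_decision(candidate, baseline, min_accuracy=0.90, max_regression=0.01):
    """Return (promote, reasons). An empty reasons list means every gate passed."""
    reasons = []
    if candidate["accuracy"] < min_accuracy:
        reasons.append(
            f"accuracy {candidate['accuracy']:.3f} below floor {min_accuracy:.3f}"
        )
    if candidate["accuracy"] < baseline["accuracy"] - max_regression:
        reasons.append("accuracy regressed against the current staging model")
    return not reasons, reasons


def run_pipeline(train, evaluate, baseline_metrics):
    """Retrain, evaluate, then promote to staging only if the gate passes."""
    model = train()
    metrics = evaluate(model)
    promote, reasons = promotion_decision(metrics, baseline_metrics)
    return {
        "stage": "staging" if promote else "rejected",
        "metrics": metrics,
        "reasons": reasons,
    }


result = run_pipeline(
    train=lambda: "model-v2",                    # stand-in for a training job
    evaluate=lambda model: {"accuracy": 0.93},   # stand-in for an evaluation job
    baseline_metrics={"accuracy": 0.92},
)
print(result["stage"], result["reasons"])
```

Returning the reasons alongside the decision keeps the pipeline auditable: a rejected run records why it was rejected.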
Feature stores decouple feature computation from model training, so that the features used at training time match the features served at inference time. They also enforce point-in-time correctness: each training example sees only the feature values that were available at the time of its label. This guards against temporal leakage, where future data contaminates training features, which is one of the most common bugs in production ML.
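Point-in-time correctness comes down to an "as of" lookup: for each label, take the latest feature value whose timestamp is not after the label's timestamp. A minimal sketch, assuming ISO date strings as timestamps (which sort correctly as text) and a hypothetical purchases_30d feature:

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Latest feature value with timestamp <= as_of, or None if none existed yet.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


def build_training_rows(labels, feature_history):
    """Join each label to the feature value that was known at label time."""
    rows = []
    for entity_id, label_time, label in labels:
        value = point_in_time_value(feature_history.get(entity_id, []), label_time)
        rows.append(
            {"entity": entity_id, "as_of": label_time, "purchases_30d": value, "label": label}
        )
    return rows


feature_history = {
    "user-1": [("2024-01-01", 3), ("2024-01-10", 5), ("2024-01-20", 9)],
}
labels = [("user-1", "2024-01-15", 1), ("user-1", "2023-12-31", 0)]

rows = build_training_rows(labels, feature_history)
for row in rows:
    print(row)
```

A naive join that always takes the most recent value would give the 2024-01-15 label the value written on 2024-01-20, which is exactly the leakage described above. Whether a value stamped at exactly the label time counts as available is a policy choice; this sketch includes it.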
Data and model drift monitoring compares the statistical distribution of incoming data and model outputs against a training-time baseline. Drift does not necessarily mean the model is wrong, but it is a signal that re-evaluation is needed. Statistical tests such as the Kolmogorov-Smirnov test, and divergence measures such as the Population Stability Index, are commonly applied. WhyLabs built an open-source toolkit (whylogs) specifically for this.
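Both measures are short enough to write out. The sketch below computes the Population Stability Index over equal-width bins taken from the baseline, and the two-sample Kolmogorov-Smirnov statistic as the largest gap between the two empirical CDFs. The bin count, the epsilon used for empty bins, and the 0.25 alert level (a commonly quoted rule of thumb) are assumptions, and no p-value is computed.

```python
import math
from bisect import bisect_right

PSI_ALERT = 0.25  # assumed alert level; tune per feature in practice


def population_stability_index(baseline, current, bins=10, eps=1e-6):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0   # avoid zero width for a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1   # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = proportions(baseline), proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


baseline = [i / 100 for i in range(100)]   # training-time feature values
shifted = [x + 0.5 for x in baseline]      # production values after a shift

psi_value = population_stability_index(baseline, shifted)
ks_value = ks_statistic(baseline, shifted)
print(f"PSI={psi_value:.2f} KS={ks_value:.2f} alert={psi_value > PSI_ALERT}")
```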
Retraining closes the loop. When monitoring signals indicate degradation, the pipeline triggers a new training run. When to retrain, how often, and on what data are questions specific to each use case, and answering them often requires a human decision point.
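One way to combine an automatic trigger with a human decision point is to require sustained drift before proposing a retrain, then wait for sign-off. The policy below is an illustrative assumption, not a recommendation: the threshold, the window of three consecutive readings, and the approved_by parameter are all hypothetical.

```python
def retraining_action(drift_scores, threshold=0.25, consecutive=3, approved_by=None):
    """Decide what to do given a chronological list of drift scores.

    Returns "none", "awaiting_approval", or "retrain".
    """
    recent = drift_scores[-consecutive:]
    sustained = len(recent) == consecutive and all(s > threshold for s in recent)
    if not sustained:
        return "none"   # a single spike is not enough to act on
    # Sustained drift proposes a retrain, but a person has to confirm it.
    return "retrain" if approved_by else "awaiting_approval"


print(retraining_action([0.10, 0.30, 0.10]))
print(retraining_action([0.10, 0.30, 0.40, 0.50]))
print(retraining_action([0.10, 0.30, 0.40, 0.50], approved_by="alice"))
```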
What is different for LLMs
Prompts replace weights as the primary artifact to version. A prompt is a text string, yet changing it can alter a model's behavior as dramatically as retraining. Prompt versions should be stored in a registry, associated with evaluation results, and subject to the same promotion gates as model versions.
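A prompt registry with a promotion gate can be sketched with a content hash and a stored evaluation score. Everything here is a hypothetical illustration: the PromptRegistry class, the support-triage prompt, the pass rates, and the 0.95 promotion threshold.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptVersion:
    version: int
    template: str
    digest: str                             # content hash of the template text
    eval_pass_rate: Optional[float] = None  # filled in after an eval run


class PromptRegistry:
    def __init__(self):
        self._versions = {}     # prompt name -> list of PromptVersion
        self._production = {}   # prompt name -> promoted version number

    def register(self, name, template):
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        versions = self._versions.setdefault(name, [])
        if versions and versions[-1].digest == digest:
            return versions[-1]   # unchanged text, so no new version
        pv = PromptVersion(len(versions) + 1, template, digest)
        versions.append(pv)
        return pv

    def record_eval(self, name, version, pass_rate):
        self._versions[name][version - 1].eval_pass_rate = pass_rate

    def promote(self, name, version, min_pass_rate=0.95):
        """Promote only a version that has been evaluated and clears the gate."""
        pv = self._versions[name][version - 1]
        if pv.eval_pass_rate is None or pv.eval_pass_rate < min_pass_rate:
            return False
        self._production[name] = version
        return True

    def production(self, name):
        version = self._production.get(name)
        return self._versions[name][version - 1] if version else None


reg = PromptRegistry()
v1 = reg.register("support-triage", "Classify the ticket: {ticket}")
same = reg.register("support-triage", "Classify the ticket: {ticket}")
v2 = reg.register("support-triage", "Classify the ticket as billing, bug, or other: {ticket}")
reg.record_eval("support-triage", 1, 0.97)
reg.record_eval("support-triage", 2, 0.91)
promoted_v1 = reg.promote("support-triage", 1)
promoted_v2 = reg.promote("support-triage", 2)
print(promoted_v1, promoted_v2, reg.production("support-triage").version)
```

In this made-up scenario the newer prompt scores below the gate, so the older version stays in production.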
Eval pipelines replace accuracy metrics computed on held-out test sets. Because LLM outputs are open-ended text, evaluation requires LLM-graded rubrics, human review, or both. Anthropic's evaluation framework asks developers to define success criteria that are specific, measurable, achievable, and relevant, then build automated graders (string match, code-based, or LLM-as-judge) against a representative dataset. OpenAI's Evals API provides infrastructure for running these pipelines at scale: an eval is a named configuration of test data schema and grading criteria, and a run executes a prompt against a dataset and returns per-criterion pass/fail counts.
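The grader-plus-dataset structure can be shown without calling any vendor API. The sketch below runs a string-match grader and a code-based grader over a tiny dataset and reports pass/fail counts per criterion. The model is a canned stub, the dataset is invented, and run_eval is a hypothetical helper, not Anthropic's or OpenAI's interface. An LLM-as-judge grader would be one more callable with the same (output, item) signature that sends a rubric to a model.

```python
def exact_match(output, item):
    """String-match grader: case-insensitive comparison with the expected answer."""
    return output.strip().lower() == item["expected"].strip().lower()


def is_concise(output, item, max_words=3):
    """Code-based grader: the answer must stay within a word budget."""
    return len(output.split()) <= max_words


def run_eval(generate, dataset, graders):
    """Run every grader on every item; return pass/fail counts per criterion."""
    counts = {name: {"passed": 0, "failed": 0} for name in graders}
    for item in dataset:
        output = generate(item["input"])
        for name, grade in graders.items():
            counts[name]["passed" if grade(output, item) else "failed"] += 1
    return counts


# Canned stand-in for a model call, so the sketch runs offline.
CANNED = {"2+2": "4", "capital of France": "paris"}


def fake_model(prompt):
    return CANNED.get(prompt, "unknown")


dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "largest planet", "expected": "Jupiter"},
]

results = run_eval(fake_model, dataset, {"exact_match": exact_match, "is_concise": is_concise})
print(results)
```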
Cost and latency monitoring is a first-class concern. Each API call to a foundation model has a direct dollar cost tied to token count. A prompt regression that increases output length by 30% raises the output-token portion of the bill by the same 30%, so it is also a cost regression. Latency SLAs require tracking time-to-first-token and total generation time per request.
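Both measurements are simple to instrument. The sketch below computes per-request cost from token counts and times a streamed response to get time-to-first-token and total generation time. The per-million-token prices are invented for the arithmetic and do not reflect any provider's price list, and fake_stream stands in for a real streaming response.

```python
import time

# Hypothetical USD prices per million tokens, for illustration only.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def request_cost(input_tokens, output_tokens, prices=PRICE_PER_MTOK):
    """Dollar cost of one request, linear in input and output token counts."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000


def timed_stream(chunks):
    """Consume a stream of text chunks; return (text, ttft_seconds, total_seconds)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start   # time to first token
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), ttft, total


def fake_stream():
    """Stand-in for a streaming model response."""
    time.sleep(0.02)
    yield "Hello"
    time.sleep(0.01)
    yield ", world"


baseline_cost = request_cost(input_tokens=1000, output_tokens=500)
regressed_cost = request_cost(input_tokens=1000, output_tokens=650)   # output +30%
increase = regressed_cost / baseline_cost - 1

text, ttft, total = timed_stream(fake_stream())
print(f"cost increase: {increase:.1%}, ttft: {ttft:.3f}s, total: {total:.3f}s")
```

With these assumed prices, the 30% increase in output tokens raises the total request cost by about 21%, because the input-token share of the bill is unchanged.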
Guardrails are runtime filters applied to LLM inputs and outputs. They cover toxicity, off-topic content, PII leakage, hallucinated content, and policy compliance. Their verdicts also feed monitoring: a violation can trigger an alert or block the response.
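A minimal output guardrail might redact one kind of PII and block on a policy term. The sketch below is deliberately narrow and rests on assumptions: the email pattern is a rough approximation, BLOCKED_TERMS is a made-up policy list, and real systems typically layer classifiers on top of rules like these.

```python
import re
from dataclasses import dataclass, field

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")   # rough pattern, not RFC-complete
BLOCKED_TERMS = {"internal-only"}                      # hypothetical policy list


@dataclass
class GuardrailResult:
    allowed: bool
    text: str                                 # text that is safe to return, if any
    violations: list = field(default_factory=list)


def apply_output_guardrails(text):
    """Redact emails, then block the response if a policy term appears."""
    violations = []
    redacted, n_emails = EMAIL.subn("[EMAIL]", text)
    if n_emails:
        violations.append("pii:email")        # redacted and reported, not blocked
    lowered = redacted.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        violations.append("policy:blocked_term")
        return GuardrailResult(False, "", violations)
    return GuardrailResult(True, redacted, violations)


ok = apply_output_guardrails("Reach jane.doe@example.com for details.")
blocked = apply_output_guardrails("This is from the internal-only roadmap.")
print(ok)
print(blocked)
```

The violations list is what a monitoring system would count and alert on, whether or not the response was blocked.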
Observability for LLMs means tracing the full context window, not just input/output pairs. MLflow's GenAI tracing module and tools like LangSmith capture the chain of reasoning, tool calls, and retrieved documents for each inference, making it possible to diagnose why a particular response was generated.
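The idea behind such traces is a tree of timed spans, each recording its inputs and output. The sketch below is a toy tracer built on the standard library to show that shape. It is not the MLflow or LangSmith API, and the span names, the retrieved document names, and the prompt are invented.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Collects nested spans for one request."""

    def __init__(self):
        self.spans = []    # every span, in start order
        self._stack = []   # currently open spans, innermost last

    @contextmanager
    def span(self, name, **inputs):
        record = {
            "name": name,
            "parent": self._stack[-1]["name"] if self._stack else None,
            "inputs": inputs,
            "output": None,
        }
        self.spans.append(record)
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - start
            self._stack.pop()


tracer = Tracer()
with tracer.span("answer_question", question="What is our refund window?") as root:
    with tracer.span("retrieve", query="refund window") as s:
        s["output"] = ["policy.md#refunds"]   # stand-in for retrieved documents
    with tracer.span("llm_call", prompt="Answer using policy.md#refunds") as s:
        s["output"] = "Refunds are accepted within 30 days."
    root["output"] = s["output"]

for sp in tracer.spans:
    print(sp["parent"], "->", sp["name"], sp["output"])
```

Because the retrieval span and the model call are recorded separately under one parent, a bad answer can be traced to either the documents that were retrieved or the prompt that was sent.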