AI & ML · 11 min
ML in Production: The Hard Parts
mlops · data-governance · machine-learning-production
Putting an ML model into production requires far more than exporting a trained artifact: it demands consistent feature computation, rigorous versioning and lineage, safe deployment practices, and continuous monitoring across data, model behavior, and system health. Treating ML as a governed lifecycle (data + model + operations) is essential to maintain reliability as data and business conditions change.
Context and problem statement
Many machine learning models look accurate in notebooks but underperform or fail once integrated into real systems. The root cause is rarely “the algorithm” alone; production ML is a socio-technical system that must reliably convert changing, messy operational data into predictions, at required latency and scale, with traceability, controls, and continuous feedback.
A practical way to frame the “hard parts” is to treat a model as part of a governed data product: it has inputs (features), transformations, an interface (serving), quality expectations (SLOs), and lifecycle management (change control).
Core concepts and definitions (production-focused)
Training vs. serving: Training typically uses historical datasets, while serving consumes live events or operational records. A model that performs well offline can fail online due to differences in data availability, timing, or preprocessing.
Training-serving skew: Any mismatch between how features are computed in training and in production (different logic, different reference data, different time windows, missing values handled differently).
Data drift and concept drift:
Data drift (covariate shift) is a change in the distribution of inputs.
Concept drift is a change in the relationship between inputs and outcomes.
Both can degrade performance and should be explicitly monitored.
Point-in-time correctness: When building training sets from historical data, feature values must reflect only what was known at prediction time (preventing target leakage and backfill bias); a small sketch follows this list.
Model artifact and lineage: The model binary plus its dependencies (code, parameters, training data version, feature definitions, environment) required for reproducibility and auditability.
Operational SLOs: Latency, throughput, availability, and error budgets for prediction services.
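To make point-in-time correctness concrete, here is a minimal sketch using pandas merge_asof; the tables, column names, and the customer-spend feature are illustrative assumptions, not a prescribed schema.

```python
import pandas as pd

# Labeled prediction events: what we want the model to predict, and when.
events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-03-01", "2024-04-01", "2024-03-15"]),
    "label": [0, 1, 0],
})

# Feature snapshots, stamped with the time each value became known.
features = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-02-20", "2024-03-20", "2024-03-10"]),
    "avg_spend_30d": [42.0, 55.5, 13.2],
})

# merge_asof joins each event to the most recent feature value at or before
# event_ts, so no future information leaks into the training set.
training_set = pd.merge_asof(
    events.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="customer_id",
    direction="backward",
)
print(training_set[["customer_id", "event_ts", "avg_spend_30d", "label"]])
```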
Lifecycle and governance foundations
A production ML system spans multiple disciplines, each covered by established practices:
Lifecycle management (MLOps/DevOps principles): Treat ML assets (code, data, features, models) as versioned, testable, deployable units with automated pipelines and controlled releases.
Data management (DAMA-DMBOK): Apply data governance, data quality management, metadata management, and master/reference data practices to the data powering the model.
Architecture and integration (TOGAF): Define clear architecture building blocks and interfaces across data sources, feature pipelines, serving, monitoring, and downstream applications.
Analytics/engineering lifecycle alignment (ADLC-style thinking): Manage requirements, development, testing, deployment, and operations as a continuous lifecycle rather than a one-time handoff.
The hard parts in practice
1) Feature engineering is an engineering system, not a notebook step
The most common production failures come from feature computation:
Dual pipelines: One pipeline generates features for training (batch) and another for serving (streaming/online). Divergent logic creates skew.
Freshness vs. correctness trade-offs: Real-time features require up-to-date data, but late-arriving events can corrupt aggregates.
Join complexity: Online joins across multiple operational stores can be slow or unreliable.
Practical patterns:
Single-source feature definitions: Define feature logic once and reuse it for both training and serving (or compile both paths from the same semantic definition); a minimal sketch follows this list of patterns.
Feature store or feature registry: Use a managed mechanism to store feature definitions, enforce consistency, and support both offline and online access.
Time-windowed aggregates with clear event-time semantics: Track event time vs. processing time, handle late data explicitly, and keep audit fields for recomputation.
Data quality gates on features: Validate ranges, null rates, categorical cardinality, and schema changes before features reach training or serving.
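As referenced above, here is a minimal sketch of a single feature definition shared by the batch and online paths, plus a simple quality gate. The feature name, the call sites, and the range threshold are illustrative assumptions.

```python
from datetime import datetime
from typing import Optional

def days_since_last_order(last_order_ts: Optional[datetime],
                          as_of: datetime) -> Optional[float]:
    """Single source of truth for the feature; called by both pipelines."""
    if last_order_ts is None or last_order_ts > as_of:
        return None  # missing or late-arriving data handled one way, everywhere
    return (as_of - last_order_ts).total_seconds() / 86400.0

# Offline (training): apply the same function over a historical DataFrame, e.g.
# df["days_since_last_order"] = df.apply(
#     lambda r: days_since_last_order(r["last_order_ts"], r["event_ts"]), axis=1)

# Online (serving): apply it to a single live record at request time.
def serve_features(record: dict, now: datetime) -> dict:
    return {"days_since_last_order":
            days_since_last_order(record.get("last_order_ts"), now)}

# A minimal data quality gate before the feature reaches training or serving.
def feature_gate(value: Optional[float]) -> bool:
    return value is None or 0.0 <= value <= 3650.0  # range check; tune per feature
```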
2) Reproducibility and lineage require versioning beyond “the model file”
Model performance depends on data and code, not just the model file. Versioning should cover the following (a minimal lineage-record sketch appears after this list):
Training dataset snapshot/version (including sampling rules and filters)
Feature definitions and reference data versions
Training code and configuration (hyperparameters, random seeds)
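A minimal sketch of what a lineage record for one training run might capture, written out as JSON; in practice an experiment tracker or model registry would store the same fields. All names and values below are illustrative assumptions.

```python
import json, hashlib, platform
from datetime import datetime, timezone

manifest = {
    "run_at": datetime.now(timezone.utc).isoformat(),
    "training_data_version": "orders_snapshot_2024_06_01",  # assumed dataset name
    "feature_definitions_version": "features_v12",          # assumed registry tag
    "code_commit": "abc1234",                                # e.g. a git SHA
    "hyperparameters": {"max_depth": 6, "learning_rate": 0.1},
    "random_seed": 42,
    "environment": {"python": platform.python_version()},
}

# Hash the manifest so any later change to the recorded inputs is detectable.
manifest["fingerprint"] = hashlib.sha256(
    json.dumps(manifest, sort_keys=True).encode()).hexdigest()

with open("training_run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```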
Define alert thresholds and runbooks tied to business impact (a drift-score threshold is sketched after this list).
Use champion–challenger comparisons where possible (compare current model to baseline).
Monitor data quality SLAs for upstream sources, not only the model endpoint.
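One way to operationalize a drift alert threshold, sketched here with the Population Stability Index (PSI) computed in NumPy. The 0.2 alert level is a common rule of thumb rather than a universal standard, and the feature samples are synthetic.

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline (training-time) and current feature sample."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    expected, _ = np.histogram(baseline, bins=edges)
    actual, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions; a small epsilon avoids log(0) and divide-by-zero.
    eps = 1e-6
    e = np.clip(expected / expected.sum(), eps, None)
    a = np.clip(actual / actual.sum(), eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)   # distribution seen at training time
current = rng.normal(0.4, 1.0, 5000)    # shifted live distribution

score = psi(baseline, current)
if score > 0.2:                          # assumed alert threshold
    print(f"PSI={score:.3f}: drift alert, follow the drift runbook")
```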
5) Retraining is a controlled pipeline, not an ad hoc refresh
Retraining strategies should be explicit and auditable:
Time-based retraining (e.g., monthly) when gradual drift is expected and labels arrive regularly.
Drift-triggered retraining when data distribution changes exceed thresholds.
Performance-triggered retraining when validated metrics degrade (requires reliable ground truth).
Pipeline best practices:
Separate candidate model training from promotion to production with approval gates (a minimal promotion check is sketched after this list).
Keep a stable baseline model for comparison.
Re-run data validation and unit/integration tests on every training run.
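A minimal sketch of a promotion (champion-challenger) gate, under the assumptions that AUC is the validated metric and that a 0.005 improvement margin is required; a human approval step would typically follow the automated check.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    auc: float
    data_validation_passed: bool

def should_promote(candidate: EvalResult, baseline: EvalResult,
                   min_improvement: float = 0.005) -> bool:
    if not candidate.data_validation_passed:
        return False  # never promote on top of failed data checks
    # Promote only if the challenger clearly beats the stable baseline.
    return candidate.auc >= baseline.auc + min_improvement

if should_promote(EvalResult(auc=0.871, data_validation_passed=True),
                  EvalResult(auc=0.862, data_validation_passed=True)):
    print("Candidate queued for approval and staged rollout")
```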
6) Explainability, privacy, and risk controls are production requirements
Many organizations need governance controls comparable to those applied to other critical systems:
Explainability: Provide model-appropriate explanations (global feature importance, local explanation methods) and document their limitations; a global-importance sketch follows this list.
Compliance and privacy: Minimize sensitive attributes, enforce access controls, and follow data retention policies.
Model risk management: Document intended use, known failure modes, and monitoring coverage; define escalation paths when harm or compliance risk is detected.
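For global explanations, here is a sketch using scikit-learn's permutation importance on a synthetic dataset. The model and data are placeholders, and local explanation methods (per-prediction attributions) would need a separate tool.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: how much does shuffling each feature hurt the score?
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
for i, mean_drop in enumerate(result.importances_mean):
    print(f"feature_{i}: score drop {mean_drop:.3f}")
```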
A reference architecture checklist (what to implement)
Data and feature layer
Offline training dataset builder with point-in-time correctness
Online feature retrieval designed for latency and reliability
Data quality checks and schema enforcement
Model lifecycle layer
Experiment tracking (parameters, data versions, metrics)
Model registry with versioning, lineage, and approval workflow
Automated training pipeline (CI for ML) and reproducible environments
Serving layer
Clear model API contracts (schemas, error handling); a request/response sketch follows this checklist
Release strategies (shadow, canary, A/B) and rollback procedures
SLOs for latency/availability plus capacity planning
Observability and operations
Data + model + system monitoring dashboards
Alerting with runbooks and incident management
Periodic evaluation reports and drift reviews
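To illustrate the model API contract item, here is a sketch of a request/response schema using pydantic; the field names, ranges, and constant score are illustrative assumptions, not a prescribed interface.

```python
from typing import Optional
from pydantic import BaseModel, Field

class PredictionRequest(BaseModel):
    customer_id: str
    days_since_last_order: float = Field(ge=0)  # reject nonsense inputs early
    avg_spend_30d: float = Field(ge=0)

class PredictionResponse(BaseModel):
    customer_id: str
    score: float = Field(ge=0, le=1)
    served_version: str                         # always report which model answered

def predict(req: PredictionRequest) -> PredictionResponse:
    # Real code would look up the model in a registry and run inference;
    # the constant score keeps the contract sketch self-contained.
    return PredictionResponse(customer_id=req.customer_id, score=0.5,
                              served_version="v1")
```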
Best practices and common pitfalls
Best practices:
Align feature computation across training and serving to minimize skew.
Treat data quality and metadata management as first-class production controls (owners, definitions, lineage).
Define success metrics at three levels: technical (AUC/RMSE), operational (latency/uptime), and business (conversion, cost, risk).
Build safe deployment paths and test rollback regularly.
Monitor segments and ground-truth lag explicitly; avoid “one global metric” monitoring.
Pitfalls:
Training on leaked or future information due to missing point-in-time controls.
Relying on manual retraining and undocumented thresholds.
Monitoring only endpoint uptime while ignoring input drift and prediction drift.
Shipping a model without clarifying how decisions will be overridden or handled in edge cases.
Summary of key takeaways
Production ML success depends on disciplined data management, controlled lifecycle practices, and robust operational monitoring. The hardest parts are typically feature consistency, reproducibility and lineage, safe deployment, and end-to-end observability. Approaching ML as a governed data product—with explicit contracts, quality controls, and lifecycle ownership—reduces failures and makes improvements sustainable.