MLOps Best Practices: Getting ML Models From Notebook to Production

Why MLOps Matters More Than Your Model

There is a persistent myth in the machine learning community that the model is the hard part. Data scientists spend months refining architectures, tuning hyperparameters, and squeezing out the last fraction of a percentage point in accuracy. Then the model sits in a notebook, waiting for someone to figure out how to deploy it. In many organisations, that wait is indefinite.

MLOps, the set of practices that bring machine learning models from experimentation to production and keep them running reliably, is the discipline that closes this gap. It borrows heavily from DevOps but must also address challenges specific to ML: data dependencies, model versioning, experiment tracking, feature engineering at scale, model drift detection, and the fundamental non-determinism of systems that learn from data.

This article is a practical guide to the MLOps practices we have found most impactful across dozens of production ML deployments. It is not a survey of every tool in the ecosystem; that landscape changes too rapidly for a static survey to stay useful. Instead, it focuses on the principles and patterns that remain constant regardless of which specific tools you choose.

The Uncomfortable Truth

In production ML systems, the model code typically represents less than five percent of the total codebase. The rest is data collection, data verification, feature extraction, serving infrastructure, monitoring, and configuration management. If your engineering investment does not reflect those proportions, you are likely under-investing in MLOps.

The Notebook-to-Production Gap

Understanding why the gap between notebook and production exists is the first step to closing it. A Jupyter notebook is a wonderful tool for exploration and experimentation. It allows data scientists to iterate quickly, visualise results inline, and tell a story through code. But a notebook is not a software artefact. It has no clear entry point, no dependency management, no error handling, no logging, no testing, and no separation of concerns. It is a scratchpad, not a blueprint.

The gap manifests in several specific ways. First, environment reproducibility: a notebook runs on a data scientist's laptop or a managed notebook service with a specific set of installed packages. Reproducing that environment reliably in a production setting is non-trivial. Second, data access patterns: notebooks typically pull data from static files or ad-hoc database queries, while production systems need to consume data from streaming sources, APIs, or feature stores with proper authentication and error handling. Third, compute requirements: training a model in a notebook uses whatever resources are available on the machine, while production training needs to scale across distributed compute, manage GPU allocation, and handle out-of-memory conditions gracefully.

Fourth, and perhaps most importantly, there is the operational gap. A notebook tells you nothing about how the model should behave when it receives malformed input, when the upstream data source is unavailable, when prediction latency exceeds acceptable thresholds, or when the model's accuracy degrades over time. These operational concerns are invisible during experimentation but dominate the production experience.

Closing this gap requires treating ML systems as software systems, with all the engineering discipline that implies, while accommodating the unique characteristics that make ML different from traditional software. That is the essence of MLOps.

Version Control for ML: Code, Data, and Models

Software engineers version-control their code. MLOps requires versioning three additional artefacts: data, models, and experiments. Each presents unique challenges that standard version control systems were not designed to handle.

Code Versioning

ML code should live in a proper Git repository, not in notebooks. The transition from notebook to modular Python (or R, or Julia) code is the first and most critical step in any MLOps journey. This does not mean abandoning notebooks entirely—they remain valuable for exploration and prototyping—but the code that trains, evaluates, and serves models must be structured as proper software with clear modules, functions, type hints, docstrings, and unit tests.
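As a rough illustration of what "structured as proper software" can mean, the sketch below takes the kind of data split that often lives in an anonymous notebook cell and turns it into a typed, documented function with its own unit test. The function name, the `SplitResult` type, and the ordered-split behaviour are our own assumptions for the example, not a prescribed layout.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SplitResult:
    train: List
    test: List


def train_test_split_ordered(rows: Sequence, test_fraction: float = 0.2) -> SplitResult:
    """Split rows into train and test sets, preserving their order.

    The last `test_fraction` of rows becomes the test set, which suits
    time-ordered data where shuffling would leak future information.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    n_test = max(1, int(round(len(rows) * test_fraction)))
    n_test = min(n_test, len(rows) - 1)  # always leave at least one training row
    cut = len(rows) - n_test
    return SplitResult(train=list(rows[:cut]), test=list(rows[cut:]))


def test_split_sizes() -> None:
    """A unit test that CI can run without any real data or GPU."""
    result = train_test_split_ordered(list(range(10)), test_fraction=0.2)
    assert result.train == [0, 1, 2, 3, 4, 5, 6, 7]
    assert result.test == [8, 9]
```

The point is less the split itself than the shape: explicit inputs, explicit failure modes, and a test that runs in seconds.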

Data Versioning

Data versioning is harder than code versioning because datasets are large, change frequently, and cannot be meaningfully diffed in the way source code can. Tools like DVC (Data Version Control) address this by storing pointers to data in Git while keeping the actual data in object storage. The principle is straightforward: for any given commit of model code, you should be able to retrieve the exact dataset that was used to train the model at that point in time. Without this capability, reproducibility is impossible.
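The pointer idea can be sketched in a few lines. The example below is not DVC's file format or API; it is a minimal, hypothetical illustration of the principle, in which a small JSON file containing a content hash is committed to Git while the data itself lives elsewhere. The function names and the pointer layout are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path, pointer_path: Path) -> dict:
    """Write a small JSON pointer that is cheap to commit alongside the code."""
    pointer = {
        "path": data_path.name,
        "sha256": file_sha256(data_path),
        "size_bytes": data_path.stat().st_size,
    }
    pointer_path.write_text(json.dumps(pointer, indent=2, sort_keys=True))
    return pointer


def verify_pointer(data_path: Path, pointer_path: Path) -> bool:
    """Return True if the data on disk matches the committed pointer."""
    pointer = json.loads(pointer_path.read_text())
    return pointer["sha256"] == file_sha256(data_path)


# Usage: a temporary directory stands in for the repository and the data store.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("id,label\n1,0\n2,1\n")
    pointer_file = Path(tmp) / "train.csv.pointer.json"
    write_pointer(data, pointer_file)
    unchanged = verify_pointer(data, pointer_file)
    data.write_text("id,label\n1,0\n2,0\n")  # simulate a silent change to the data
    change_detected = not verify_pointer(data, pointer_file)
```

A training job that refuses to start when `verify_pointer` fails gives you a cheap guarantee that the commit and the dataset still belong together.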

Model Versioning and Registry

A model registry is the central catalogue of all trained models in the organisation. It stores the model artefact (the serialised model file), along with metadata: which code version produced it, which dataset it was trained on, its evaluation metrics, who trained it, when, and its current lifecycle stage (experimental, staging, production, archived). The model registry is the single source of truth for which model is currently serving in production and provides a clear audit trail for model lineage.
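To make the metadata concrete, here is a minimal in-memory sketch of a registry holding the fields listed above. It is an illustration under our own assumptions (class names, field names, and the rule that promoting a version archives the previous production version), not the interface of any particular registry product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List, Optional


class Stage(Enum):
    EXPERIMENTAL = "experimental"
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"


@dataclass
class ModelVersion:
    name: str
    version: int
    artefact_uri: str        # where the serialised model file lives
    code_commit: str         # which code version produced it
    dataset_hash: str        # which dataset it was trained on
    metrics: Dict[str, float]
    trained_by: str
    stage: Stage = Stage.EXPERIMENTAL
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ModelRegistry:
    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, artefact_uri: str, code_commit: str,
                 dataset_hash: str, metrics: Dict[str, float], trained_by: str) -> ModelVersion:
        versions = self._versions.setdefault(name, [])
        entry = ModelVersion(name, len(versions) + 1, artefact_uri, code_commit,
                             dataset_hash, dict(metrics), trained_by)
        versions.append(entry)
        return entry

    def promote_to_production(self, name: str, version: int) -> None:
        """Promote one version and archive whichever version was serving before."""
        versions = self._versions.get(name, [])
        target = next((v for v in versions if v.version == version), None)
        if target is None:
            raise KeyError(f"{name} v{version} is not registered")
        for v in versions:
            if v.stage is Stage.PRODUCTION:
                v.stage = Stage.ARCHIVED
        target.stage = Stage.PRODUCTION

    def production_version(self, name: str) -> Optional[ModelVersion]:
        return next((v for v in self._versions.get(name, [])
                     if v.stage is Stage.PRODUCTION), None)

    def history(self, name: str) -> List[ModelVersion]:
        return list(self._versions.get(name, []))


# Usage with hypothetical names, URIs, and hashes.
registry = ModelRegistry()
registry.register("churn", "s3://example-bucket/churn/1.pkl", "abc123", "d41d8c", {"auc": 0.84}, "alice")
registry.register("churn", "s3://example-bucket/churn/2.pkl", "def456", "9e107d", {"auc": 0.87}, "bob")
registry.promote_to_production("churn", 1)
registry.promote_to_production("churn", 2)
live = registry.production_version("churn")
```

Because the archived version keeps its artefact URI and lineage, rolling back is a matter of promoting it again rather than hunting for a file.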

From the Field

We worked with a financial services client who had over forty trained models scattered across team members' laptops and shared drives. Nobody knew which model was running in production, when it had last been retrained, or what data it had been trained on. Implementing a model registry took two weeks and immediately eliminated a class of operational risks that had been accumulating for years.

Experiment Tracking

Every model training run is an experiment with specific inputs (data, hyperparameters, code) and outputs (metrics, artefacts, logs). Experiment tracking systems capture this information automatically, allowing data scientists to compare runs, reproduce results, and make informed decisions about which approaches are worth pursuing. The discipline of tracking every experiment, not just the successful ones, is essential for building institutional knowledge about what works and what does not.
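A tracking system can start very small. The sketch below records each run, including a failed one, as a JSON line and then picks the best completed run for a metric. The record fields and function names are hypothetical, and an in-memory buffer stands in for whatever file or tracking server you would use in practice.

```python
import io
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, List, Optional, TextIO


def log_run(sink: TextIO, params: Dict, metrics: Dict[str, float],
            code_commit: str, data_hash: str, status: str = "completed") -> Dict:
    """Append one training run as a JSON line; failed runs are recorded too."""
    record = {
        "run_id": uuid.uuid4().hex,
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "code_commit": code_commit,
        "data_hash": data_hash,
        "params": params,
        "metrics": metrics,
        "status": status,
    }
    sink.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(text: str) -> List[Dict]:
    """Parse the JSON-lines log back into a list of run records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def best_run(runs: List[Dict], metric: str, higher_is_better: bool = True) -> Optional[Dict]:
    """Pick the best completed run for a metric, or None if no run reported it."""
    candidates = [r for r in runs if r["status"] == "completed" and metric in r["metrics"]]
    if not candidates:
        return None
    chooser = max if higher_is_better else min
    return chooser(candidates, key=lambda r: r["metrics"][metric])


# Usage: the commit and data hash values are placeholders.
sink = io.StringIO()
log_run(sink, {"lr": 0.1, "depth": 4}, {"auc": 0.81}, "abc123", "d41d8c")
log_run(sink, {"lr": 0.01, "depth": 6}, {"auc": 0.86}, "abc123", "d41d8c")
log_run(sink, {"lr": 1.0, "depth": 6}, {}, "abc123", "d41d8c", status="failed")
runs = load_runs(sink.getvalue())
winner = best_run(runs, "auc")
```

Keeping the failed run in the log is deliberate: knowing that a learning rate of 1.0 diverged saves the next person from rediscovering it.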

CI/CD for Machine Learning

Continuous integration and continuous deployment are well-established practices in software engineering. Applying them to ML requires extending the traditional CI/CD pipeline to accommodate the unique artefacts and validation requirements of ML systems.

Continuous Integration for ML

CI for ML goes beyond running unit tests on code. A comprehensive ML CI pipeline includes several stages. Code quality checks ensure that training and serving code passes linting, type checking, and formatting standards. Unit tests verify that individual functions behave correctly with known inputs. Data validation checks confirm that the training data conforms to expected schemas, statistical distributions, and quality thresholds. Training smoke tests run a quick training cycle on a small data subset to verify that the training pipeline executes end-to-end without errors. Model validation tests evaluate the trained model against a held-out test set and compare its performance to the currently deployed model and to predefined minimum performance thresholds.

The critical insight is that CI for ML must validate data and models, not just code. A code change that passes all unit tests might still produce a model that performs worse than the current production model. A data pipeline change that passes schema validation might introduce a subtle distribution shift that degrades model accuracy. The CI pipeline must catch these issues before they reach production.
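The two ML-specific stages, data validation and model validation, can be sketched as plain functions that return a list of problems for the CI job to fail on. The schema, column names, thresholds, and the assumption that every metric is higher-is-better are ours, chosen only to keep the example short.

```python
from typing import Dict, List, Mapping, Sequence

# Hypothetical schema for the example: column name -> expected Python type.
EXPECTED_SCHEMA = {"age": int, "income": float, "label": int}


def validate_rows(rows: Sequence[Mapping], schema: Mapping[str, type],
                  max_missing_rate: float = 0.05) -> List[str]:
    """Return human-readable problems; an empty list means the data passes."""
    if not rows:
        return ["dataset is empty"]
    problems = []
    for column, expected_type in schema.items():
        missing = sum(1 for r in rows if r.get(column) is None)
        if missing / len(rows) > max_missing_rate:
            problems.append(f"{column}: missing rate {missing / len(rows):.1%} exceeds limit")
        wrong = sum(1 for r in rows
                    if r.get(column) is not None and not isinstance(r[column], expected_type))
        if wrong:
            problems.append(f"{column}: {wrong} value(s) are not {expected_type.__name__}")
    return problems


def model_gate(candidate: Mapping[str, float], production: Mapping[str, float],
               minimums: Mapping[str, float], tolerance: float = 0.0) -> List[str]:
    """Compare a candidate to absolute floors and to the deployed model.

    Assumes every metric is higher-is-better.
    """
    problems = []
    for metric, floor in minimums.items():
        value = candidate.get(metric)
        if value is None:
            problems.append(f"{metric}: not reported by candidate")
            continue
        if value < floor:
            problems.append(f"{metric}: {value} is below the minimum {floor}")
        baseline = production.get(metric)
        if baseline is not None and value < baseline - tolerance:
            problems.append(f"{metric}: {value} is worse than production {baseline}")
    return problems


# Usage with made-up rows and metrics.
rows = [
    {"age": 34, "income": 52000.0, "label": 0},
    {"age": 51, "income": None, "label": 1},
    {"age": "unknown", "income": 61000.0, "label": 0},
]
data_problems = validate_rows(rows, EXPECTED_SCHEMA)
gate_problems = model_gate({"auc": 0.84}, {"auc": 0.86}, {"auc": 0.80}, tolerance=0.01)
```

Here the candidate clears the absolute floor but still fails the gate because it is worse than the deployed model, which is exactly the case that code-only CI would miss.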

Continuous Deployment for ML

CD for ML is more nuanced than deploying a new version of a web application. Model deployment typically follows a staged process: the new model is first deployed to a staging environment where it serves shadow traffic (receiving the same inputs as the production model but without its outputs being used for real decisions). If shadow performance is satisfactory, the model advances to a canary deployment where a small percentage of real traffic is routed to the new model. If canary metrics are healthy, the model is gradually rolled out to full production traffic.

Automated rollback is essential. If production metrics degrade after a model deployment, the system must be able to automatically revert to the previous model version without human intervention. This requires clear rollback criteria defined in advance and monitoring that triggers rollback when those criteria are met.
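A minimal sketch of the canary split and the rollback check might look like the following. The hashing scheme, metric names, and default thresholds are assumptions for illustration; real criteria should come from the service's own objectives and be agreed before the deployment starts.

```python
import hashlib
from dataclasses import dataclass
from typing import Mapping


def routes_to_canary(request_id: str, canary_percent: float) -> bool:
    """Deterministically assign a request to the canary from a hash of its id.

    The same id always lands on the same side, which keeps a user's
    experience stable for the duration of the rollout.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100


@dataclass(frozen=True)
class RollbackCriteria:
    max_error_rate: float = 0.02
    max_p95_latency_ms: float = 250.0
    max_quality_drop: float = 0.03  # allowed drop relative to the current model


def should_roll_back(canary: Mapping[str, float], baseline: Mapping[str, float],
                     criteria: RollbackCriteria) -> bool:
    """Return True if any criterion defined in advance has been breached."""
    return (
        canary["error_rate"] > criteria.max_error_rate
        or canary["p95_latency_ms"] > criteria.max_p95_latency_ms
        or baseline["quality"] - canary["quality"] > criteria.max_quality_drop
    )


# Usage with made-up request ids and metrics.
canary_share = sum(routes_to_canary(f"req-{i}", 5.0) for i in range(10_000)) / 10_000
baseline = {"error_rate": 0.004, "p95_latency_ms": 120.0, "quality": 0.90}
healthy = {"error_rate": 0.005, "p95_latency_ms": 130.0, "quality": 0.91}
degraded = {"error_rate": 0.005, "p95_latency_ms": 130.0, "quality": 0.82}
criteria = RollbackCriteria()
keep_canary = not should_roll_back(healthy, baseline, criteria)
revert = should_roll_back(degraded, baseline, criteria)
```

Widening the rollout is then just a matter of raising `canary_percent` in steps while the check keeps passing.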

Continuous Training

Beyond CI/CD, mature MLOps practices include continuous training (CT): automated pipelines that retrain models on fresh data at regular intervals or when data drift is detected. CT ensures that models stay current as the underlying data distribution evolves, without requiring manual intervention from the data science team.
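The trigger logic for continuous training is usually simple; the hard part is the pipeline behind it. As a sketch, assuming a single drift score and a maximum model age (both hypothetical parameters), the decision could be expressed as:

```python
from datetime import datetime, timedelta
from typing import Optional


def retrain_reason(last_trained_at: datetime, now: datetime, max_age: timedelta,
                   drift_score: float, drift_threshold: float) -> Optional[str]:
    """Return why a retrain should start, or None if the model is still fresh."""
    if drift_score >= drift_threshold:
        return "drift"
    if now - last_trained_at >= max_age:
        return "schedule"
    return None


# Usage with made-up dates and scores.
now = datetime(2024, 3, 1)
fresh = retrain_reason(datetime(2024, 2, 20), now, timedelta(days=30), 0.05, 0.25)
stale = retrain_reason(datetime(2024, 1, 15), now, timedelta(days=30), 0.05, 0.25)
drifted = retrain_reason(datetime(2024, 2, 20), now, timedelta(days=30), 0.40, 0.25)
```

Returning the reason rather than a bare boolean makes the retraining log easier to audit later.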

Feature Stores: The Missing Infrastructure

Feature engineering is the process of transforming raw data into the input variables that ML models consume. In most organisations, this process is duplicated across every model: each data scientist writes their own feature engineering code, often with subtle inconsistencies that make it impossible to share features across models or ensure that training-time and serving-time feature computation is identical.

A feature store centralises feature engineering by providing a shared repository of curated, documented, versioned features that can be used across multiple models. It serves two critical functions. First, it eliminates training-serving skew by ensuring that the exact same feature computation logic is used during both model training and real-time inference. Second, it accelerates model development by allowing data scientists to browse and reuse existing features rather than re-engineering them from scratch for every new project.

The architecture of a feature store typically includes an offline store for batch features used during training (backed by a data warehouse or object storage), an online store for real-time features used during inference (backed by a low-latency key-value store), and a feature registry that catalogues available features with their definitions, owners, freshness guarantees, and data lineage.
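The core idea behind eliminating training-serving skew is that one feature definition feeds both stores. The sketch below shows this with a plain function, a list for the offline path, and a dictionary standing in for the online key-value store; the feature names, the customer fields, and the `OnlineStore` class are illustrative assumptions rather than any framework's API.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def compute_features(customer: Dict, as_of: datetime) -> Dict[str, float]:
    """Single definition of the feature logic, shared by training and serving."""
    days_since = max(0, (as_of - customer["last_order_at"]).days)
    orders = customer["order_count"]
    avg_value = customer["total_spend"] / orders if orders else 0.0
    return {"days_since_last_order": float(days_since), "avg_order_value": avg_value}


def build_training_rows(customers: Iterable[Dict], as_of: datetime) -> List[Dict]:
    """Offline path: batch-compute features for a training set."""
    return [{"customer_id": c["customer_id"], **compute_features(c, as_of)} for c in customers]


class OnlineStore:
    """Online path: a dict standing in for a low-latency key-value store."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, float]] = {}

    def materialise(self, customers: Iterable[Dict], as_of: datetime) -> None:
        for c in customers:
            self._rows[c["customer_id"]] = compute_features(c, as_of)

    def get(self, customer_id: str) -> Dict[str, float]:
        return self._rows[customer_id]


# Usage with made-up customers.
customers = [
    {"customer_id": "c1", "last_order_at": datetime(2024, 1, 1), "order_count": 4, "total_spend": 200.0},
    {"customer_id": "c2", "last_order_at": datetime(2024, 1, 20), "order_count": 0, "total_spend": 0.0},
]
as_of = datetime(2024, 1, 31)
training_rows = build_training_rows(customers, as_of)
store = OnlineStore()
store.materialise(customers, as_of)
served = store.get("c1")
```

Because both paths call `compute_features`, a change to the feature logic cannot reach training without also reaching serving.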

Not every organisation needs a feature store from day one. If you have fewer than five models in production and a small data science team, the overhead of building and maintaining a feature store may not be justified. But as the number of models grows, the cost of duplicated feature engineering, the risk of training-serving skew, and the difficulty of maintaining consistency across teams all increase. At that point, a feature store transitions from a nice-to-have to an essential piece of infrastructure.

The build-versus-buy decision for feature stores has become significantly easier in recent years with the maturation of open-source options and managed cloud services. We generally recommend starting with an open-source framework and migrating to a managed service as the operational burden grows, rather than building a custom feature store from scratch.

Model Monitoring and Observability

A model in production is not a static artefact. Its performance will degrade over time as the data it encounters in the real world diverges from the data it was trained on. This phenomenon, known as model drift, is the most important operational challenge in production ML, and addressing it requires a comprehensive monitoring strategy that goes well beyond tracking a few accuracy metrics.

What to Monitor

Effective model monitoring operates at multiple levels. Infrastructure monitoring tracks compute utilisation, memory consumption, prediction latency, throughput, and error rates—the same operational metrics you would track for any production service. Data monitoring tracks the statistical properties of incoming data: feature distributions, missing value rates, outlier frequencies, and schema violations. Detecting data quality issues before they affect model predictions is far better than debugging mysterious accuracy drops after the fact.

Model performance monitoring tracks the model's predictive accuracy over time, comparing its predictions against ground truth labels as they become available. In some applications, ground truth arrives relatively quickly (a fraud detection model can be validated as fraud cases are confirmed). In others, ground truth is delayed (a churn prediction model must wait months to learn whether a customer actually churned). The monitoring strategy must account for this delay and use proxy metrics where ground truth is not immediately available.
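One way to handle delayed labels is to report accuracy only on the predictions whose labels have arrived, alongside the label coverage and a proxy that needs no labels at all. The sketch below assumes binary predictions keyed by an id and uses the predicted positive rate as the proxy; both choices are illustrative.

```python
from typing import Dict, Optional


def monitor_with_delayed_labels(predictions: Dict[str, int],
                                labels: Dict[str, int]) -> Dict[str, Optional[float]]:
    """Summarise model health when only some ground truth has arrived.

    predictions: id -> predicted class (0 or 1)
    labels:      id -> true class, present only for ids whose outcome is known
    """
    if not predictions:
        raise ValueError("no predictions to monitor")
    matched = [(pred, labels[key]) for key, pred in predictions.items() if key in labels]
    accuracy = sum(1 for p, y in matched if p == y) / len(matched) if matched else None
    return {
        "label_coverage": len(matched) / len(predictions),
        "accuracy_on_labelled": accuracy,  # None until the first labels arrive
        "predicted_positive_rate": sum(predictions.values()) / len(predictions),  # proxy metric
    }


# Usage with made-up ids: only two of four outcomes are known so far.
predictions = {"u1": 1, "u2": 0, "u3": 1, "u4": 0}
labels = {"u1": 1, "u2": 1}
report = monitor_with_delayed_labels(predictions, labels)
```

Reporting coverage next to accuracy matters: an accuracy figure computed on a small, early-arriving slice of labels can be badly unrepresentative.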

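Drift detection specifically monitors for shifts in the input data distribution (data drift) and changes in the relationship between inputs and outputs (concept drift). Statistical tests and distance measures such as the Kolmogorov-Smirnov test, the Population Stability Index, and Jensen-Shannon divergence can quantify the degree of distribution shift and trigger retraining when thresholds are exceeded. Concept drift usually only becomes visible once ground truth labels arrive, so it is tracked through the model performance metrics described above.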
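Two of these measures are simple enough to sketch with the standard library. The example below computes the Population Stability Index over equal-width bins defined on the reference sample, and the two-sample Kolmogorov-Smirnov statistic as the largest gap between the empirical CDFs. The binning scheme and the small floor applied to empty bins are our assumptions; in practice you would more likely use a statistics library, which also provides p-values.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def _bin_fractions(values: Sequence[float], edges: List[float], eps: float = 1e-6) -> List[float]:
    """Fraction of values per bin, floored at eps so the logarithm stays finite."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(reference: Sequence[float], current: Sequence[float],
                               n_bins: int = 10) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref)."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / n_bins
    # Interior edges only: the first and last bins are open-ended, so current
    # values outside the reference range are still counted.
    edges = [lo + i * width for i in range(1, n_bins)]
    ref = _bin_fractions(reference, edges)
    cur = _bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted) - bisect_right(b_sorted, x) / len(b_sorted))
        for x in points
    )


# Usage: a synthetic feature whose values shift upwards by 500.
reference = list(range(1000))
shifted = [x + 500 for x in reference]
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
ks_shifted = ks_statistic(reference, shifted)
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but thresholds should be calibrated per feature rather than taken as universal.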

Alerting and Response

Monitoring without actionable alerting is just data collection. Alerts must be tiered by severity: informational alerts for minor drift, warning alerts for performance degradation that has not yet crossed the acceptable threshold, and critical alerts for situations requiring immediate intervention such as serving errors or dramatic accuracy drops. Each alert tier should have a defined response procedure, including who is responsible, what immediate actions to take, and when to escalate.
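The three tiers can be encoded directly, which forces the thresholds and responses to be written down. In the sketch below, the policy values, the metric names, and the runbook entries are all hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    NONE = 0
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class AlertPolicy:
    info_drift_score: float = 0.1        # minor drift worth noting
    warning_accuracy_drop: float = 0.02  # degraded, but still above the floor
    accuracy_floor: float = 0.80         # the acceptable threshold
    critical_error_rate: float = 0.05    # serving errors needing immediate action


def classify(drift_score: float, accuracy: float, baseline_accuracy: float,
             error_rate: float, policy: AlertPolicy = AlertPolicy()) -> Severity:
    """Map current metrics to the highest applicable alert tier."""
    if error_rate >= policy.critical_error_rate or accuracy < policy.accuracy_floor:
        return Severity.CRITICAL
    if baseline_accuracy - accuracy >= policy.warning_accuracy_drop:
        return Severity.WARNING
    if drift_score >= policy.info_drift_score:
        return Severity.INFO
    return Severity.NONE


# Each tier maps to a defined response; owners and actions here are placeholders.
RUNBOOK = {
    Severity.INFO: "Model owner reviews at the next scheduled health check.",
    Severity.WARNING: "Model owner investigates within one working day and considers retraining.",
    Severity.CRITICAL: "On-call engineer is paged and rolls back to the previous model version.",
}

# Usage with made-up metrics.
minor_drift = classify(drift_score=0.15, accuracy=0.90, baseline_accuracy=0.90, error_rate=0.001)
degrading = classify(drift_score=0.15, accuracy=0.86, baseline_accuracy=0.90, error_rate=0.001)
below_floor = classify(drift_score=0.0, accuracy=0.75, baseline_accuracy=0.90, error_rate=0.001)
```

Checking the tiers from most to least severe means a single evaluation never raises a lower alert when a higher one applies.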

The best monitoring systems we have built are the ones that tell you a model is going to fail before it actually fails. Proactive drift detection buys you days or weeks of lead time to retrain, rather than discovering the problem when a business stakeholder notices the predictions no longer make sense.

Team and Organisational Patterns

MLOps is not purely a technology problem. The organisational structure and team topology have a profound effect on whether ML systems reach production and stay healthy once deployed.

The ML Platform Team

As the number of ML models in production grows, a dedicated ML platform team becomes essential. This team builds and maintains the shared infrastructure that all ML projects use: the training pipelines, the model registry, the feature store, the serving infrastructure, and the monitoring systems. The platform team does not build models—that remains the responsibility of the data science teams embedded in business units. Instead, the platform team provides the tools, guardrails, and abstractions that enable data scientists to deploy models reliably without needing deep infrastructure expertise.

The ML Engineer Role

The ML engineer sits at the intersection of data science and software engineering. They understand both the mathematical foundations of ML and the engineering practices required for production systems. In our experience, the single most impactful hire an organisation can make to increase the share of its models that reach production is its first ML engineer. This person bridges the gap between data scientists who build models and platform engineers who build infrastructure, translating experimental code into production-ready services.

The specific responsibilities of an ML engineer typically include refactoring notebook code into modular, tested, production-quality software; designing and implementing training and serving pipelines; setting up monitoring and alerting for production models; optimising model inference for latency and throughput; and managing model deployments, rollbacks, and retraining cycles.

Embedding vs Centralising

The question of whether data science and ML engineering teams should be centralised or embedded within business units is a perennial debate. Our experience suggests a hybrid model works best: a centralised ML platform team that provides shared infrastructure and standards, with embedded data scientists and ML engineers who work within business units and understand the domain context. The centralised team ensures consistency and prevents duplication of effort; the embedded teams ensure that ML projects are driven by genuine business needs and have the domain expertise needed for effective feature engineering and model evaluation.

Conclusion: Start With the Basics

MLOps can seem overwhelming, particularly for organisations that are just beginning their ML journey. The ecosystem is vast, the tooling landscape changes rapidly, and the gap between current state and best practice can feel insurmountable. But the path forward is incremental, and the first steps are straightforward.

Start with version control. Get your training code out of notebooks and into Git. Add basic CI that runs tests and validates data before training. Implement a model registry so you know what is running in production and can roll back if needed. Add monitoring that tracks at least prediction volume, latency, error rates, and basic input data distributions. These four capabilities—version control, CI, model registry, and monitoring—form the minimum viable MLOps practice that every organisation running ML in production should have.

From there, expand incrementally. Add a feature store when feature engineering becomes a bottleneck. Implement continuous training when model freshness becomes critical. Build out CD pipelines with shadow deployment and canary releases when deployment frequency increases. Each addition builds on the last, and each delivers immediate value.

The goal is not to implement every MLOps practice simultaneously. It is to build the engineering discipline that allows your ML systems to improve reliably over time, just as your software systems do.

MLOps is not a destination. It is a practice. The organisations that treat it as an ongoing discipline, rather than a one-off implementation, are the ones whose AI systems consistently deliver and sustain value.
