As machine learning moves towards automated and streamlined tasks and processes, it is crucial to keep track of the many runs, metrics and versions of models. Version control practices, as commonly used in software development lifecycles, have numerous known benefits.
Why ML versioning is important
- Finding the best model from multiple runs and hyperparameter settings
- Failure tolerance - to revert to working models in case of failure
- Dependency tracking with regard to datasets and frameworks
- Staged deployment for update cycles
- AI/ML governance - control access, implement policy and maintain models
MLflow
MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It has four primary components: MLflow Tracking, MLflow Projects, MLflow Models and MLflow Model Registry. You can read more about them in the MLflow documentation. For versioning, we utilise MLflow Tracking, which records experiments so that parameters and results can be compared.
MLflow tracking provides an API and UI for logging parameters, code versions, metrics and output files when running machine learning code.
The following information is recorded for each run: code version, start and end time, parameters used, model metrics and output artifacts from the model. Tracking is machine learning library agnostic, and runs can be recorded through multiple MLflow APIs: Python, R, Java and REST.
MLflow runs can be logged to a local directory, to a database or to a remote tracking server. For a local directory, as shown in the figure below, the artifact and backend stores are both located in the ./mlruns folder. For remote tracking, cloud services such as AWS S3 and RDS can be used for artifact and backend storage respectively, with AWS SageMaker as the machine learning workbench.
Below is a basic Python example:
Install the scikit-learn (sklearn), mlflow and joblib libraries with pip or conda beforehand.
MLflow offers a wide range of options for versioning, along with integrations with popular libraries and cloud providers.
Resources
- For more information, have a look at MLflow documentation
Other ML versioning and lineage tools include:
- Data Version Control (DVC): Open Source Version Control System for Machine Learning Projects
- Pachyderm: Data Lineage with End-to-End Pipelines