The MLflow Tracking component is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results. MLflow Tracking lets you log and query experiments using the Python, REST, R, and Java APIs.
Table of Contents
- Where Runs Are Recorded
- Logging Data to Runs
- Organizing Runs in Experiments
- Tracking UI
- Querying Runs Programmatically
- MLflow Tracking Servers
MLflow Tracking is organized around the concept of runs, which are executions of some piece of data science code. Each run records the following information:
- Code Version: Git commit hash used for the run, if it was run from an MLflow Project.
- Start & End Time: Start and end time of the run.
- Source: Name of the file used to launch the run, or the project name and entry point for the run if run from an MLflow Project.
- Parameters: Key-value input parameters of your choice. Both keys and values are strings.
- Metrics: Key-value metrics where the value is numeric. Each metric can be updated throughout the course of the run (for example, to track how your model’s loss function is converging), and MLflow records and lets you visualize the metric’s full history.
- Artifacts: Output files in any format. For example, you can record images (for example, PNGs), models (for example, a pickled scikit-learn model), or even data files (for example, a Parquet file) as artifacts.
You can record runs using MLflow Python, R, Java, and REST APIs from anywhere you run your code. For example, you can record them in a standalone program, on a remote cloud machine, or in an interactive notebook. If you record runs in an MLflow Project, MLflow remembers the project URI and source version.
You can optionally organize runs into experiments, which group together runs for a
specific task. You can create an experiment using the
mlflow experiments CLI, with
mlflow.create_experiment(), or using the corresponding REST parameters. The MLflow API and
UI let you create and search for experiments.
Once your runs have been recorded, you can query them using the Tracking UI or the MLflow API.
MLflow runs can be recorded to local files, to a SQLAlchemy compatible database, or remotely
to a tracking server.
By default, the MLflow Python API logs runs locally to files in an
mlruns directory wherever you
ran your program. You can then run
mlflow ui to see the logged runs.
To log runs remotely, set the
MLFLOW_TRACKING_URI environment variable to a tracking server’s URI or call mlflow.set_tracking_uri().
There are different kinds of remote tracking URIs:
- Local file path (specified as
file:/my/local/dir), where data is just directly stored locally.
- Database encoded as a connection string (specified as
<dialect>+<driver>://<username>:<password>@<host>:<port>/<database>). MLflow supports the dialects mysql, mssql, sqlite, and postgresql.
- HTTP server (specified as
https://my-server:5000), which is a server hosting an MLflow tracking server.
- Databricks workspace (specified as
databricks://<profileName>), where <profileName> is a Databricks CLI profile.
You can log data to runs using the MLflow Python, R, Java, or REST API. This section shows the Python API.
mlflow.set_tracking_uri() connects to a tracking URI. You can also set the
MLFLOW_TRACKING_URI environment variable to have MLflow find a URI from there. In both cases,
the URI can either be an HTTP/HTTPS URI for a remote server, a database connection string, or a
local path to log data to a directory. The URI defaults to a local mlruns directory.
mlflow.tracking.get_tracking_uri() returns the current tracking URI.
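As a minimal sketch (the server address below is a placeholder, not a real endpoint), you can point MLflow at a tracking URI and read it back:

import mlflow

# Placeholder URI; this could equally be a database connection string or a local path.
mlflow.set_tracking_uri("http://my-tracking-server:5000")
print(mlflow.tracking.get_tracking_uri())  # prints the URI set above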
mlflow.create_experiment() creates a new experiment and returns its ID. Runs can be
launched under the experiment by passing the experiment ID to mlflow.start_run().
mlflow.set_experiment() sets an experiment as active. If the experiment does not exist, it
creates a new experiment. If you do not specify an experiment in mlflow.start_run(), new
runs are launched under this experiment.
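For example, a brief sketch of working with experiments from Python (the experiment names and metric are illustrative):

import mlflow

# set_experiment creates the experiment if it does not exist and makes it active.
mlflow.set_experiment("fraud-detection")

# Runs started without an explicit experiment ID now land in "fraud-detection".
with mlflow.start_run():
    mlflow.log_metric("auc", 0.91)

# Alternatively, create an experiment explicitly and pass its ID to start_run:
# exp_id = mlflow.create_experiment("another-experiment")
# mlflow.start_run(experiment_id=exp_id)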
mlflow.start_run() returns the currently active run (if one exists), or starts a new run
and returns a
mlflow.ActiveRun object usable as a context manager for the
current run. You do not need to call
start_run explicitly: calling one of the logging functions
with no active run automatically starts a new one.
mlflow.end_run() ends the currently active run, if any, taking an optional run status.
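For instance, a short sketch of an explicitly managed run (the parameter is illustrative):

import mlflow

run = mlflow.start_run()   # starts a new run and makes it active
mlflow.log_param("x", 1)
mlflow.end_run()           # ends the active run; optionally accepts a run status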
mlflow.log_param() logs a single key-value param in the currently active run. The key and
value are both strings. Use
mlflow.log_params() to log multiple params at once.
mlflow.log_metric() logs a single key-value metric. The value must always be a number.
MLflow remembers the history of values for each metric. Use
mlflow.log_metrics() to log
multiple metrics at once.
mlflow.set_tag() sets a single key-value tag in the currently active run. The key and
value are both strings. Use
mlflow.set_tags() to set multiple tags at once.
mlflow.log_artifact() logs a local file as an artifact, optionally taking an
artifact_path to place it in within the run’s artifact URI. Run artifacts can be organized into
directories, so you can place the artifact in a directory this way.
mlflow.log_artifacts() logs all the files in a given directory as artifacts, again taking an optional artifact_path.
mlflow.get_artifact_uri() returns the URI that artifacts from the current run should be logged to.
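Putting these together, here is a small sketch of the logging functions inside a single run (the parameter, metric, tag, and file names are illustrative placeholders):

import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", "0.01")      # keys and values are strings
    for loss in [0.9, 0.5, 0.3]:
        mlflow.log_metric("loss", loss)            # the full metric history is recorded
    mlflow.set_tag("model_type", "sklearn")

    # Write a placeholder file so there is something to log as an artifact.
    with open("output.txt", "w") as f:
        f.write("example artifact")
    mlflow.log_artifact("output.txt", artifact_path="outputs")

    print(mlflow.get_artifact_uri())               # where this run's artifacts are stored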
Sometimes you want to launch multiple MLflow runs in the same program: for example, maybe you are
performing a hyperparameter search locally or your experiments are just very fast to run. This is
easy to do because the
ActiveRun object returned by
mlflow.start_run() is a Python
context manager. You can “scope” each run to
just one block of code as follows:
with mlflow.start_run():
    mlflow.log_param("x", 1)
    mlflow.log_metric("y", 2)
    ...
The run remains open throughout the
with statement, and is automatically closed when the
statement exits, even if it exits due to an exception.
MLflow allows you to group runs under experiments, which can be useful for comparing runs intended
to tackle a particular task. You can create experiments using the Command-Line Interface (
mlflow experiments) or
mlflow.create_experiment() Python API. You can pass the experiment ID for an individual run
using the CLI (for example,
mlflow run ... --experiment-id [ID]) or the MLFLOW_EXPERIMENT_ID environment variable.
# Prints "created an experiment with ID <id>"
mlflow experiments create fraud-detection

# Set the ID via environment variables
export MLFLOW_EXPERIMENT_ID=<id>
# Launch a run. The experiment ID is inferred from the MLFLOW_EXPERIMENT_ID environment
# variable, or from the --experiment-id parameter passed to the MLflow CLI (the latter
# taking precedence)
with mlflow.start_run():
    mlflow.log_param("a", 1)
    mlflow.log_metric("b", 2)
MLflow provides a more detailed Tracking Service API for managing experiments and runs directly,
which is available through the client SDK in the mlflow.tracking module.
This makes it possible to query data about past runs, log additional information about them, create experiments,
add tags to a run, and more.
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiments = client.list_experiments()  # returns a list of mlflow.entities.Experiment
run = client.create_run(experiments[0].experiment_id)  # returns mlflow.entities.Run
client.log_param(run.info.run_uuid, "hello", "world")
client.set_terminated(run.info.run_uuid)
The Tracking UI lets you visualize, search and compare runs, as well as download run artifacts or
metadata for analysis in other tools. If you have been logging runs to a local mlruns directory, run
mlflow ui in the directory above it, and it loads the corresponding runs.
Alternatively, the MLflow tracking server serves the same UI and enables remote storage of run artifacts.
The UI contains the following key features:
- Experiment-based run listing and comparison
- Searching for runs by parameter or metric value
- Visualizing run metrics
- Downloading run results
All of the functions in the Tracking UI can be accessed programmatically. This makes it easy to do several common tasks:
- Query and compare runs using any data analysis tool of your choice, for example, pandas.
- Determine the artifact URI for a run to feed some of its artifacts into a new run when executing a workflow. For an example of querying runs and constructing a multistep workflow, see the MLflow Multistep Workflow Example project.
- Load artifacts from past runs as MLflow Models. For an example of training, exporting, and loading a model, and predicting using the model, see the MLflow TensorFlow example.
- Run automated parameter search algorithms, where you query the metrics from various runs to submit new ones. For an example of running automated parameter search algorithms, see the MLflow Hyperparameter Tuning Example project.
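As an illustration of determining artifact URIs for downstream steps, here is a hedged sketch, assuming the MlflowClient.list_run_infos method available in this release; it lists past runs in the default experiment (ID "0") and prints each run's artifact URI:

from mlflow.tracking import MlflowClient

client = MlflowClient()
# "0" is the default experiment's ID; substitute your own experiment ID.
for run_info in client.list_run_infos("0"):
    print(run_info.run_uuid, run_info.artifact_uri)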
You run an MLflow tracking server using
mlflow server. An example configuration for a server is:
mlflow server \
    --backend-store-uri /mnt/persistent-disk \
    --default-artifact-root s3://my-mlflow-bucket/ \
    --host 0.0.0.0
An MLflow tracking server has two components for storage: a backend store and an artifact store.
The backend store is where MLflow Tracking Server stores experiment and run metadata as well as params, metrics, and tags for runs. MLflow supports two types of backend stores: file store and database-backed store.
Use --backend-store-uri to configure the type of backend store. This can be a local path file
store specified as
file:/path_to_store, or a SQL connection string
for a database-backed store. For the latter, the argument must be a SQL connection string of the form
<dialect>+<driver>://<username>:<password>@<host>:<port>/<database>; the supported
database types are mysql, mssql, sqlite, and postgresql.
By default, --backend-store-uri is set to the local
./mlruns directory (the same as when running
mlflow run locally), but when running a server, make sure that this points to a
persistent (that is, non-ephemeral) file system location.
For backwards compatibility,
--file-store is an alias for this option.
The artifact store is a location suitable for large data (such as an S3 bucket or shared NFS
file system) and is where clients log their artifact output (for example, models).
artifact_location is a property recorded on mlflow.entities.Experiment to indicate the
default location to store artifacts for all runs in this experiment. Additionally, artifact_uri
is a property on
mlflow.entities.RunInfo to indicate the location where all artifacts for
this run are stored.
Use --default-artifact-root (which defaults to the local
./mlruns directory) to configure the default
location of the server’s artifact store. This will be used as the artifact location for newly created
experiments that do not specify one. Once you create an experiment, --default-artifact-root
is no longer relevant to that experiment.
To allow the server and clients to access the artifact location, you should configure your cloud
provider credentials as normal. For example, for S3, you can set the AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY environment variables, use an IAM role, or configure a default profile in ~/.aws/credentials.
See Set up AWS Credentials and Region for Development for more info.
If you do not specify a
--default-artifact-root or an artifact URI when creating the experiment (for example,
mlflow experiments create --artifact-location s3://<my-bucket>), the artifact root
is a path inside the file store. Typically this is not an appropriate location, as the client and
server probably refer to different physical locations (that is, the same path on different disks).
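If you prefer to set the artifact location from Python, here is a hedged sketch (the experiment name and bucket are placeholders, and S3 credentials must already be configured on the machine):

import mlflow

# Placeholder names; create_experiment accepts an explicit artifact_location.
exp_id = mlflow.create_experiment(
    "s3-backed-experiment",
    artifact_location="s3://my-mlflow-bucket/experiments",
)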
In addition to local file paths, MLflow supports the following storage systems as artifact stores: Amazon S3, Azure Blob Storage, Google Cloud Storage, SFTP server, and NFS.
To store artifacts in S3, specify a URI of the form
s3://<bucket>/<path>. MLflow obtains
credentials to access S3 from your machine’s IAM role, a profile in ~/.aws/credentials, or
the environment variables
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, depending on which of
these are available. For more information on how to set credentials, see
Set up AWS Credentials and Region for Development.
To store artifacts in a custom endpoint, set the
MLFLOW_S3_ENDPOINT_URL environment variable to your endpoint’s URL.
For example, if you have a Minio server at 188.8.131.52 on port 9000:
export MLFLOW_S3_ENDPOINT_URL=http://188.8.131.52:9000
To store artifacts in Azure Blob Storage, specify a URI of the form
wasbs://<container>@<storage-account>.blob.core.windows.net/<path>.
MLflow expects Azure Storage access credentials in the
AZURE_STORAGE_CONNECTION_STRING or AZURE_STORAGE_ACCESS_KEY environment variables (preferring
a connection string if one is set), so you must set one of these variables on both your client
application and your MLflow tracking server. Finally, you must run
pip install azure-storage
separately (on both your client and the server) to access Azure Blob Storage; MLflow does not declare
a dependency on this package by default.
To store artifacts in Google Cloud Storage, specify a URI of the form gs://<bucket>/<path>.
You should configure credentials for accessing the GCS container on the client and server as described
in the GCS documentation.
Finally, you must run
pip install google-cloud-storage (on both your client and the server)
to access Google Cloud Storage; MLflow does not declare a dependency on this package by default.
Specify a URI of the form ftp://user@host/path/to/directory to store artifacts in an FTP server.
The URI may optionally include a password for logging into the server, e.g. ftp://user:pass@host/path/to/directory.
To store artifacts in an SFTP server, specify a URI of the form sftp://user@host/path/to/directory.
You should configure the client to be able to log in to the SFTP server without a password over SSH (e.g. public key, identity file in ssh_config, etc.).
The format sftp://user:pass@host/ is supported for logging in. However, for safety reasons this is not recommended.
When using this store,
pysftp must be installed on both the server and the client. Run
pip install pysftp to install the required package.
The --host option exposes the service on all interfaces. If running a server in production, we
would recommend not exposing the built-in server broadly (as it is unauthenticated and unencrypted),
and instead putting it behind a reverse proxy like NGINX or Apache httpd, or connecting over VPN.
Additionally, you should ensure that the
--backend-store-uri (which defaults to the
./mlruns directory) points to a persistent (non-ephemeral) disk or database connection.
To log to a tracking server, set the
MLFLOW_TRACKING_URI environment variable to the server’s URI,
along with its scheme and port (for example,
http://10.0.0.1:5000) or call mlflow.set_tracking_uri().
import mlflow

with mlflow.start_run():
    mlflow.log_param("a", 1)
    mlflow.log_metric("b", 2)