mlflow.pyfunc

The mlflow.pyfunc module defines a generic filesystem format for Python models and provides utilities for saving to and loading from this format. The format is self-contained in the sense that it includes all the information necessary to load and use the model. Dependencies are either stored directly with the model or referenced via a Conda environment.

The convention for pyfunc models is to have a predict method or function with the following signature:

predict(data: pandas.DataFrame) -> numpy.ndarray | pandas.Series | pandas.DataFrame

This convention is relied on by other MLflow components.
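
For illustration, the following is a minimal sketch of an object that satisfies this convention. The class name and the constant-prediction behavior are hypothetical and not part of the format:

    import pandas as pd

    class ConstantModel(object):
        """Hypothetical pyfunc-compliant model that predicts a constant value."""

        def __init__(self, value):
            self.value = value

        def predict(self, data):
            # `data` is a pandas.DataFrame; return one prediction per input row.
            return pd.Series([self.value] * len(data))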

The pyfunc model format is defined as a directory structure containing all required data, code, and configuration:

./dst-path/
    ./MLmodel: configuration
    <code>: code packaged with the model (specified in the MLmodel file)
    <data>: data packaged with the model (specified in the MLmodel file)
    <env>: Conda environment definition (specified in the MLmodel file)

A Python model contains an MLmodel file in “python_function” format in its root with the following parameters:

  • loader_module [required]:

    Python module that can load the model. The module is expected as a module identifier, e.g. mlflow.sklearn, and is imported via importlib.import_module. The imported module must contain a function with the following signature (a minimal sketch of such a loader module is given after this parameter list):

    _load_pyfunc(path: string) -> <pyfunc model>
    

    The path argument is specified by the data parameter and may refer to a file or directory.

  • code [optional]:

    Relative path to a directory containing the code packaged with this model. All files and directories inside this directory are added to the Python path prior to importing the model loader.

  • data [optional]:

    Relative path to a file or directory containing model data. The path is passed to the model loader.

  • env [optional]:

    Relative path to an exported Conda environment. If present this environment should be activated prior to running the model.
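
As referenced above, the following is a minimal sketch of a loader module satisfying the loader_module contract. The pickle-based storage and the file name used for a directory data path are assumptions for illustration, not requirements of the format:

    import os
    import pickle

    def _load_pyfunc(path):
        # `path` is the configured `data` path; assume it points at a pickled
        # model file, or at a directory containing one (the file name is
        # illustrative).
        if os.path.isdir(path):
            path = os.path.join(path, "model.pkl")
        with open(path, "rb") as f:
            return pickle.load(f)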

Example

>>> tree example/sklearn_iris/mlruns/run1/outputs/linear-lr
├── MLmodel
├── code
│   └── sklearn_iris.py
├── data
│   └── model.pkl
└── mlflow_env.yml
>>> cat example/sklearn_iris/mlruns/run1/outputs/linear-lr/MLmodel
python_function:
  code: code
  data: data/model.pkl
  loader_module: mlflow.sklearn
  env: mlflow_env.yml
  main: sklearn_iris
mlflow.pyfunc.add_to_model(model, loader_module, data=None, code=None, env=None)

Add a pyfunc spec to the model configuration.

Defines the pyfunc configuration schema. Callers can use this to create a valid pyfunc model flavor out of an existing directory structure. For example, other model flavors can use this to specify how to use their output as a pyfunc.

Note

All paths are relative to the exported model root directory.

Parameters:
  • model – Existing model.
  • loader_module – The module to be used to load the model.
  • data – Path to the model data.
  • code – Path to the code dependencies.
  • env – Conda environment.
Returns:

Updated model configuration.
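
A hypothetical usage sketch, assuming an mlflow.models.Model configuration constructed directly and written out with its save method; the loader module, paths, and environment file are illustrative:

    import mlflow.pyfunc
    from mlflow.models import Model

    model_config = Model()
    mlflow.pyfunc.add_to_model(model_config,
                               loader_module="mlflow.sklearn",
                               data="data/model.pkl",
                               env="mlflow_env.yml")
    model_config.save("exported_model/MLmodel")  # write the updated configuration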

mlflow.pyfunc.get_module_loader_src(src_path, dst_path)

Generate Python source of the model loader.

The model loader contains a load_pyfunc method with no parameters. It hardcodes the loading of the given model into the generated Python source so that the exported model has no unnecessary dependencies on MLflow or on any other configuration file format or parsing library.

Parameters:
  • src_path – Current path to the model.
  • dst_path – Relative or absolute path where the model will be stored in the deployment environment.
Returns:

Python source code of the model loader as string.
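
A usage sketch; the source path, deployment path, and output file name are illustrative assumptions:

    import mlflow.pyfunc

    # Generate standalone loader source for a model that will live at the
    # relative path "model" in the deployment environment.
    loader_src = mlflow.pyfunc.get_module_loader_src("exported_model", "model")
    with open("loader.py", "w") as f:
        f.write(loader_src)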

mlflow.pyfunc.load_pyfunc(path, run_id=None, suppress_warnings=False)

Load a model stored in Python function format.

Parameters:
  • path – Path to the model.
  • run_id – MLflow run ID.
  • suppress_warnings – If True, non-fatal warning messages associated with the model loading process will be suppressed. If False, these warning messages will be emitted.
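
A usage sketch; the model path and input column name are illustrative assumptions:

    import pandas as pd
    import mlflow.pyfunc

    # The loaded object follows the pyfunc predict convention described above.
    model = mlflow.pyfunc.load_pyfunc("/path/to/exported/model")
    predictions = model.predict(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))
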
mlflow.pyfunc.log_model(artifact_path, **kwargs)

Export the model in Python function form and log it with the current MLflow tracking service.

The model is exported by calling save_model() and logging the result with mlflow.tracking.log_artifacts().
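
A usage sketch; keyword arguments are forwarded to save_model(), and the artifact path, loader module, and data file below are illustrative assumptions:

    import mlflow
    import mlflow.pyfunc

    with mlflow.start_run():
        mlflow.pyfunc.log_model("model",
                                loader_module="mlflow.sklearn",
                                data_path="model.pkl")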

mlflow.pyfunc.save_model(dst_path, loader_module, data_path=None, code_path=(), conda_env=None, model=<mlflow.models.Model object>)

Export model as a generic Python function model.

Parameters:
  • dst_path – Path where the model is stored.
  • loader_module – The module to be used to load the model.
  • data_path – Path to a file or directory containing model data.
  • code_path – List of paths (files or directories) containing code dependencies not present in the environment. Every path in code_path is added to the Python path before the model is loaded.
  • conda_env – Path to the Conda environment definition. This environment is activated prior to running model code.
Returns:

Model configuration containing model info.
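
A usage sketch; the destination path, data file, and Conda environment file are illustrative assumptions:

    import mlflow.pyfunc

    mlflow.pyfunc.save_model(dst_path="exported_model",
                             loader_module="mlflow.sklearn",
                             data_path="model.pkl",
                             conda_env="conda.yaml")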

mlflow.pyfunc.spark_udf(spark, path, run_id=None, result_type='double')

A Spark UDF that can be used to invoke the Python function formatted model.

Parameters passed to the UDF are forwarded to the model as a DataFrame where the names are ordinals (0, 1, …).

The predictions are filtered to contain only the columns that can be represented as the result_type. If the result_type is string or an array of strings, all predictions are converted to string. If the result type is not an array type, the leftmost column with a matching type is returned.

>>> predict = mlflow.pyfunc.spark_udf(spark, "/my/local/model")
>>> df.withColumn("prediction", predict("name", "age")).show()
Parameters:
  • spark – A SparkSession object.
  • path – A path containing a mlflow.pyfunc model.
  • run_id – ID of the run that produced this model. If provided, run_id is used to retrieve the model logged with MLflow.
  • result_type –

    The return type of the user-defined function. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string. Only a primitive type or an array (pyspark.sql.types.ArrayType) of primitive types is allowed. The following classes of result type are supported (see the sketch after this entry for an array example):

    • "int" or pyspark.sql.types.IntegerType: The leftmost integer that can fit in an int32 result is returned, or an exception is raised if there is none.
    • "long" or pyspark.sql.types.LongType: The leftmost long integer that can fit in an int64 result is returned, or an exception is raised if there is none.
    • ArrayType(IntegerType|LongType): Return all integer columns that can fit into the requested size.
    • "float" or pyspark.sql.types.FloatType: The leftmost numeric result cast to float32 is returned, or an exception is raised if there is none.
    • "double" or pyspark.sql.types.DoubleType: The leftmost numeric result cast to double is returned, or an exception is raised if there is none.
    • ArrayType(FloatType|DoubleType): Return all numeric columns cast to the requested type. An exception is raised if there are no numeric columns.
    • "string" or pyspark.sql.types.StringType: The leftmost column converted to string.
    • ArrayType(StringType): Return all columns converted to string.
Returns:

Spark UDF that applies the model's prediction method to the data. The default return type is double.
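
A sketch of requesting an array result type so that all numeric output columns are returned; as in the example above, spark and df are assumed to exist, and the model path and column names are illustrative:

    from pyspark.sql.types import ArrayType, DoubleType
    import mlflow.pyfunc

    predict_all = mlflow.pyfunc.spark_udf(spark, "/my/local/model",
                                          result_type=ArrayType(DoubleType()))
    df.withColumn("predictions", predict_all("name", "age")).show()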