The mlflow.pyfunc module defines a generic filesystem format for Python models and provides
utilities for saving to and loading from this format. The format is self-contained in the sense
that it includes all the information needed to load and use the model. Dependencies
are either stored directly with the model or referenced via a Conda environment.
The convention for pyfunc models is to have a
predict method or function with the following signature:
predict(data: pandas.DataFrame) -> numpy.ndarray | pandas.Series | pandas.DataFrame
This convention is relied on by other MLflow components.
Pyfunc model format is defined as a directory structure containing all required data, code, and configuration:
./dst-path/
    ./MLmodel: configuration
    <code>: code packaged with the model (specified in the MLmodel file)
    <data>: data packaged with the model (specified in the MLmodel file)
    <env>: Conda environment definition (specified in the MLmodel file)
A Python model contains an
MLmodel file in “python_function” format in its root with the following parameters:
- loader_module [required]:
Python module that can load the model. Expected as a module identifier, e.g.
mlflow.sklearn; it will be imported via
importlib.import_module. The imported module must contain a function with the following signature:
_load_pyfunc(path: string) -> <pyfunc model>
The path argument is specified by the
data parameter and may refer to a file or directory.
- code [optional]:
Relative path to a directory containing the code packaged with this model. All files and directories inside this directory are added to the Python path prior to importing the model loader.
- data [optional]:
Relative path to a file or directory containing model data. The path is passed to the model loader.
- env [optional]:
Relative path to an exported Conda environment. If present this environment should be activated prior to running the model.
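The loading steps described above can be sketched as follows. This is only an illustration under stated assumptions, not MLflow's implementation: load_from_spec, demo_loader, and ConstantModel are hypothetical names, the spec is passed as a plain dict instead of being parsed from an MLmodel file, only the code directory itself is put on the Python path, and the fallback to the model root when data is absent is an assumption. The stand-in model works on plain lists so the example needs no third-party packages.

```python
import importlib
import os
import sys
import tempfile
import types


def load_from_spec(model_root, spec):
    """Resolve a python_function spec (a plain dict here) into a loaded model."""
    # Make packaged code importable before the loader module is imported.
    # Simplification: only the code directory itself is added.
    if spec.get("code"):
        sys.path.insert(0, os.path.join(model_root, spec["code"]))
    # loader_module is a module identifier such as "mlflow.sklearn".
    loader = importlib.import_module(spec["loader_module"])
    # Assumption: without a data entry, the model root is passed instead.
    if spec.get("data"):
        data_path = os.path.join(model_root, spec["data"])
    else:
        data_path = model_root
    return loader._load_pyfunc(data_path)


class ConstantModel:
    """Hypothetical model that predicts one stored value for every row."""

    def __init__(self, value):
        self.value = value

    def predict(self, data):
        return [self.value for _ in data]


def _load_pyfunc(path):
    # Loader entry point with the signature required by the format.
    with open(path) as handle:
        return ConstantModel(float(handle.read()))


# Register a hypothetical loader module so importlib can find it
# without MLflow or scikit-learn being installed.
demo_loader = types.ModuleType("demo_loader")
demo_loader._load_pyfunc = _load_pyfunc
sys.modules["demo_loader"] = demo_loader

# Build a tiny model directory containing only a data file.
model_root = tempfile.mkdtemp()
os.makedirs(os.path.join(model_root, "data"))
with open(os.path.join(model_root, "data", "model.txt"), "w") as handle:
    handle.write("0.5")

spec = {"loader_module": "demo_loader", "data": "data/model.txt"}
model = load_from_spec(model_root, spec)
predictions = model.predict([[1, 2], [3, 4]])
```

Keeping the loader behind a module identifier means the format itself does not need to know anything about the library that produced the model.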
>>> tree example/sklearn_iris/mlruns/run1/outputs/linear-lr
├── MLmodel
├── code
│   ├── sklearn_iris.py
│
├── data
│   └── model.pkl
└── mlflow_env.yml
>>> cat example/sklearn_iris/mlruns/run1/outputs/linear-lr/MLmodel
python_function:
  code: code
  data: data/model.pkl
  loader_module: mlflow.sklearn
  env: mlflow_env.yml
  main: sklearn_iris
add_to_model(model, loader_module, data=None, code=None, env=None)
Add a pyfunc spec to the model configuration.
Defines pyfunc configuration schema. Caller can use this to create a valid pyfunc model flavor out of an existing directory structure. For example, other model flavors can use this to specify how to use their output as a pyfunc.
All paths are relative to the exported model root directory.
- model – Existing model.
- loader_module – The module to be used to load the model.
- data – Path to the model data.
- code – Path to the code dependencies.
- env – Conda environment.
Returns the updated model configuration.
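To make the resulting configuration concrete, here is a minimal sketch of the kind of entry such a call records. It is not the MLflow implementation: add_pyfunc_spec is a hypothetical helper, the configuration is modeled as a plain dict instead of a Model object, and omitting unset optional entries is an assumption.

```python
def add_pyfunc_spec(flavors, loader_module, data=None, code=None, env=None):
    """Hypothetical helper: record a python_function entry in a flavors dict."""
    spec = {"loader_module": loader_module}
    # Assumption: optional entries are written only when they are supplied.
    for key, value in (("data", data), ("code", code), ("env", env)):
        if value is not None:
            spec[key] = value
    flavors["python_function"] = spec
    return flavors


# Paths are relative to the exported model root, as in the example above.
config = add_pyfunc_spec(
    {},
    "mlflow.sklearn",
    data="data/model.pkl",
    code="code",
    env="mlflow_env.yml",
)
```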
Generate Python source of the model loader.
The model loader contains a
load_pyfunc method with no parameters. It hardcodes the loading of the given model into Python source. This is done so that the exported model has no unnecessary dependencies on MLflow or any other configuration file format or parsing library.
- src_path – Current path to the model.
- dst_path – Relative or absolute path where the model will be stored in the deployment environment.
Returns the Python source code of the model loader as a string.
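The idea of hardcoding the load call into generated source can be sketched as below. This shows the general technique only and may differ from what MLflow actually emits: make_loader_source is a hypothetical function, and the module name and path passed to it are placeholder values.

```python
def make_loader_source(loader_module, model_path):
    """Hypothetical generator: return Python source with the load call hardcoded."""
    lines = [
        "import importlib",
        "",
        "",
        "def load_pyfunc():",
        # repr() turns the values into valid Python string literals.
        "    loader = importlib.import_module({!r})".format(loader_module),
        "    return loader._load_pyfunc({!r})".format(model_path),
        "",
    ]
    return "\n".join(lines)


# Placeholder values chosen for illustration.
source = make_loader_source("mlflow.sklearn", "/opt/model/data/model.pkl")

# Compiling confirms that the generated text is syntactically valid Python.
compile(source, "<generated loader>", "exec")
```

Because the module name and path are baked into the text, the generated loader needs no configuration file at load time.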
load_pyfunc(path, run_id=None, suppress_warnings=False)
Load a model stored in Python function format.
- path – Path to the model.
- run_id – MLflow run ID.
- suppress_warnings – If True, non-fatal warning messages associated with the model loading process will be suppressed. If False, these warning messages will be emitted.
Export model in Python function form and log it with current MLflow tracking service.
The model is exported by calling
save_model() and logging the result with
save_model(dst_path, loader_module, data_path=None, code_path=(), conda_env=None, model=<mlflow.models.Model object>)
Export model as a generic Python function model.
- dst_path – Path where the model is stored.
- loader_module – The module to be used to load the model.
- data_path – Path to a file or directory containing model data.
- code_path – List of paths (files or directories) containing code dependencies not present in
the environment. Every path in
code_path is added to the Python path before the model is loaded.
- conda_env – Path to the Conda environment definition. This environment is activated prior to running model code.
Returns the model configuration containing model info.
spark_udf(spark, path, run_id=None, result_type='double')
A Spark UDF that can be used to invoke the Python function formatted model.
Parameters passed to the UDF are forwarded to the model as a DataFrame where the names are ordinals (0, 1, …).
The predictions are filtered to contain only the columns that can be represented as the
result_type. If the
result_type is string or array of strings, all predictions are converted to string. If the result type is not an array type, the leftmost column with a matching type is returned.
>>> predict = mlflow.pyfunc.spark_udf(spark, "/my/local/model")
>>> df.withColumn("prediction", predict("name", "age")).show()
- spark – A SparkSession object.
- path – A path containing a
- run_id – ID of the run that produced this model. If provided,
run_id is used to retrieve the model logged with MLflow.
- result_type –
The return type of the user-defined function. The value can be either a
pyspark.sql.types.DataType object or a DDL-formatted type string. Only a primitive type or an array (pyspark.sql.types.ArrayType) of primitive types is allowed. The following classes of result type are supported:
- “int” or pyspark.sql.types.IntegerType: The leftmost integer that can fit in an int32 result is returned, or an exception is raised if there is none.
- “long” or pyspark.sql.types.LongType: The leftmost long integer that can fit in an int64 result is returned, or an exception is raised if there is none.
- ArrayType(IntegerType|LongType): Return all integer columns that can fit into the requested size.
- “float” or pyspark.sql.types.FloatType: The leftmost numeric result cast to float32 is returned, or an exception is raised if there is none.
- “double” or pyspark.sql.types.DoubleType: The leftmost numeric result cast to double is returned, or an exception is raised if there is none.
- ArrayType(FloatType|DoubleType): Return all numeric columns cast to the requested type. An exception is raised if there are no numeric columns.
- “string” or pyspark.sql.types.StringType: The result is the leftmost column converted to string.
- ArrayType(StringType): Return all columns converted to string.
Returns a Spark UDF that applies the model’s prediction method to the data. The default result type is double.
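The selection rules for non-array result types can be illustrated with a dependency-free sketch. It is an approximation written for this page, not the code behind spark_udf: select_result and its helpers are hypothetical, predictions are modeled as a dict of column lists in left-to-right order, and the float32 narrowing of the “float” type is not modeled.

```python
INT_LIMITS = {"int": 2 ** 31, "long": 2 ** 63}


def is_integer_column(values, limit):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return all(
        isinstance(v, int) and not isinstance(v, bool) and -limit <= v < limit
        for v in values
    )


def is_numeric_column(values):
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)


def select_result(columns, result_type="double"):
    """Pick the leftmost column that can be represented as result_type."""
    if result_type == "string":
        # Any column can be converted to string, so the leftmost one wins.
        first = next(iter(columns.values()))
        return [str(v) for v in first]
    for values in columns.values():
        if result_type in INT_LIMITS and is_integer_column(values, INT_LIMITS[result_type]):
            return list(values)
        if result_type in ("float", "double") and is_numeric_column(values):
            return [float(v) for v in values]
    raise ValueError("no column can be represented as {}".format(result_type))


# Hypothetical prediction columns, ordered left to right.
predictions = {
    "label": ["a", "b"],
    "big": [2 ** 40, 1],
    "score": [0.25, 0.75],
    "count": [3, 4],
}

as_int = select_result(predictions, "int")        # "big" does not fit in int32
as_long = select_result(predictions, "long")      # "big" fits in int64
as_double = select_result(predictions, "double")  # leftmost numeric column
as_string = select_result(predictions, "string")  # leftmost column of any type
```

In this sketch the “int” request skips the big column because one of its values exceeds the int32 range, while the “long” request accepts it.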