🤗 Transformers within MLflow
The transformers model flavor enables logging of transformers models, components, and pipelines in MLflow format via the mlflow.transformers.save_model() and mlflow.transformers.log_model() functions. Use of these functions also adds the python_function flavor to the MLflow Models that they produce, allowing the model to be interpreted as a generic Python function for inference via mlflow.pyfunc.load_model().
You can also use the mlflow.transformers.load_model() function to load a saved or logged MLflow Model with the transformers flavor in the native transformers formats.
This page explains the detailed features and configurations of the MLflow transformers flavor. For a general introduction to MLflow's Transformers integration, please refer to the MLflow Transformers Flavor page.
- Loading a Transformers Model as a Python Function
- Saving Prompt Templates with Transformer Pipelines
- Using model_config and Model Signature Params for Inference
- Pipelines vs. Component Logging
- Automatic Metadata and ModelCard logging
- Automatic Signature inference
- Scale Inference with Overriding Pytorch dtype
- Input Data Types for Audio Pipelines
- PEFT Models in MLflow Transformers flavor
Loading a Transformers Model as a Python Function
Supported Transformers Pipeline types
The transformers python_function (pyfunc) model flavor simplifies and standardizes both the inputs and outputs of pipeline inference. This conformity allows for serving and batch inference by coercing the data structures that are required for transformers inference pipelines to formats that are compatible with JSON serialization and casting to Pandas DataFrames.
Certain TextGenerationPipeline types, particularly instructional-based ones, may return the original prompt and line-formatting newline characters ("\n") in their outputs. For these pipeline types, if you would like to disable the prompt return, you can set "include_prompt": False in the model_config dictionary when saving or logging the model. To remove the newline characters from within the body of the generated text output, you can add the "collapse_whitespace": True option to the model_config dictionary. If the pipeline type being saved does not inherit from TextGenerationPipeline, these options will not perform any modification to the output returned from pipeline inference.
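For example, here is a minimal sketch of passing both options through model_config when logging a text-generation pipeline (the model choice and artifact name are illustrative):
import mlflow
import transformers

# Any pipeline that inherits from TextGenerationPipeline honors these options
generator = transformers.pipeline(task="text-generation", model="distilgpt2")

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model=generator,
        name="text_generator",
        # Strip the echoed prompt and collapse newlines in the generated text
        model_config={"include_prompt": False, "collapse_whitespace": True},
    )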
In the current version, audio and text-based large language models are supported for use with pyfunc, while computer vision, multi-modal, timeseries, reinforcement learning, and graph models are only supported for native type loading via mlflow.transformers.load_model().
Not all transformers pipeline types are supported. See the table below for the list of currently supported pipeline types that can be loaded as pyfunc.
Future releases of MLflow will introduce pyfunc support for these additional types.
The table below shows the mapping of transformers pipeline types to the python_function (pyfunc) model flavor data type inputs and outputs.
The inputs and outputs of the pyfunc implementation of these pipelines are not guaranteed to match the input types and output types that would be returned from native use of a given pipeline type. If your use case requires access to scores, top_k results, or other additional references within the output from a pipeline inference call, please use the native implementation by loading via mlflow.transformers.load_model() to receive the full output.
Similarly, if your use case requires the use of raw tensor outputs or processing of outputs through an external processor module, load the model components directly as a dict by calling mlflow.transformers.load_model() and specify the return_type argument as 'components'.
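As a sketch of these two native loading modes (the model URI is a placeholder, and the default return_type of "pipeline" is assumed):
import mlflow

# Load the full native pipeline (return_type="pipeline" is the default)
native_pipeline = mlflow.transformers.load_model("models:/my-transformers-model/1")

# Load the raw components as a dict instead (e.g. {"model": ..., "tokenizer": ...})
components = mlflow.transformers.load_model(
    "models:/my-transformers-model/1", return_type="components"
)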
| Pipeline Type | Input Type | Output Type |
| --- | --- | --- |
| Instructional Text Generation | str or List[str] | List[str] |
| Conversational | str or List[str] | List[str] |
| Summarization | str or List[str] | List[str] |
| Text Classification | str or List[str] | pd.DataFrame (dtypes: {'label': str, 'score': double}) |
| Text Generation | str or List[str] | List[str] |
| Text2Text Generation | str or List[str] | List[str] |
| Token Classification | str or List[str] | List[str] |
| Translation | str or List[str] | List[str] |
| ZeroShot Classification* | Dict[str, List[str] \| str]* | pd.DataFrame (dtypes: {'sequence': str, 'labels': str, 'scores': double}) |
| Table Question Answering** | Dict[str, List[str] \| str]** | List[str] |
| Question Answering*** | Dict[str, str]*** | List[str] |
| Fill Mask**** | str or List[str]**** | List[str] |
| Feature Extraction | str or List[str] | np.ndarray |
| AutomaticSpeechRecognition | bytes*****, str, or np.ndarray | List[str] |
| AudioClassification | bytes*****, str, or np.ndarray | pd.DataFrame (dtypes: {'label': str, 'score': double}) |
* A collection of these inputs can also be passed. The standard required key names are 'sequences' and 'candidate_labels', but these may vary. Check the input requirements for the architecture that you're using to ensure that the correct dictionary key names are provided.
** A collection of these inputs can also be passed. The reference table must be a json encoded dict (i.e. {'query': 'what did we sell most of?', 'table': json.dumps(table_as_dict)})
*** A collection of these inputs can also be passed. The standard required key names are 'question' and 'context'. Verify the expected input key names match the expected input to the model to ensure your inference request can be read properly.
**** The mask syntax for the model that you've chosen is going to be specific to that model's implementation. Some are '[MASK]', while others are '<mask>'. Verify the expected syntax to avoid failed inference requests.
***** If using pyfunc in MLflow Model Serving for realtime inference, the raw audio in bytes format must be base64 encoded prior to submitting to the endpoint. String inputs will be interpreted as URI locations.
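To make the dictionary-based inputs above concrete, here is a hedged sketch of pyfunc predict calls for a zero-shot classification pipeline and a question answering pipeline (the model URIs are placeholders, and the key names may vary by architecture, as noted in the footnotes):
import mlflow

# Zero-shot classification: the standard keys are 'sequences' and 'candidate_labels'
zero_shot = mlflow.pyfunc.load_model("models:/my-zero-shot-model/1")
zero_shot.predict(
    {
        "sequences": "MLflow makes model tracking straightforward.",
        "candidate_labels": ["mlops", "cooking", "sports"],
    }
)

# Question answering: the standard keys are 'question' and 'context'
qa = mlflow.pyfunc.load_model("models:/my-qa-model/1")
qa.predict(
    {
        "question": "What does MLflow track?",
        "context": "MLflow tracks experiments, parameters, metrics, and models.",
    }
)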
Example of loading a transformers model as a python function
In the example below, a simple pre-trained model is used within a pipeline. After logging to MLflow, the pipeline is loaded as a pyfunc and used to generate a response from a passed-in string.
import mlflow
import transformers

# Read a pre-trained conversation pipeline from HuggingFace hub
conversational_pipeline = transformers.pipeline(model="microsoft/DialoGPT-medium")

# Define the signature
signature = mlflow.models.infer_signature(
    "Hi there, chatbot!",
    mlflow.transformers.generate_signature_output(
        conversational_pipeline, "Hi there, chatbot!"
    ),
)

# Log the pipeline
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=conversational_pipeline,
        name="chatbot",
        task="conversational",
        signature=signature,
        input_example="A clever and witty question",
    )

# Load the saved pipeline as pyfunc
chatbot = mlflow.pyfunc.load_model(model_uri=model_info.model_uri)

# Ask the chatbot a question
response = chatbot.predict("What is machine learning?")

print(response)
# >> [It's a new thing that's been around for a while.]
Saving Prompt Templates with Transformer Pipelines
This feature is only available in MLflow 2.10.0 and above.
MLflow supports specifying prompt templates for certain pipeline types, such as the text-generation pipeline used in the example below.
Prompt templates are strings that are used to format user inputs prior to pyfunc inference. To specify a prompt template, use the prompt_template argument when calling mlflow.transformers.save_model() or mlflow.transformers.log_model(). The prompt template must be a string with a single format placeholder, {prompt}.
For example:
import mlflow
from transformers import pipeline

# Initialize a pipeline. `distilgpt2` uses a "text-generation" pipeline
generator = pipeline(model="distilgpt2")

# Define a prompt template
prompt_template = "Answer the following question: {prompt}"

# Save the model
mlflow.transformers.save_model(
    transformers_model=generator,
    path="path/to/model",
    prompt_template=prompt_template,
)
When the model is then loaded with mlflow.pyfunc.load_model(), the prompt template will be used to format user inputs before passing them into the pipeline:
import mlflow
# Load the model with pyfunc
model = mlflow.pyfunc.load_model("path/to/model")
# The prompt template will be used to format this input, so the
# string that is passed to the text-generation pipeline will be:
# "Answer the following question: What is MLflow?"
model.predict("What is MLflow?")
text-generation pipelines with a prompt template will have the return_full_text pipeline argument set to False by default. This is to prevent the template from being shown to users, which could potentially cause confusion as it was not part of their original input. To override this behaviour, either set return_full_text to True via params, or include it in a model_config dict in log_model(). See this section for more details on how to do this.
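For instance, a minimal sketch of both override routes on a single model, assuming a text-generation pipeline logged with a prompt template as in the example above (the model choice and artifact name are illustrative):
import mlflow
from transformers import pipeline

generator = pipeline(model="distilgpt2")

# Option 1: opt back into the full text at logging time via model_config
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=generator,
        name="text_generator",
        prompt_template="Answer the following question: {prompt}",
        model_config={"return_full_text": True},
    )

# Option 2: opt in per call via params (this requires return_full_text to be
# declared in the ModelSignature params schema, as described in the next section)
model = mlflow.pyfunc.load_model(model_info.model_uri)
model.predict("What is MLflow?", params={"return_full_text": True})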
For a more in-depth guide, check out the Prompt Templating notebook!
Using model_config and Model Signature Params for Inference
For transformers inference, there are two ways to pass in additional arguments to the pipeline.
- Use model_config when saving/logging the model. Optionally, specify model_config when calling load_model().
- Specify params at inference time when calling predict().
Use model_config to control how the model is loaded and how inference is performed for all input samples. Configuration in model_config is not overridable at predict() time unless a ModelSignature is indicated with the same parameters.
Use ModelSignature with a params schema, on the other hand, to allow downstream consumers to provide additional inference params that may be needed to compute the predictions for their specific samples.
If both model_config and a ModelSignature with parameters are saved when logging the model, both of them will be used for inference. The default parameters in the ModelSignature will override the params in model_config. If extra params are provided at inference time, they take precedence over all params. We recommend using model_config for those parameters needed to run the model in general for all samples. Then, add a ModelSignature with parameters for those extra parameters that you want downstream consumers to indicate for each of their samples.
- Using model_config
import mlflow
from mlflow.models import infer_signature
from mlflow.transformers import generate_signature_output
import transformers

architecture = "mrm8488/t5-base-finetuned-common_gen"
model = transformers.pipeline(
    task="text2text-generation",
    tokenizer=transformers.T5TokenizerFast.from_pretrained(architecture),
    model=transformers.T5ForConditionalGeneration.from_pretrained(architecture),
)
data = "pencil draw paper"

# Infer the signature
signature = infer_signature(
    data,
    generate_signature_output(model, data),
)

# Define a model_config
model_config = {
    "num_beams": 5,
    "max_length": 30,
    "do_sample": True,
    "remove_invalid_values": True,
}

# Save the model_config with the model
mlflow.transformers.save_model(
    model,
    path="text2text",
    model_config=model_config,
    signature=signature,
)

pyfunc_loaded = mlflow.pyfunc.load_model("text2text")

# model_config will be applied
result = pyfunc_loaded.predict(data)

# Override some of the inference configuration with different values
pyfunc_loaded = mlflow.pyfunc.load_model(
    "text2text", model_config=dict(do_sample=False)
)
Note that in the previous example, the user can't override the configuration do_sample when calling predict.
- Specifying params at inference time
import mlflow
from mlflow.models import infer_signature
from mlflow.transformers import generate_signature_output
import transformers

architecture = "mrm8488/t5-base-finetuned-common_gen"
model = transformers.pipeline(
    task="text2text-generation",
    tokenizer=transformers.T5TokenizerFast.from_pretrained(architecture),
    model=transformers.T5ForConditionalGeneration.from_pretrained(architecture),
)
data = "pencil draw paper"

# Define a model_config
model_config = {
    "num_beams": 5,
    "remove_invalid_values": True,
}

# Define the inference parameters to pass as params
inference_params = {
    "max_length": 30,
    "do_sample": True,
}

# Infer the signature including params
signature_with_params = infer_signature(
    data,
    generate_signature_output(model, data),
    params=inference_params,
)

# Save the model with the signature and model config
mlflow.transformers.save_model(
    model,
    path="text2text",
    model_config=model_config,
    signature=signature_with_params,
)

pyfunc_loaded = mlflow.pyfunc.load_model("text2text")

# Pass params at inference time
params = {
    "max_length": 20,
    "do_sample": False,
}

# In this case we only override max_length and do_sample;
# other params will use the defaults saved in the ModelSignature
# or in the model configuration.
# The final params used for prediction are as follows:
# {
#     "num_beams": 5,
#     "max_length": 20,
#     "do_sample": False,
#     "remove_invalid_values": True,
# }
result = pyfunc_loaded.predict(data, params=params)
Pipelines vs. Component Logging
The transformers flavor has two different primary mechanisms for saving and loading models: pipelines and components.
Saving transformers models with custom code (i.e. models that require trust_remote_code=True) requires transformers >= 4.26.0.
Pipelines
Pipelines, in the context of the Transformers library, are high-level objects that combine pre-trained models and tokenizers (as well as other components, depending on the task type) to perform a specific task. They abstract away much of the preprocessing and postprocessing work involved in using the models.
For example, a text classification pipeline would handle tokenizing the text, passing the tokens through a model, and then interpreting the logits to produce a human-readable classification.
When logging a pipeline with MLflow, you're essentially saving this high-level abstraction, which can be loaded and used directly for inference with minimal setup. This is ideal for end-to-end tasks where the preprocessing and postprocessing steps are standard for the task at hand.
Components
Components refer to the individual parts that can make up a pipeline, such as the model itself, the tokenizer, and any additional processors, extractors, or configuration needed for a specific task. Logging components with MLflow allows for more flexibility and customization. You can log individual components when your project needs to have more control over the preprocessing and postprocessing steps or when you need to access the individual components in a bespoke manner that diverges from how the pipeline abstraction would call them.
For example, you might log the components separately if you have a custom tokenizer or if you want to apply some special postprocessing to the model outputs. When loading the components, you can then reconstruct the pipeline with your custom components or use the components individually as needed.
MLflow by default uses a 500 MB max_shard_size to save the model object in the mlflow.transformers.save_model() or mlflow.transformers.log_model() APIs. You can use the environment variable MLFLOW_HUGGINGFACE_MODEL_MAX_SHARD_SIZE to override the value.
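For example, a minimal sketch of overriding the shard size before saving (the value shown is illustrative and assumes the size-string format that transformers accepts for max_shard_size):
import os

# Must be set before mlflow.transformers.save_model() / log_model() is called
os.environ["MLFLOW_HUGGINGFACE_MODEL_MAX_SHARD_SIZE"] = "5GB"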
For component-based logging, the only requirement that must be met in the submitted dict is that a model is provided. All other elements of the dict are optional.
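As a hedged sketch of component-based logging, the dict below supplies a model along with an optional tokenizer (the model choice, artifact name, and task are illustrative):
import mlflow
import transformers

architecture = "distilgpt2"
tokenizer = transformers.AutoTokenizer.from_pretrained(architecture)
model = transformers.AutoModelForCausalLM.from_pretrained(architecture)

# "model" is the only required key; other components such as the tokenizer are optional
components = {"model": model, "tokenizer": tokenizer}

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model=components,
        name="component_logged_model",
        task="text-generation",
    )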