Model Evaluation
This documentation covers MLflow's classic evaluation system (mlflow.models.evaluate), which uses EvaluationMetric and make_metric for custom metrics.
For GenAI/LLM evaluation, use the GenAI Evaluation system instead, which uses:
- mlflow.genai.evaluate() instead of mlflow.models.evaluate()
- Scorer objects instead of EvaluationMetric
- Built-in LLM judges and scorers
Important: These two systems are not interoperable. EvaluationMetric objects cannot be used with mlflow.genai.evaluate(), and Scorer objects cannot be used with mlflow.models.evaluate().
Introduction
MLflow's evaluation framework provides automated model assessment for classification and regression tasks. It generates performance metrics, visualizations, and diagnostic information through a unified API.
Unified Evaluation API
Evaluate models, Python functions, or static datasets with mlflow.models.evaluate() using a consistent interface across different evaluation modes.
Automated Metrics & Visualizations
Generate task-specific metrics and plots automatically, including confusion matrices, ROC curves, and feature importance with SHAP integration.
Custom Metrics
Define domain-specific evaluation criteria with make_metric() for business-specific performance measures beyond standard ML metrics.
Plugin Architecture
Extend evaluation with specialized frameworks like Giskard and Trubrics for advanced validation and vulnerability scanning.
Model Evaluation
Evaluate classification and regression models with automated metrics and visualizations.
Quick Start
import mlflow
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from mlflow.models import infer_signature
# Load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train model
model = xgb.XGBClassifier().fit(X_train, y_train)
# Create evaluation dataset
eval_data = X_test.copy()
eval_data["label"] = y_test
with mlflow.start_run():
    # Log model
    signature = infer_signature(X_test, model.predict(X_test))
    model_info = mlflow.sklearn.log_model(model, name="model", signature=signature)
    # Evaluate
    result = mlflow.models.evaluate(
        model_info.model_uri,
        eval_data,
        targets="label",
        model_type="classifier",
    )
    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
    print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")
This automatically generates performance metrics (accuracy, precision, recall, F1-score, ROC-AUC), visualizations (confusion matrix, ROC curve, precision-recall curve), and saves all artifacts to MLflow.
Model Types
- Classification
- Regression
For classification tasks:
result = mlflow.models.evaluate(
model_uri,
eval_data,
targets="label",
model_type="classifier",
)
# Access metrics
print(f"Precision: {result.metrics['precision_score']:.3f}")
print(f"Recall: {result.metrics['recall_score']:.3f}")
print(f"F1 Score: {result.metrics['f1_score']:.3f}")
print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")
Automatically generates: accuracy, precision, recall, F1-score, ROC-AUC, precision-recall AUC, log loss, Brier score, confusion matrix, and classification report.
For regression tasks:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
# Load regression dataset
housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression().fit(X_train, y_train)
# Create evaluation dataset
eval_data = X_test.copy()
eval_data["target"] = y_test
with mlflow.start_run():
    signature = infer_signature(X_train, model.predict(X_train))
    model_info = mlflow.sklearn.log_model(model, name="model", signature=signature)
    result = mlflow.models.evaluate(
        model_info.model_uri,
        eval_data,
        targets="target",
        model_type="regressor",
    )
    print(f"MAE: {result.metrics['mean_absolute_error']:.3f}")
    print(f"RMSE: {result.metrics['root_mean_squared_error']:.3f}")
    print(f"R² Score: {result.metrics['r2_score']:.3f}")
Automatically generates: MAE, MSE, RMSE, R² score, adjusted R², MAPE, residual plots, and distribution analysis.
Evaluator Configuration
Control evaluator behavior with the evaluator_config parameter:
# Include SHAP explainer for feature importance
result = mlflow.models.evaluate(
model_uri,
eval_data,
targets="label",
model_type="classifier",
evaluator_config={
"log_explainer": True,
"explainer_type": "exact",
},
)
Common options: log_explainer (log SHAP explainer), explainer_type (SHAP type: "exact", "permutation", "partition"), pos_label (positive class label for binary classification), average (averaging strategy for multiclass: "macro", "micro", "weighted").
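For example, per-class metrics for a multiclass model can be macro-averaged by passing the averaging strategy; a minimal sketch (model_uri and eval_data come from an earlier step):
# Sketch: macro-average precision/recall/F1 for a multiclass model
result = mlflow.models.evaluate(
    model_uri,
    eval_data,
    targets="label",
    model_type="classifier",
    evaluator_config={"average": "macro"},
)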
Evaluation Results
Access metrics, artifacts, and evaluation data:
# Run evaluation
result = mlflow.models.evaluate(
model_uri, eval_data, targets="label", model_type="classifier"
)
# Access metrics
for metric_name, value in result.metrics.items():
print(f"{metric_name}: {value}")
# Access artifacts (plots, tables)
for artifact_name, path in result.artifacts.items():
print(f"{artifact_name}: {path}")
# Access evaluation table
eval_table = result.tables["eval_results_table"]
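The evaluation table is a regular pandas DataFrame, so it can be filtered directly; a sketch that pulls out misclassified rows (the "label" and "prediction" column names are assumptions, so check the actual columns first):
# Sketch: inspect the logged evaluation table with pandas
eval_table = result.tables["eval_results_table"]
print(eval_table.columns.tolist())  # confirm the actual column names
errors = eval_table[eval_table["label"] != eval_table["prediction"]]  # assumed column names
print(f"Misclassified rows: {len(errors)}")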
Model Validation
MLflow 2.18.0 moved model validation from mlflow.models.evaluate() to mlflow.validate_evaluation_results().
Validate evaluation metrics against thresholds:
from mlflow.models import MetricThreshold
# Evaluate model
result = mlflow.models.evaluate(
model_uri, eval_data, targets="label", model_type="classifier"
)
# Define thresholds
thresholds = {
"accuracy_score": MetricThreshold(threshold=0.85, greater_is_better=True),
"precision_score": MetricThreshold(threshold=0.80, greater_is_better=True),
}
# Validate
try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        validation_thresholds=thresholds,
    )
    print("Model meets all thresholds")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"Validation failed: {e}")
Dataset Evaluation
Evaluate pre-computed predictions without re-running the model.
Usage
import mlflow
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Generate sample data and train a model
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Generate predictions
predictions = model.predict(X_test)
prediction_probabilities = model.predict_proba(X_test)[:, 1]
# Create evaluation dataset with predictions
eval_dataset = pd.DataFrame(
{
"prediction": predictions,
"target": y_test,
}
)
with mlflow.start_run():
    result = mlflow.models.evaluate(
        data=eval_dataset,
        predictions="prediction",
        targets="target",
        model_type="classifier",
    )
    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
Parameters
- data: DataFrame containing predictions and targets
- predictions: Column name containing model predictions
- targets: Column name containing ground truth labels
- model_type: Task type ("classifier" or "regressor")
When evaluating classification models with probability scores, include a column with probabilities for metrics like ROC-AUC:
eval_dataset = pd.DataFrame(
{
"prediction": predictions,
"prediction_proba": prediction_probabilities, # For ROC-AUC
"target": y_test,
}
)
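The same static-dataset pattern works for regression predictions; a minimal, self-contained sketch:
import mlflow
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Train a model and pre-compute predictions
housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
reg_model = LinearRegression().fit(X_train, y_train)

# Evaluate the stored predictions without re-running the model
reg_eval = pd.DataFrame({"prediction": reg_model.predict(X_test), "target": y_test.values})
with mlflow.start_run():
    result = mlflow.models.evaluate(
        data=reg_eval,
        predictions="prediction",
        targets="target",
        model_type="regressor",
    )
    print(f"MAE: {result.metrics['mean_absolute_error']:.3f}")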
Function Evaluation
Evaluate Python functions directly without logging models to MLflow.
Usage
import mlflow
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Define a prediction function
def predict_function(input_data):
    return model.predict(input_data)
# Create evaluation dataset
eval_data = pd.DataFrame(X_test)
eval_data["target"] = y_test
with mlflow.start_run():
    result = mlflow.models.evaluate(
        predict_function,
        eval_data,
        targets="target",
        model_type="classifier",
    )
    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
Function Requirements
The function must:
- Accept input data as its first parameter (DataFrame, numpy array, or compatible format)
- Return predictions in a format compatible with the specified model_type
- Be callable without additional arguments beyond the input data
For classification tasks, the function should return class predictions. For regression tasks, it should return continuous values.
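Because the callable can run arbitrary Python, it can also post-process model outputs; a sketch (reusing model and eval_data from the example above) that applies a custom probability threshold before returning class predictions:
# Sketch: return class predictions derived from probabilities with a custom threshold
def predict_with_threshold(input_data):
    proba = model.predict_proba(input_data)[:, 1]
    return (proba >= 0.6).astype(int)  # 0.6 is an illustrative threshold

with mlflow.start_run():
    result = mlflow.models.evaluate(
        predict_with_threshold,
        eval_data,
        targets="target",
        model_type="classifier",
    )
    print(f"Precision: {result.metrics['precision_score']:.3f}")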
Custom Metrics & Visualizations
Define custom evaluation metrics and create specialized visualizations.
Custom Metrics
The make_metric function is part of MLflow's classic evaluation system.
For GenAI/LLM custom metrics, use the @scorer decorator instead.
Create custom metrics with make_metric:
import mlflow
import numpy as np
from mlflow.models import make_metric
from mlflow.metrics.base import MetricValue
# Define custom metric
def custom_metric_fn(predictions, targets, metrics):
    """Custom metric function."""
    tp = np.sum((predictions == 1) & (targets == 1))
    fp = np.sum((predictions == 1) & (targets == 0))
    # Calculate custom value
    custom_value = (tp * 100) - (fp * 20)
    return MetricValue(
        aggregate_results={
            "custom_value": custom_value,
            "value_per_prediction": custom_value / len(predictions),
        },
    )
# Create metric
custom_metric = make_metric(
eval_fn=custom_metric_fn, greater_is_better=True, name="custom_metric"
)
with mlflow.start_run():
    result = mlflow.models.evaluate(
        model_uri,
        eval_data,
        targets="target",
        model_type="classifier",
        extra_metrics=[custom_metric],
    )
    print(f"Custom Value: {result.metrics['custom_metric/custom_value']:.2f}")
Custom metric functions receive three parameters:
- predictions: Model predictions (numpy array)
- targets: Ground truth labels (numpy array)
- metrics: Dictionary of built-in metrics already computed
Return a MetricValue object with aggregate_results dict containing your custom metric values.
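A metric can also report per-row results alongside aggregates by populating the scores field of MetricValue; a sketch of a per-prediction error indicator (the metric name and values shown are illustrative):
# Sketch: per-row error flags plus an aggregate error rate
def error_indicator_fn(predictions, targets, metrics):
    errors = (np.asarray(predictions) != np.asarray(targets)).astype(int)
    return MetricValue(
        scores=errors.tolist(),  # one value per evaluated row
        aggregate_results={"error_rate": float(errors.mean())},
    )

error_indicator = make_metric(
    eval_fn=error_indicator_fn, greater_is_better=False, name="error_indicator"
)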
Custom Visualizations
Create custom visualization artifacts:
import matplotlib.pyplot as plt
import os
def create_custom_plot(eval_df, builtin_metrics, artifacts_dir):
    """Create custom visualization."""
    plt.figure(figsize=(10, 6))
    plt.scatter(eval_df["prediction"], eval_df["target"], alpha=0.5)
    plt.xlabel("Predictions")
    plt.ylabel("Targets")
    plt.title("Custom Prediction Analysis")
    # Save plot
    plot_path = os.path.join(artifacts_dir, "custom_plot.png")
    plt.savefig(plot_path)
    plt.close()
    return {"custom_plot": plot_path}
# Use custom artifact
with mlflow.start_run():
    result = mlflow.models.evaluate(
        model_uri,
        eval_data,
        targets="target",
        model_type="classifier",
        custom_artifacts=[create_custom_plot],
    )
Custom artifact functions receive three parameters:
- eval_df: DataFrame with predictions, targets, and input features
- builtin_metrics: Dictionary of computed metrics
- artifacts_dir: Directory path to save artifact files
Return a dictionary mapping artifact names to file paths.
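Custom artifacts are not limited to plots; a sketch that saves the disagreeing rows as a CSV file, using the same eval_df columns as the plotting example above:
# Sketch: save misclassified rows as a CSV artifact
def save_error_rows(eval_df, builtin_metrics, artifacts_dir):
    errors = eval_df[eval_df["prediction"] != eval_df["target"]]
    table_path = os.path.join(artifacts_dir, "error_rows.csv")
    errors.to_csv(table_path, index=False)
    return {"error_rows": table_path}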
SHAP Integration
MLflow's built-in SHAP integration provides automatic model explanations and feature importance analysis.
Usage
Enable SHAP explanations by setting log_explainer: True in the evaluator config:
import mlflow
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from mlflow.models import infer_signature
# Load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train model
model = xgb.XGBClassifier().fit(X_train, y_train)
# Create evaluation dataset
eval_data = X_test.copy()
eval_data["label"] = y_test
with mlflow.start_run():
    signature = infer_signature(X_test, model.predict(X_test))
    model_info = mlflow.sklearn.log_model(model, name="model", signature=signature)
    # Evaluate with SHAP enabled
    result = mlflow.models.evaluate(
        model_info.model_uri,
        eval_data,
        targets="label",
        model_type="classifier",
        evaluator_config={"log_explainer": True},
    )
    # Check generated SHAP artifacts
    for artifact_name in result.artifacts:
        if "shap" in artifact_name.lower():
            print(f"Generated: {artifact_name}")
This generates feature importance plots, SHAP summary plots, and saves a SHAP explainer model.
Configuration
Control SHAP behavior with evaluator config options:
result = mlflow.models.evaluate(
model_uri,
eval_data,
targets="label",
model_type="classifier",
evaluator_config={
"log_explainer": True,
"explainer_type": "exact",
"max_error_examples": 100,
"log_model_explanations": True,
},
)
Configuration Options:
- log_explainer: Whether to save the SHAP explainer as a model (default: False)
- explainer_type: SHAP algorithm type - "exact", "permutation", or "partition"
- max_error_examples: Number of misclassified examples to explain in detail
- log_model_explanations: Whether to log individual prediction explanations
Using Saved Explainers
Load and use saved SHAP explainers on new data:
# Load the saved explainer
explainer_uri = f"runs:/{run_id}/explainer"
explainer = mlflow.pyfunc.load_model(explainer_uri)
# Generate explanations for new data
new_data = X_test[:10]
explanations = explainer.predict(new_data)
# explanations contains SHAP values for each feature and prediction
print(f"Explanations shape: {explanations.shape}")
Plugin Evaluators
MLflow's evaluation framework supports plugin evaluators that extend evaluation with specialized validation capabilities.
Giskard Plugin
The Giskard plugin scans models for vulnerabilities including performance bias, robustness issues, overconfidence, underconfidence, ethical bias, data leakage, stochasticity, and spurious correlations.
Documentation: Giskard-MLflow integration docs
Trubrics Plugin
The Trubrics plugin provides a validation framework with pre-built validation checks and support for custom Python validation functions.
Example: Official example notebook
Documentation: Trubrics-MLflow integration docs
API Reference
- mlflow.models.evaluate() - Main evaluation API
- mlflow.validate_evaluation_results() - Validate evaluation results
- mlflow.models.make_metric() - Create custom metrics
- mlflow.metrics.base.MetricValue() - Metric return value