MLflow Evaluation
This documentation covers MLflow's classic evaluation system (mlflow.evaluate), which uses EvaluationMetric and make_metric for custom metrics.
For GenAI/LLM evaluation, please use the newer system described in GenAI Evaluation, which uses:
- mlflow.genai.evaluate() instead of mlflow.evaluate()
- Scorer objects instead of EvaluationMetric
- Built-in LLM judges and scorers
Important: These two systems are not interoperable. EvaluationMetric objects cannot be used with mlflow.genai.evaluate(), and Scorer objects cannot be used with mlflow.evaluate().
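For orientation, here is a minimal sketch of the GenAI side for contrast. It is illustrative only and assumes MLflow 3.x; predict_fn and the sample data are hypothetical stand-ins for your application, and Safety is one of the built-in scorers.

```python
import mlflow
from mlflow.genai.scorers import Safety


# Hypothetical application under test: any callable that maps the
# evaluation inputs to a response string.
def predict_fn(question: str) -> str:
    return "MLflow is an open source platform for the ML lifecycle."


# Each record's "inputs" dict is passed as keyword arguments to predict_fn.
eval_data = [{"inputs": {"question": "What is MLflow?"}}]

# GenAI evaluation takes Scorer objects, not EvaluationMetric objects.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[Safety()],
)
```

Built-in scorers such as Safety are LLM judges under the hood, so they need access to a configured judge model.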
Introduction
Model evaluation is the cornerstone of reliable machine learning, transforming trained models into trustworthy, production-ready systems. MLflow's comprehensive evaluation framework goes beyond simple accuracy metrics, providing deep insights into model behavior, performance characteristics, and real-world readiness through automated testing, visualization, and validation pipelines.
MLflow's evaluation capabilities democratize advanced model assessment, making sophisticated evaluation techniques accessible to teams of all sizes. From rapid prototyping to enterprise deployment, MLflow evaluation ensures your models meet the highest standards of reliability, fairness, and performance.
Why Comprehensive Model Evaluation Matters
Beyond Basic Metrics
- 📊 Holistic Assessment: Performance metrics, visualizations, and explanations in one unified framework
- 🎯 Task-Specific Evaluation: Specialized evaluators for classification, regression, and LLM tasks
- 🔍 Model Interpretability: SHAP integration for understanding model decisions and feature importance
- ⚖️ Fairness Analysis: Bias detection and ethical AI validation across demographic groups
Production Readiness
- 🚀 Automated Validation: Threshold-based model acceptance with customizable criteria (see the sketch after this list)
- 📈 Performance Monitoring: Track model degradation and drift over time
- 🔄 A/B Testing Support: Compare candidate models against production baselines
- 📋 Audit Trails: Complete evaluation history for regulatory compliance and model governance
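As a concrete illustration of threshold-based acceptance, here is a minimal sketch assuming MLflow 2.x, where evaluate accepts validation_thresholds and baseline_model arguments (newer releases move this check into a separate validation step); the model URIs and threshold values are hypothetical.

```python
import mlflow
from mlflow.models import MetricThreshold

# Hypothetical acceptance criteria: the candidate must reach at least 80% accuracy
# and improve on the baseline by at least 5% (relative).
thresholds = {
    "accuracy_score": MetricThreshold(
        threshold=0.8,
        min_relative_change=0.05,
        greater_is_better=True,
    )
}

# eval_data: a DataFrame of features plus a "target" column,
# as in the classification example later in this section.
# Raises if the candidate misses a threshold or fails to beat the baseline.
mlflow.evaluate(
    model="models:/candidate-model/1",  # hypothetical candidate URI
    data=eval_data,
    targets="target",
    model_type="classifier",
    validation_thresholds=thresholds,
    baseline_model="models:/prod-model/3",  # hypothetical production baseline URI
)
```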
Why MLflow Evaluation?
MLflow's evaluation framework provides a comprehensive solution for model assessment and validation:
- ⚡ One-Line Evaluation: Comprehensive model assessment with mlflow.evaluate(), requiring minimal configuration
- 🎛️ Flexible Evaluation Modes: Evaluate models, functions, or static datasets with the same unified API
- 📊 Rich Visualizations: Automatic generation of performance plots, confusion matrices, and diagnostic charts
- 🔧 Custom Metrics: Define domain-specific evaluation criteria with easy-to-use metric builders (see the sketch after this list)
- 🧠 Built-in Explainability: SHAP integration for model interpretation and feature importance analysis
- 👥 Team Collaboration: Share evaluation results and model comparisons through MLflow's tracking interface
- 🏭 Enterprise Integration: Plugin architecture for specialized evaluation frameworks like Giskard and Trubrics
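Custom metrics in the classic system are built with make_metric. The following is an illustrative sketch, not an API reference: it assumes mlflow.models.make_metric with an eval_fn that receives prediction and target series (the expected eval_fn signature has changed across MLflow versions), and weighted_error is a made-up domain metric.

```python
import numpy as np
from mlflow.models import make_metric


def weighted_error(predictions, targets, metrics):
    # Hypothetical domain metric: penalize over-predictions twice as heavily
    # as under-predictions.
    diff = np.asarray(predictions) - np.asarray(targets)
    return float(np.mean(np.where(diff > 0, 2 * np.abs(diff), np.abs(diff))))


weighted_error_metric = make_metric(
    eval_fn=weighted_error,
    greater_is_better=False,
    name="weighted_error",
)

# Passed to the evaluator alongside the built-in metrics, e.g.:
# mlflow.models.evaluate(..., extra_metrics=[weighted_error_metric])
```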
Core Evaluation Capabilities
Automated Model Assessment
MLflow evaluation transforms complex model assessment into simple, reproducible workflows:
```python
import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

# Load and prepare data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create evaluation dataset: feature columns plus the ground-truth target
eval_data = pd.DataFrame(X_test, columns=wine.feature_names)
eval_data["target"] = y_test

with mlflow.start_run():
    # Log model
    model_info = mlflow.sklearn.log_model(model, name="model")

    # Comprehensive evaluation with one line
    result = mlflow.models.evaluate(
        model=model_info.model_uri,
        data=eval_data,
        targets="target",
        model_type="classifier",
        evaluators=["default"],
    )
```
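The returned EvaluationResult exposes everything that was computed, and the same metrics and artifacts are logged to the active run. A brief sketch of inspecting it (exact metric and artifact names vary by evaluator and model type):

```python
# Scalar metrics (accuracy, F1, ROC-AUC, ...) come back as a plain dict.
for name, value in result.metrics.items():
    print(f"{name}: {value}")

# Generated plots and tables (e.g. confusion matrix, ROC curve) are logged
# as run artifacts and referenced from the result object.
for artifact_name, artifact in result.artifacts.items():
    print(f"{artifact_name}: {artifact.uri}")
```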
What Gets Automatically Generated
Performance Metrics
- 📊 Classification: Accuracy, precision, recall, F1-score, ROC-AUC, confusion matrices
- 📈 Regression: MAE, MSE, RMSE, R², residual analysis, prediction vs actual plots