# XGBoost with MLflow
In this comprehensive guide, we'll explore how to use XGBoost with MLflow for experiment tracking, model management, and production deployment. We'll cover both the native XGBoost API and the scikit-learn-compatible interface, from basic autologging to advanced distributed training patterns.
## Quick Start with Autologging
The fastest way to get started is with MLflow's XGBoost autologging. Enable comprehensive experiment tracking with a single line:
```python
import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Enable autologging for XGBoost
mlflow.xgboost.autolog()

# Load sample data
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Prepare DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Define training parameters
params = {
    "objective": "reg:squarederror",
    "max_depth": 6,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 42,  # the native API uses "seed"; "random_state" is the sklearn wrapper's name
}

# Train model - MLflow automatically logs everything
with mlflow.start_run():
    model = xgb.train(
        params=params,
        dtrain=dtrain,
        num_boost_round=100,
        evals=[(dtrain, "train"), (dtest, "test")],
        early_stopping_rounds=10,
        verbose_eval=False,
    )

    print(f"Best iteration: {model.best_iteration}")
    print(f"Best score: {model.best_score}")
```
This simple example automatically logs:

- All XGBoost parameters and training configuration
- Training and validation metrics for each boosting round
- Feature importance plots and JSON artifacts
- The trained model with proper serialization
- Early stopping metrics and best iteration information
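Once the run finishes, you can verify what was captured and reload the logged booster. The following is a minimal sketch, reusing `dtest` from above; it assumes the autologged model lives under the default `model` artifact path, which is worth confirming in the MLflow UI for your version.

```python
import mlflow

# Inspect the run that autologging just populated.
run = mlflow.last_active_run()
print(f"Run ID: {run.info.run_id}")
print(f"Logged params: {run.data.params}")
print(f"Logged metrics: {run.data.metrics}")

# Reload the logged booster and score the held-out set.
# "model" is the artifact path autologging typically uses (an assumption here).
loaded = mlflow.xgboost.load_model(f"runs:/{run.info.run_id}/model")
preds = loaded.predict(dtest)
```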
## Understanding XGBoost Autologging
### What Gets Logged
MLflow's XGBoost autologging captures comprehensive information about your gradient boosting process automatically:
| Category | Information Captured |
|---|---|
| Parameters | All booster parameters, training configuration, callback settings |
| Metrics | Training/validation metrics per iteration, early stopping metrics |
| Feature Importance | Weight, gain, cover, and total_gain importance with visualizations |
| Artifacts | Trained model, feature importance plots, JSON importance data |
The autologging system is designed to be comprehensive yet non-intrusive. It captures everything you need for reproducibility without requiring changes to your existing XGBoost code.
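You can also tune how much autologging captures. A sketch of the main switches is below; the keyword arguments shown exist on `mlflow.xgboost.autolog()` in recent MLflow releases, but check the docstring of your installed version before relying on them.

```python
import mlflow

# Control what autologging records; defaults vary by MLflow version.
mlflow.xgboost.autolog(
    log_input_examples=True,    # store a small sample of training data with the model
    log_model_signatures=True,  # infer and record the model's input/output schema
    log_models=True,            # set False to track params/metrics without saving the model
    log_datasets=False,         # skip dataset lineage tracking
    silent=True,                # suppress autologging event logs and warnings
)
```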
### Native vs Scikit-learn API

XGBoost offers two main interfaces, and MLflow supports both seamlessly:
```python
# Native XGBoost API - Maximum control and performance
import xgboost as xgb

mlflow.xgboost.autolog()
dtrain = xgb.DMatrix(X_train, label=y_train)
model = xgb.train(params, dtrain, num_boost_round=100)

# Scikit-learn API - Familiar interface with sklearn integration
from xgboost import XGBClassifier

mlflow.sklearn.autolog()  # Note: Use sklearn autolog for XGBoost sklearn API
model = XGBClassifier(n_estimators=100, max_depth=6)
model.fit(X_train, y_train)
```
**Choosing the Right API:**

**Native XGBoost API** - Use when you need:

- Maximum performance with direct access to all XGBoost optimizations
- Advanced features like custom objectives and evaluation metrics (see the first sketch below)
- Memory efficiency with fine-grained control over data loading
- Competition settings where every bit of performance matters

**Scikit-learn API** - Use when you need:

- Pipeline integration with sklearn preprocessing and feature engineering
- Hyperparameter tuning using GridSearchCV or RandomizedSearchCV (see the second sketch below)
- Team familiarity with sklearn patterns
- Rapid prototyping with familiar interfaces
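To make the custom objectives and evaluation metrics point concrete, here is a minimal sketch using the native API, reusing `dtrain` and `dtest` from the quick start. The function names `squared_error_obj` and `rmse_eval` are illustrative, not part of XGBoost, and the hand-rolled squared error simply mirrors the built-in objective.

```python
import numpy as np
import xgboost as xgb

import mlflow
import mlflow.xgboost

# Illustrative custom objective: hand-rolled squared error.
# XGBoost expects (gradient, hessian) with respect to the raw predictions.
def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels
    hess = np.ones_like(preds)
    return grad, hess

# Illustrative custom evaluation metric returning (name, value).
def rmse_eval(preds, dtrain):
    labels = dtrain.get_label()
    return "custom_rmse", float(np.sqrt(np.mean((preds - labels) ** 2)))

mlflow.xgboost.autolog()

with mlflow.start_run():
    model = xgb.train(
        params={"max_depth": 6, "learning_rate": 0.1, "seed": 42},
        dtrain=dtrain,
        num_boost_round=100,
        obj=squared_error_obj,    # custom objective
        custom_metric=rmse_eval,  # custom eval metric (XGBoost >= 1.6)
        evals=[(dtest, "test")],
        verbose_eval=False,
    )
```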
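And for the scikit-learn side, a sketch of pipeline integration plus grid search, reusing `X_train` and `y_train` from the quick start. The step names and parameter grid are illustrative choices, not requirements.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

import mlflow
import mlflow.sklearn

# Autolog via the sklearn flavor because the estimator follows sklearn conventions.
mlflow.sklearn.autolog()

pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("xgb", XGBRegressor(objective="reg:squarederror", random_state=42)),
    ]
)

# Illustrative grid; step-prefixed names follow sklearn's "<step>__<param>" convention.
param_grid = {
    "xgb__max_depth": [3, 6],
    "xgb__learning_rate": [0.05, 0.1],
}

with mlflow.start_run():
    search = GridSearchCV(
        pipeline, param_grid, cv=3, scoring="neg_root_mean_squared_error"
    )
    search.fit(X_train, y_train)
    print(f"Best params: {search.best_params_}")
```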