Deploy MLflow Model to Modal
Modal is a serverless cloud platform optimized for AI/ML workloads, offering on-demand GPU access with automatic scaling. The mlflow-modal-deploy plugin enables one-command deployment of MLflow models to Modal's infrastructure.
If you are new to MLflow model deployment, please read MLflow Deployment first to understand the basic concepts of MLflow models and deployments.
How it Works
The plugin automates the deployment process:
- Extract: MLflow model artifacts and dependencies are extracted from the model URI
- Upload: Model files are uploaded to a Modal Volume for persistent storage
- Generate: A Modal app is generated with FastAPI endpoints (/invocations, /predict_stream)
- Deploy: Modal builds a container with all dependencies and deploys to serverless infrastructure
- Serve: An HTTPS endpoint URL is returned, ready to handle prediction requests
The generated container mirrors your training environment, ensuring consistent behavior between development and production. Modal handles auto-scaling, GPU allocation, and container lifecycle management automatically.
Deploying Model to Modal
This section outlines the process of deploying a model to Modal using the MLflow deployment plugin. For Python API references and tutorials, see the Useful links section.
Step 0: Preparation
Install Libraries
Install the required libraries:
pip install mlflow mlflow-modal-deploy modal
Authentication Setup
Configure Modal authentication:
# Interactive setup (recommended)
modal setup
# Or use environment variables
export MODAL_TOKEN_ID=your-token-id
export MODAL_TOKEN_SECRET=your-token-secret
Create an MLflow Model
Before deploying, you must have an MLflow Model. If you don't have one, you can create a sample scikit-learn model
by following the MLflow Tracking Quickstart. Remember to note down the model URI, such as
runs:/<run_id>/model (or models:/<model_name>/<model_version> if you registered the model in the
MLflow Model Registry).
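For illustration, here is a minimal sketch of logging a scikit-learn model and capturing a model URI to use in the steps below (the dataset and model are just placeholders):
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small example model
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Log it to MLflow and build the runs:/ URI referenced throughout this page
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, artifact_path="model")

model_uri = f"runs:/{run.info.run_id}/model"
print(model_uri)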
Step 1: Test Your Model Locally
It's recommended to test your model locally before deploying to production. The run_local function
deploys the model using modal serve for local testing:
from mlflow_modal import run_local
run_local(
    target_uri="modal",
    name="test-model",
    model_uri="runs:/<run_id>/model",
    config={"gpu": "T4"},
)
This allows you to verify that:
- The model loads correctly with all dependencies
- The inference endpoint responds as expected
- The GPU configuration is valid
Step 2: Deploy to Modal
Once local testing passes, deploy to Modal's cloud infrastructure.
Using the Python API:
from mlflow.deployments import get_deploy_client
client = get_deploy_client("modal")
deployment = client.create_deployment(
    name="my-classifier",
    model_uri="runs:/<run_id>/model",
    config={
        "gpu": "T4",
        "memory": 2048,
        "min_containers": 1,
    },
)
print(f"Deployed to: {deployment['endpoint_url']}")
Or using the CLI:
# Deploy a model
mlflow deployments create -t modal -m runs:/<run_id>/model --name my-model
# Deploy with GPU and custom configuration
mlflow deployments create -t modal -m runs:/<run_id>/model --name gpu-model \
-C gpu=T4 -C memory=4096 -C min_containers=1
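Once the deployment is created, you can confirm it is registered before sending traffic. A quick check with the standard deployments client, reusing the deployment name from the Python example above:
from mlflow.deployments import get_deploy_client

client = get_deploy_client("modal")

# Fetch metadata for the deployment created above, including its endpoint URL
info = client.get_deployment("my-classifier")
print(info)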
Step 3: Make Predictions
After deployment, you can make predictions using the deployment client:
Using the Python API:
from mlflow.deployments import get_deploy_client
client = get_deploy_client("modal")
# Standard predictions
predictions = client.predict(
    deployment_name="my-classifier",
    inputs={"feature1": [1, 2, 3], "feature2": [4, 5, 6]},
)

# Streaming predictions (for LLM models)
for chunk in client.predict_stream(
    deployment_name="my-llm",
    inputs={"messages": [{"role": "user", "content": "Hello!"}]},
):
    print(chunk, end="", flush=True)
Or using the CLI:
# Make predictions
mlflow deployments predict -t modal --name my-model --input-path input.json
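As an illustration, input.json can mirror the payload from the Python example above, e.g. a column-oriented JSON object; the exact format accepted by the CLI may differ (for example, pandas split-oriented JSON), so check the plugin documentation if this shape is rejected:
{
  "feature1": [1, 2, 3],
  "feature2": [4, 5, 6]
}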
Configuration Options
The following configuration options are available when creating a deployment:
| Option | Type | Default | Description |
|---|---|---|---|
| gpu | str/list | None | GPU type: T4, L4, L40S, A10, A100, A100-40GB, A100-80GB, H100, H200, B200. Supports multi-GPU (H100:8), dedicated (H100!), or a fallback list (["H100", "A100"]) |
| memory | int | 512 | Memory allocation in MB |
| cpu | float | 1.0 | CPU cores |
| timeout | int | 300 | Request timeout in seconds |
| startup_timeout | int | None | Container startup timeout in seconds (useful for large models) |
| scaledown_window | int | 60 | Seconds before an idle container scales down |
| concurrent_inputs | int | 1 | Max concurrent requests per container |
| min_containers | int | 0 | Minimum warm containers (set > 0 to avoid cold starts) |
| max_containers | int | None | Maximum number of containers |
| enable_batching | bool | False | Enable dynamic request batching |
| max_batch_size | int | 8 | Max batch size when batching is enabled |
| batch_wait_ms | int | 100 | Batch wait time in milliseconds |
| extra_pip_packages | list | [] | Additional pip packages to install |
For detailed information on these options, see the Modal documentation:
- GPU configuration - GPU types, multi-GPU, dedicated GPUs
- CPU and memory - Resource allocation
- Timeouts - Request and startup timeouts
- Scaling - Container scaling and warm pools
- Concurrency - Concurrent request handling
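For example, a CPU-only deployment tuned to avoid cold starts while capping cost might combine several of these options (the values below are illustrative, not recommendations):
config = {
    "cpu": 2.0,               # two CPU cores per container
    "memory": 4096,           # 4 GB of RAM
    "timeout": 120,           # fail requests that take longer than 2 minutes
    "min_containers": 1,      # keep one warm container to avoid cold starts
    "max_containers": 5,      # cap scale-out
    "scaledown_window": 300,  # keep idle containers around for 5 minutes
    "concurrent_inputs": 4,   # allow 4 in-flight requests per container
}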
Advanced Usage
GPU Selection
Modal supports a wide range of GPU types for different workloads. See Modal's GPU documentation for the full list of available GPUs and configuration options.
# Single GPU
config = {"gpu": "T4"} # Cost-effective for inference
# High-performance GPU
config = {"gpu": "H100"} # Best for large models
# Multi-GPU for large models
config = {"gpu": "H100:8"} # 8x H100 GPUs
# Dedicated GPU (no sharing)
config = {"gpu": "H100!"}
# Fallback list (uses first available)
config = {"gpu": ["H100", "A100", "A10"]}
High-Throughput Deployment
For high-throughput workloads, enable dynamic batching:
client.create_deployment(
    name="batch-classifier",
    model_uri="runs:/<run_id>/model",
    config={
        "gpu": "A100",
        "enable_batching": True,
        "max_batch_size": 32,
        "batch_wait_ms": 50,
        "min_containers": 2,
        "max_containers": 20,
    },
)
)
Deploy to Specific Workspace
Deploy to a specific Modal workspace:
# Use workspace-specific URI
client = get_deploy_client("modal:/production")
Or via CLI:
mlflow deployments create -t modal:/production -m runs:/<run_id>/model --name my-model
Managing Deployments
# List all deployments
mlflow deployments list -t modal
# Get deployment info
mlflow deployments get -t modal --name my-model
# Delete deployment
mlflow deployments delete -t modal --name my-model
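The same operations are available from Python; a short sketch using the standard deployments client API:
from mlflow.deployments import get_deploy_client

client = get_deploy_client("modal")

# List all deployments
for deployment in client.list_deployments():
    print(deployment["name"])

# Get deployment info
print(client.get_deployment("my-model"))

# Delete deployment
client.delete_deployment("my-model")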
Troubleshooting
Modal Authentication Fails
# Re-authenticate with Modal
modal setup
# Verify authentication
modal profile list
Deployment Times Out
For large models that take longer to load, increase the startup timeout:
config = {
    "startup_timeout": 600,  # 10 minutes for model loading
    "timeout": 300,          # 5 minutes for inference requests
}
Missing Dependencies
If the model fails with import errors, add missing packages:
config = {
    "extra_pip_packages": ["missing-package>=1.0"],
}
View Build Logs
Check the Modal Dashboard for detailed build and runtime logs.
API Reference
The mlflow-modal-deploy plugin implements the standard MLflow deployments plugin interface, so the client methods used throughout this page (create_deployment, predict, predict_stream, list_deployments, get_deployment, delete_deployment) and the corresponding mlflow deployments CLI commands all work with the modal target.
Useful Links
- mlflow-modal-deploy GitHub Repository - Source code, issue tracker, and contribution guidelines.
- Modal Documentation - Comprehensive Modal platform documentation.
- Modal GPU Guide - Detailed information on GPU types and configuration.
- MLflow Model Format - Understanding MLflow model packaging and flavors.