
Deploy MLflow Model to Modal

Modal is a serverless cloud platform optimized for AI/ML workloads, offering on-demand GPU access with automatic scaling. The mlflow-modal-deploy plugin enables one-command deployment of MLflow models to Modal's infrastructure.

If you are new to MLflow model deployment, please read MLflow Deployment first to understand the basic concepts of MLflow models and deployments.

How it Works

The plugin automates the deployment process:

  1. Extract: MLflow model artifacts and dependencies are extracted from the model URI
  2. Upload: Model files are uploaded to a Modal Volume for persistent storage
  3. Generate: A Modal app is generated with FastAPI endpoints (/invocations, /predict_stream)
  4. Deploy: Modal builds a container with all dependencies and deploys to serverless infrastructure
  5. Serve: An HTTPS endpoint URL is returned, ready to handle prediction requests

The generated container mirrors your training environment, ensuring consistent behavior between development and production. Modal handles auto-scaling, GPU allocation, and container lifecycle management automatically.
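The generated /invocations endpoint accepts the standard MLflow scoring payload. As a minimal sketch of what a raw request looks like, built with only the standard library (the endpoint URL below is a placeholder for the one returned at deploy time):

```python
import json
from urllib import request

# MLflow scoring servers accept the "dataframe_split" payload format
payload = {
    "dataframe_split": {
        "columns": ["feature1", "feature2"],
        "data": [[1, 4], [2, 5], [3, 6]],
    }
}
body = json.dumps(payload).encode("utf-8")

# hypothetical endpoint URL; substitute the one returned by your deployment
req = request.Request(
    "https://your-workspace--my-classifier.modal.run/invocations",
    data=body,
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # uncomment against a live deployment
```

In practice the deployment client shown later wraps this for you; the raw form is mainly useful for calling the endpoint from non-Python services.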

Deploying Model to Modal

This section outlines the process of deploying a model to Modal using the MLflow deployment plugin. For Python API references and tutorials, see the Useful links section.

Step 0: Preparation

Install Libraries

Install the following libraries:

bash
pip install mlflow mlflow-modal-deploy modal

Authentication Setup

Configure Modal authentication:

bash
# Interactive setup (recommended)
modal setup

# Or use environment variables
export MODAL_TOKEN_ID=your-token-id
export MODAL_TOKEN_SECRET=your-token-secret

Create an MLflow Model

Before deploying, you must have an MLflow Model. If you don't have one, you can create a sample scikit-learn model by following the MLflow Tracking Quickstart. Remember to note down the model URI, such as runs:/<run_id>/model (or models:/<model_name>/<model_version> if you registered the model in the MLflow Model Registry).

Step 1: Test Your Model Locally

It's recommended to test your model locally before deploying to production. The run_local function deploys the model using modal serve for local testing:

python
from mlflow_modal import run_local

run_local(
    target_uri="modal",
    name="test-model",
    model_uri="runs:/<run_id>/model",
    config={"gpu": "T4"},
)

This allows you to verify that:

  • The model loads correctly with all dependencies
  • The inference endpoint responds as expected
  • The GPU configuration is valid

Step 2: Deploy to Modal

Once local testing passes, deploy to Modal's cloud infrastructure.

python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("modal")

deployment = client.create_deployment(
    name="my-classifier",
    model_uri="runs:/<run_id>/model",
    config={
        "gpu": "T4",
        "memory": 2048,
        "min_containers": 1,
    },
)

print(f"Deployed to: {deployment['endpoint_url']}")

Step 3: Make Predictions

After deployment, you can make predictions using the deployment client:

python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("modal")

# Standard predictions
predictions = client.predict(
    deployment_name="my-classifier",
    inputs={"feature1": [1, 2, 3], "feature2": [4, 5, 6]},
)

# Streaming predictions (for LLM models)
for chunk in client.predict_stream(
    deployment_name="my-llm",
    inputs={"messages": [{"role": "user", "content": "Hello!"}]},
):
    print(chunk, end="", flush=True)

Configuration Options

The following configuration options are available when creating a deployment:

| Option | Type | Default | Description |
|---|---|---|---|
| gpu | str/list | None | GPU type: T4, L4, L40S, A10, A100, A100-40GB, A100-80GB, H100, H200, B200. Supports multi-GPU ("H100:8"), dedicated ("H100!"), or a fallback list (["H100", "A100"]) |
| memory | int | 512 | Memory allocation in MB |
| cpu | float | 1.0 | CPU cores |
| timeout | int | 300 | Request timeout in seconds |
| startup_timeout | int | None | Container startup timeout in seconds (useful for large models) |
| scaledown_window | int | 60 | Seconds before an idle container scales down |
| concurrent_inputs | int | 1 | Max concurrent requests per container |
| min_containers | int | 0 | Minimum warm containers (set > 0 to avoid cold starts) |
| max_containers | int | None | Maximum number of containers |
| enable_batching | bool | False | Enable dynamic request batching |
| max_batch_size | int | 8 | Max batch size when batching is enabled |
| batch_wait_ms | int | 100 | Batch wait time in milliseconds |
| extra_pip_packages | list | [] | Additional pip packages to install |

For detailed information on these options, see the Modal documentation.
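For instance, a deployment tuned to avoid cold starts might combine several of the options above (the values here are illustrative, not recommendations):

```python
# illustrative values only; tune for your model size and budget
config = {
    "gpu": "L4",
    "memory": 4096,           # MB
    "cpu": 2.0,
    "timeout": 120,           # seconds per inference request
    "scaledown_window": 300,  # keep idle containers warm longer
    "min_containers": 1,      # at least one warm container, no cold starts
    "concurrent_inputs": 4,
}
```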

Advanced Usage

GPU Selection

Modal supports a wide range of GPU types for different workloads. See Modal's GPU documentation for the full list of available GPUs and configuration options.

python
# Single GPU
config = {"gpu": "T4"} # Cost-effective for inference

# High-performance GPU
config = {"gpu": "H100"} # Best for large models

# Multi-GPU for large models
config = {"gpu": "H100:8"} # 8x H100 GPUs

# Dedicated GPU (no sharing)
config = {"gpu": "H100!"}

# Fallback list (uses first available)
config = {"gpu": ["H100", "A100", "A10"]}

High-Throughput Deployment

For high-throughput workloads, enable dynamic batching:

python
client.create_deployment(
    name="batch-classifier",
    model_uri="runs:/<run_id>/model",
    config={
        "gpu": "A100",
        "enable_batching": True,
        "max_batch_size": 32,
        "batch_wait_ms": 50,
        "min_containers": 2,
        "max_containers": 20,
    },
)
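As a rough back-of-envelope for sizing these knobs (the per-batch inference time is an assumed figure, not something the plugin reports):

```python
def per_container_throughput(max_batch_size, batch_wait_ms, infer_ms):
    """Upper bound on requests/s one container can absorb, assuming each
    batch window costs the full wait time plus one batched forward pass."""
    window_s = (batch_wait_ms + infer_ms) / 1000.0
    return max_batch_size / window_s

# with max_batch_size=32, batch_wait_ms=50, and an assumed 100 ms batch
# inference time, one container tops out around 213 requests/s
rate = per_container_throughput(32, 50, 100)
```

Multiply by max_containers for the fleet-wide ceiling; real throughput will be lower once request arrival gaps and partial batches are accounted for.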

Deploy to Specific Workspace

Deploy to a specific Modal workspace:

python
# Use workspace-specific URI
client = get_deploy_client("modal:/production")

Or via CLI:

bash
mlflow deployments create -t modal:/production -m runs:/<run_id>/model --name my-model

Managing Deployments

bash
# List all deployments
mlflow deployments list -t modal

# Get deployment info
mlflow deployments get -t modal --name my-model

# Delete deployment
mlflow deployments delete -t modal --name my-model

Troubleshooting

Authentication Errors

If deployment fails with an authentication error, re-authenticate with Modal:

bash
# Re-authenticate with Modal
modal setup

# Verify authentication
modal profile list

Deployment Times Out

For large models that take longer to load, increase the startup timeout:

python
config = {
    "startup_timeout": 600,  # 10 minutes for model loading
    "timeout": 300,  # 5 minutes for inference requests
}

Missing Dependencies

If the model fails with import errors, add missing packages:

python
config = {
    "extra_pip_packages": ["missing-package>=1.0"],
}

View Build Logs

Check the Modal Dashboard for detailed build and runtime logs.

API Reference

The mlflow-modal-deploy plugin integrates with the standard MLflow deployments interface.

Useful Links