Production Monitoring for GenAI Applications
Machine learning projects don't conclude with their initial launch. Ongoing monitoring and incremental enhancements are critical for long-term success. MLflow Tracing offers comprehensive observability for your production applications, supporting the iterative process of continuous improvement while ensuring quality delivery to users.
GenAI applications face unique challenges that make production monitoring essential. Quality drift can occur over time due to model updates, data distribution shifts, or new user interaction patterns. The operational complexity of multi-step workflows involving LLMs, vector databases, and retrieval systems creates multiple potential failure points that need continuous oversight. Cost management becomes critical as token usage and API costs can vary significantly based on user behavior and model performance.
Key Monitoring Areas
Understanding what to monitor helps you focus on metrics that actually impact user experience and business outcomes. Rather than trying to monitor everything, focus on areas that provide actionable insights for your specific application and user base.
Operational Metrics
Performance and Reliability: Monitor end-to-end response times from user request to final response, including LLM inference latency, retrieval system performance, and component-level bottlenecks. Track overall error rates, LLM API failures, timeout occurrences, and dependency failures to maintain system reliability.
Resource Utilization: Monitor token consumption patterns, API costs, request throughput, and system resource usage to optimize performance and control spend.
Business Metrics: Track user engagement rates, session completion rates, feature adoption, and user satisfaction scores to understand the business impact of your application.
Quality Metrics
Response Quality: Assess response relevance to user queries, factual accuracy, completeness, and consistency across similar queries to ensure your application meets user needs.
Safety and Compliance: Monitor for harmful content, bias, privacy compliance, and regulatory adherence; this is especially important for applications in regulated industries.
User Experience Quality: Track response helpfulness, clarity and readability, appropriate tone and style, and multi-turn conversation quality to optimize user satisfaction.
Domain-Specific Quality: Implement metrics suited to your application type, such as technical accuracy for specialized domains, citation quality for RAG applications, code quality for programming assistants, or creative quality for content generation.
Business Impact
User Behavior: Monitor session duration and depth, feature usage patterns, user retention rates, and conversion metrics to understand how users engage with your application.
Operational Efficiency: Track support ticket reduction, process automation success, time savings for users, and task completion rates to measure operational improvements.
Cost-Benefit Analysis: Compare infrastructure costs against the value delivered, ROI on your GenAI investment, productivity improvements, and customer satisfaction impact to justify and optimize your GenAI initiatives.
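Once traces are flowing into MLflow (see the setup in the next section), many of the operational metrics above can be computed directly from logged trace data. The following is a minimal sketch, assuming mlflow.search_traces returns a pandas DataFrame with status and execution_time_ms columns (column names can vary between MLflow versions) and using a placeholder experiment ID:

import mlflow

# Fetch recent traces for the production experiment (placeholder ID)
traces = mlflow.search_traces(experiment_ids=["1"])

# Derive basic operational metrics from the trace table
error_rate = (traces["status"] != "OK").mean()
p95_latency_ms = traces["execution_time_ms"].quantile(0.95)
print(f"Error rate: {error_rate:.2%}, p95 latency: {p95_latency_ms:.0f} ms")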
Setting Up Tracing for Production Endpoints
When deploying your GenAI application to production, you need to configure MLflow Tracing to send traces to your MLflow tracking server. This configuration forms the foundation for all production observability capabilities.
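For example, a minimal application setup might look like the sketch below; the server URL, experiment name, and answer_question function are placeholders, and the same settings can instead be supplied through the environment variables described later in this section.

import mlflow

# Point traces at your tracking server and group them under one experiment
# (placeholder values; both can also be set via environment variables)
mlflow.set_tracking_uri("http://your-mlflow-server:5000")
mlflow.set_experiment("production-genai-app")

# Optionally enable automatic tracing for supported libraries, e.g.:
# mlflow.openai.autolog()

@mlflow.trace
def answer_question(question: str) -> str:
    # Retrieval and LLM calls made here appear as child spans on the trace
    return f"Answer to: {question}"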
Pro Tip: Using the Lightweight Tracing SDK
The MLflow Tracing SDK mlflow-tracing is a lightweight package that includes only the minimum set of dependencies needed to instrument your code, models, and agents with MLflow Tracing. It offers several advantages for production deployments:
⚡️ Faster Deployment: Significantly smaller package size and fewer dependencies enable quicker deployments in containers and serverless environments
🔧 Simple Dependency Management: Reduced dependencies mean less maintenance overhead and fewer potential conflicts
📦 Enhanced Portability: Easily deploy across different platforms with minimal compatibility concerns
🔒 Improved Security: Smaller attack surface with fewer dependencies reduces security risks
🚀 Performance Optimizations: Optimized for high-volume tracing in production environments
When installing the MLflow Tracing SDK, make sure the environment does not have the full MLflow package installed. Having both packages in the same environment might cause conflicts and unexpected behaviors.
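For example, a production image would typically install just the tracing SDK:

# Install only the lightweight tracing SDK (not the full mlflow package)
pip install mlflow-tracing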
Environment Variable Configuration
Configure the following environment variables in your production environment. See Production Monitoring Configurations below for more details.
# Required: Set MLflow Tracking URI
export MLFLOW_TRACKING_URI="http://your-mlflow-server:5000"
# Optional: Configure the experiment name for organizing traces
export MLFLOW_EXPERIMENT_NAME="production-genai-app"
# Optional: Configure async logging (recommended for production)
export MLFLOW_ENABLE_ASYNC_TRACE_LOGGING=true
export MLFLOW_ASYNC_TRACE_LOGGING_MAX_WORKERS=10
export MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE=1000
# Optional: Configure trace sampling ratio (default is 1.0)
export MLFLOW_TRACE_SAMPLING_RATIO=0.1
Self-Hosted Tracking Server
You can use the MLflow tracking server to store production traces. However, the tracking server is optimized for offline experiment tracking and is generally not designed to handle hyper-scale trace ingestion. For high-volume production workloads, consider the OpenTelemetry integration with a dedicated observability platform.
If you choose to use the tracking server in production, we strongly recommend that you:
- Use a SQL-based tracking server backed by a scalable database and artifact store
- Configure proper indexing on trace tables for better query performance
- Set up periodic deletion for trace data management
- Monitor server performance and scale appropriately
Refer to the tracking server setup guide for more details.
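For illustration, a SQL-backed server with cloud artifact storage can be launched along these lines (the connection string and bucket below are placeholders):

# Placeholder database connection string and bucket; adjust for your environment
mlflow server \
  --backend-store-uri postgresql://mlflow:password@db-host:5432/mlflow \
  --artifacts-destination s3://your-bucket/mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000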
Performance Considerations
Database: Use PostgreSQL or MySQL rather than SQLite for production deployments; they handle concurrent writes far better.
Storage: Use cloud storage (S3, Azure Blob, GCS) for artifact storage to ensure scalability and reliability.
Indexing: Ensure proper indexes on timestamp_ms, status, and frequently queried tag columns to maintain query performance as trace volume grows.
Retention: Implement data retention policies to manage storage costs and maintain system performance over time.
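As one way to implement retention, older traces can be purged on a schedule. The sketch below assumes the MlflowClient.delete_traces API and a 30-day window; the experiment ID is a placeholder.

import time
from mlflow import MlflowClient

client = MlflowClient()
# Delete traces older than 30 days in the given experiment (run this on a schedule)
cutoff_ms = int(time.time() * 1000) - 30 * 24 * 60 * 60 * 1000
client.delete_traces(experiment_id="1", max_timestamp_millis=cutoff_ms)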
Docker Deployment Example
When deploying with Docker, pass environment variables through your container configuration:
# Dockerfile
FROM python:3.9-slim
WORKDIR /app
# Install dependencies (requirements.txt should include mlflow-tracing)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Set default environment variables (can be overridden at runtime)
ENV MLFLOW_TRACKING_URI=""
ENV MLFLOW_EXPERIMENT_NAME="production-genai-app"
ENV MLFLOW_ENABLE_ASYNC_TRACE_LOGGING=true
CMD ["python", "app.py"]
Run the container with environment variables:
docker run -d \
-e MLFLOW_TRACKING_URI="http://your-mlflow-server:5000" \
-e MLFLOW_EXPERIMENT_NAME="production-genai-app" \
-e MLFLOW_ENABLE_ASYNC_TRACE_LOGGING=true \
-e APP_VERSION="1.0.0" \
your-app:latest
Kubernetes Deployment Example
For Kubernetes deployments, use ConfigMaps and Secrets:
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mlflow-config
data:
  MLFLOW_TRACKING_URI: 'http://mlflow-server:5000'
  MLFLOW_EXPERIMENT_NAME: 'production-genai-app'
  MLFLOW_ENABLE_ASYNC_TRACE_LOGGING: 'true'
---
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: genai-app
spec:
  selector:
    matchLabels:
      app: genai-app
  template:
    metadata:
      labels:
        app: genai-app
    spec:
      containers:
        - name: app
          image: your-app:latest
          envFrom:
            - configMapRef:
                name: mlflow-config
          env:
            - name: APP_VERSION
              value: '1.0.0'
OpenTelemetry Backends
MLflow Traces can be exported to any OpenTelemetry-compatible backend. See the OpenTelemetry Integration documentation for more details.
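For instance, assuming you are running an OpenTelemetry Collector and have the OTLP exporter installed, routing traces to it is typically a matter of setting the standard OTLP environment variables (the endpoint below is a placeholder for your collector):

# Install the OTLP exporter alongside the tracing SDK
pip install opentelemetry-exporter-otlp

# Send traces to your collector instead of the MLflow tracking server
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://your-otel-collector:4317"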
Managed Monitoring with Databricks
Databricks also offers a managed solution for monitoring your GenAI applications that integrates with MLflow Tracing.

Capabilities include:
- Track operational metrics like request volume, latency, errors, and cost.
- Monitor quality metrics such as correctness, safety, context sufficiency, and more using managed evaluation.
- Configure custom metrics with Python functions.
- Perform root cause analysis by drilling into the traces recorded by MLflow Tracing.
- Monitor applications hosted outside of Databricks.