AI Monitoring for LLMs and Agents

AI monitoring is the practice of continuously evaluating the quality, performance, cost, and safety of AI applications running in production. LLM monitoring focuses on individual model calls, tracking output quality, hallucinations, token costs, and latency, while agent monitoring extends this to multi-step reasoning, tool selection, and task completion. Both go beyond uptime and error rates to assess the quality of non-deterministic outputs and detect when behavior drifts from expected standards. Production tracing captures the execution data that makes this possible.

Unlike classical ML monitoring (which tracks feature distributions and prediction accuracy on structured data), AI monitoring must evaluate free-form language outputs, multi-step agent reasoning, tool call chains, retrieval accuracy, and token costs. Traditional monitoring can tell you the system is running; AI monitoring tells you whether it's working well.

MLflow provides a complete AI monitoring stack: automatic online evaluation with LLM judges that score traces asynchronously, configurable trace sampling for cost control, user and session context tracking for debugging, human feedback collection, and built-in scorers for hallucination detection, safety, and more. Explore the evaluation and monitoring docs.
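Trace sampling keeps monitoring cost proportional to traffic by scoring only a fraction of requests. The idea can be sketched in plain Python (an illustration of the concept, not MLflow's internal implementation): hash a stable request identifier and keep a fixed fraction of traces, so the same request always gets the same decision.

```python
import hashlib


def should_sample(request_id: str, sample_rate: float = 0.1) -> bool:
    """Deterministically keep ~sample_rate of traces, keyed on request ID."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes to a float in [0, 1) and compare to the rate.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Across many requests, roughly sample_rate of them are kept.
kept = sum(should_sample(f"req-{i}", 0.1) for i in range(10_000))
```

Hash-based sampling (rather than random sampling) means a trace is either always kept or always dropped for a given ID, which keeps retries and multi-span traces consistent.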

Why AI Monitoring Matters

Agents and LLM applications in production face challenges that don't exist during development:

Quality Drift Detection

Problem: Agent outputs degrade silently from model updates, prompt changes, or shifting user inputs.

Solution: Continuous LLM judges and human feedback detect quality regressions before users lose trust.
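The core of drift detection is comparing recent judge scores against a baseline window. A minimal sketch of the idea (illustrative logic, not an MLflow API):

```python
from statistics import mean


def detect_quality_drift(scores: list[float], window: int = 100,
                         threshold: float = 0.1) -> bool:
    """Flag drift when the mean judge score in the most recent window
    drops more than `threshold` below the baseline window's mean."""
    if len(scores) < 2 * window:
        return False  # not enough data to compare yet
    baseline = mean(scores[:window])
    recent = mean(scores[-window:])
    return baseline - recent > threshold


healthy = [0.9] * 200                       # stable quality
degraded = [0.9] * 100 + [0.7] * 100        # quality dropped after a change
# detect_quality_drift(healthy) -> False
# detect_quality_drift(degraded) -> True
```

Production systems typically alert on this signal and link it back to the traces that scored poorly, rather than acting on the aggregate alone.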

Cost and Latency Control

Problem: Token costs and latency can spiral without visibility into per-request spending and response times.

Solution: Automatic cost/token tracking with per-model breakdowns and anomaly detection.
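A per-model cost breakdown amounts to multiplying token counts by unit prices and aggregating per request. A minimal sketch (the prices below are hypothetical placeholders, and this is not MLflow's built-in tracker):

```python
from collections import defaultdict

# Hypothetical per-1M-token (input, output) prices; real prices vary by provider.
PRICES = {"model-large": (2.50, 10.00), "model-small": (0.15, 0.60)}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD, given token counts and unit prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def per_model_spend(requests):
    """Aggregate cost per model over (model, input_tokens, output_tokens) rows."""
    totals = defaultdict(float)
    for model, in_tok, out_tok in requests:
        totals[model] += cost_usd(model, in_tok, out_tok)
    return dict(totals)


spend = per_model_spend([("model-large", 1000, 500), ("model-small", 1000, 500)])
```

With per-model totals in hand, anomaly detection reduces to watching for spend that deviates from each model's historical baseline.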

Safety and Security

Problem: Production agents face prompt injection, PII leakage, jailbreaks, and policy violations that don't exist in development.

Solution: Real-time safety scoring with deterministic and LLM-based detectors on every request.
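A deterministic detector can be as simple as pattern rules applied to inputs or outputs before they leave the system. A sketch with two illustrative PII patterns (real deployments layer many detectors, including LLM-based ones):

```python
import re

# Illustrative rules only; production systems combine many detectors.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_pii(text: str) -> list[str]:
    """Return the names of PII patterns found in the text."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]


detect_pii("Contact jane@example.com, SSN 123-45-6789")
# -> ["email", "ssn"]
```

Deterministic checks like this run in microseconds on every request; LLM-based detectors are typically reserved for sampled traces because of their cost and latency.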

Production Debugging

Problem: When quality drops or errors spike, tracing the root cause across multi-step agent workflows is complex.

Solution: Full execution traces with assessment scores enable rapid root-cause analysis.

AI Monitoring Use Cases

  • Hallucination detection in RAG systems: Run groundedness scorers on production traces to catch when retrieval quality degrades or the model starts generating claims unsupported by the retrieved context.
  • Agent tool selection monitoring: Track whether agents pick the right tools and complete tasks efficiently. Detect loops, unnecessary retries, and incorrect tool selections that waste tokens and degrade user experience.
  • Cost optimization: Identify expensive queries, track per-model spend trends, and find opportunities to switch to cheaper models for low-complexity requests without sacrificing quality.
  • Safety regression detection: After model or prompt updates, compare safety scores against pre-deployment baselines to catch regressions before they affect users at scale.
  • A/B testing prompt changes: Compare quality scores, latency, and cost across prompt variants using production trace data to make data-driven decisions about which version to keep.
  • Compliance and audit in regulated industries: Healthcare, finance, and legal teams need to prove their AI systems behave correctly and safely. AI monitoring provides full audit trails of every input, output, and model interaction for regulatory review.
  • Latency SLA monitoring: For user-facing chatbots, coding assistants, and real-time agents where response time directly impacts user experience. Track p50/p95/p99 latency and time-to-first-token to catch performance regressions before they affect retention.
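The p50/p95/p99 tracking mentioned above reduces to sorting observed latencies and indexing by rank. A minimal nearest-rank sketch:

```python
def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile (p in [0, 100]) over observed latencies."""
    ordered = sorted(latencies_ms)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]


latencies = [120, 135, 140, 150, 900, 160, 145, 130, 155, 2400]
p50 = percentile(latencies, 50)  # typical request
p95 = percentile(latencies, 95)  # tail latency an SLA usually targets
```

Note how two slow outliers barely move the p50 but dominate the p95, which is why SLAs target tail percentiles rather than averages.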

How to Implement AI Monitoring

MLflow provides an open-source AI monitoring stack that covers tracing, automatic quality evaluation with LLM judges, cost and token tracking, human feedback collection, and real-time safety guardrails, compatible with any LLM provider and any agent framework. Here's how to set it up.

1. Trace Every Request: Add production tracing with @mlflow.trace to capture execution graphs. Attach user, session, and deployment context.
2. Score Traces Automatically: Set up automatic LLM judge evaluation to score production traces for safety, correctness, and quality drift in the background.
3. Collect Human Feedback: Use mlflow.log_feedback() to record user ratings linked to traces. Catch quality issues that automated judges miss and calibrate scoring over time.
4. Track Costs & Enforce Guardrails: Integrate automatic token and cost tracking per request. The AI Gateway adds real-time safety guardrails.

Trace production requests with context

```python
import mlflow
import os
from fastapi import FastAPI, Request
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    message: str


@app.post("/chat")
@mlflow.trace
def handle_chat(request: Request, chat_request: ChatRequest):
    # Attach production context to every trace
    mlflow.update_current_trace(
        client_request_id=request.headers.get("X-Request-ID"),
        tags={
            "mlflow.trace.session": request.headers.get("X-Session-ID"),
            "mlflow.trace.user": request.headers.get("X-User-ID"),
            "environment": "production",
            "app_version": os.getenv("APP_VERSION", "1.0.0"),
            "deployment_id": os.getenv("DEPLOYMENT_ID", "unknown"),
        },
    )
    response = generate_response(chat_request.message)  # your app's LLM call
    return {"response": response}
```

Register judges for automatic online evaluation
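A hedged sketch of registering a built-in judge for automatic online evaluation, based on MLflow 3.x's scorer API; names such as `Safety` and `ScorerSamplingConfig` are assumptions that may vary by MLflow version and tracking backend:

```python
import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

mlflow.set_experiment("production-chat")

# Register a built-in LLM judge against the experiment, then start
# background scoring on a sample of production traces (here, 20%).
safety = Safety().register(name="safety")
safety = safety.start(sampling_config=ScorerSamplingConfig(sample_rate=0.2))
```

Once started, the judge scores incoming traces asynchronously, so scoring adds no latency to the request path.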

Collect user feedback on traces

```python
import mlflow
from mlflow.entities import AssessmentSource
from fastapi import FastAPI

app = FastAPI()


@app.post("/feedback")
def submit_feedback(trace_id: str, is_correct: bool, user_id: str):
    # Attach the user's rating to the trace that produced the response
    mlflow.log_feedback(
        trace_id=trace_id,
        name="response_is_correct",
        value=is_correct,
        source=AssessmentSource(
            source_type="HUMAN",
            source_id=user_id,
        ),
    )
    return {"status": "ok"}
```

MLflow is the largest open-source AI engineering platform, with over 30 million monthly downloads. Thousands of organizations use MLflow to debug, evaluate, monitor, and optimize production-quality AI agents and LLM applications while controlling costs and managing access to models and data. Backed by the Linux Foundation and licensed under Apache 2.0, MLflow provides a complete AI monitoring solution with no vendor lock-in. Get started →

Open Source vs. Proprietary AI Monitoring Tools

When choosing an AI monitoring platform for agents and LLM applications, the decision between open source and proprietary SaaS tools has significant long-term implications for your team, infrastructure, and data ownership.

Open Source (MLflow): With MLflow, you maintain complete control over your production traces and monitoring data. Deploy on your own infrastructure or use managed versions on Databricks, AWS, or other platforms. There are no per-trace fees, no usage limits, and no vendor lock-in. Your production data stays under your control, and OpenTelemetry compatibility ensures you can export traces to any backend.

Proprietary SaaS Tools: Commercial monitoring platforms offer convenience but at the cost of flexibility and control. They typically charge per trace or per seat, which can become expensive at scale. Your production data is sent to their servers, raising privacy and compliance concerns for sensitive traces. You're locked into their ecosystem, making it difficult to switch providers or customize functionality.

Why Teams Choose Open Source: Organizations running production agents increasingly choose MLflow because it offers enterprise-grade monitoring without compromising on data sovereignty, cost predictability, or flexibility. The Apache 2.0 license and Linux Foundation backing ensure MLflow remains truly open and community-driven, not controlled by a single vendor.

Frequently Asked Questions

What is AI monitoring?

AI monitoring is the practice of continuously assessing the quality, performance, cost, and safety of AI applications running in production, including LLM and agent-based systems. Unlike traditional software monitoring (uptime, error rates), AI monitoring must evaluate the quality of non-deterministic text outputs, track token costs, detect hallucinations, and identify when model behavior drifts from expected standards.
