Evaluate & Monitor
MLflow's evaluation and monitoring capabilities help you systematically measure, improve, and maintain the quality of your GenAI applications throughout their lifecycle, from development through production.
Why Evaluate GenAI Applications?
Quality Assurance
Ensure your AI consistently produces accurate, helpful, and safe responses across different inputs and contexts.
Continuous Improvement
Track performance over time and identify specific areas where your AI can be enhanced through systematic evaluation.
Human-AI Collaboration
Combine automated evaluation with human expertise to create comprehensive quality assessment workflows.
Production Monitoring
Monitor AI performance in real-time production environments to maintain quality standards and catch issues early.
Feedback & Expectations
MLflow provides two complementary approaches to GenAI evaluation that work together to create comprehensive quality assessment:
Feedback captures quality evaluations of how well your AI actually performed. It can come from multiple sources, as sketched in the example after this list:
- Human reviewers providing expert judgment on response quality
- LLM judges offering automated evaluation at scale
- Programmatic checks validating format, compliance, and business rules
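For example, here is a minimal sketch of recording feedback against a trace with MLflow's assessment APIs; the trace ID and judge identifier are placeholders:

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Attach an LLM judge's quality evaluation to an existing trace.
# "tr-1234" and "gpt-4o" are placeholder identifiers.
mlflow.log_feedback(
    trace_id="tr-1234",
    name="relevance",
    value=0.9,
    rationale="The response directly addresses the user's question.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.LLM_JUDGE,
        source_id="gpt-4o",
    ),
)
```

The same call records human or programmatic feedback by switching the source type to HUMAN or CODE.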
Expectations define the ground truth: what your AI should produce for specific inputs. These establish reference points for objective accuracy measurement and enable systematic testing against known correct answers.
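As a companion sketch, an expectation can be attached to the same trace with mlflow.log_expectation (the trace ID and reviewer identity are placeholders):

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Record the ground-truth answer a reviewer expects for this trace.
mlflow.log_expectation(
    trace_id="tr-1234",  # placeholder trace ID
    name="expected_answer",
    value="MLflow is an open-source platform for managing the ML lifecycle.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="reviewer@example.com",  # placeholder reviewer identity
    ),
)
```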
Together, feedback and expectations enable you to measure both subjective quality and objective accuracy, creating a complete evaluation framework for your GenAI applications.
Feedback Collection
Capture quality evaluations from LLM judges, programmatic checks, and human reviewers
Ground Truth Expectations
Define expected outputs and correct answers to establish quality baselines
Additional Evaluation Options
LLM Evaluation (Legacy)
MLflow's evaluation capabilities built on the classic mlflow.evaluate API, for self-hosted or local MLflow deployments
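As an illustrative sketch of the classic API (the model URI and dataset below are made up):

```python
import mlflow
import pandas as pd

# A tiny question-answering dataset with ground-truth targets.
eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "ground_truth": [
            "MLflow is an open-source platform for managing the ML lifecycle."
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/qa-model/1",  # placeholder registered-model URI
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",  # enables built-in QA metrics
    )
    print(results.metrics)
```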
New Evaluation Suite (Managed-Only)
MLflow 3 introduces a new evaluation suite for LLMs/GenAI, available in Managed MLflow on Databricks
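A hedged sketch of the new suite, assuming the mlflow.genai.evaluate entry point and the built-in Correctness scorer are available in your Databricks-connected environment; the dataset is illustrative:

```python
import mlflow
from mlflow.genai.scorers import Correctness

# Each record pairs an input with the app's output and the expected answer.
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source MLOps platform.",
        "expectations": {
            "expected_response": "MLflow is an open-source platform "
            "for managing the ML lifecycle."
        },
    }
]

# Correctness scores each output against the expected response.
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[Correctness()],
)
```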