Use Predefined LLM Scorers
MLflow provides several pre-configured LLM judge scorers optimized for common evaluation scenarios.
Typically, you can get started with evaluation using the predefined scorers. However, every AI application is unique and has domain-specific quality criteria, so at some point you will likely need to create your own custom LLM scorers, for example when:
- Your application has complex inputs/outputs that predefined scorers can't parse
- You need to evaluate specific business logic or domain-specific criteria
- You want to combine multiple evaluation aspects into a single scorer
See the custom LLM scorers guide for detailed examples.
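As a quick illustration, the sketch below shows a minimal custom scorer. It assumes the scorer decorator from mlflow.genai.scorers and that a boolean return value is recorded as pass/fail; the criterion itself (checking that the response mentions MLflow by name) is purely illustrative.

import mlflow
from mlflow.genai.scorers import scorer


# Illustrative custom scorer: passes if the response mentions MLflow by name.
@scorer
def mentions_mlflow(inputs, outputs):
    return "MLflow" in outputs


mlflow.genai.evaluate(
    data=[
        {
            "inputs": {"query": "What is MLflow?"},
            "outputs": "MLflow is an open source MLOps platform.",
        }
    ],
    scorers=[mentions_mlflow],
)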
Example Usage
To use a predefined LLM scorer, pick the scorer class from the list of available scorers below and pass an instance of it to the scorers argument of mlflow.genai.evaluate.
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Guidelines

eval_dataset = [
    {
        "inputs": {"query": "What is the most common aggregate function in SQL?"},
        "outputs": "The most common aggregate function in SQL is SUM().",
        # The Correctness scorer requires an "expected_facts" field.
        "expectations": {
            "expected_facts": ["Most common aggregate function in SQL is COUNT()."],
        },
    },
    {
        "inputs": {"query": "How do I use MLflow?"},
        # Intentionally verbose answer
        "outputs": "Hi, I'm a chatbot that answers questions about MLflow. Thank you for asking a great question! I know MLflow well and I'm glad to help you with that. You will love it! MLflow is a Python-based platform that provides a comprehensive set of tools for logging, tracking, and visualizing machine learning models and experiments throughout their entire lifecycle. It consists of four main components: MLflow Tracking for experiment management, MLflow Projects for reproducible runs, MLflow Models for standardized model packaging, and MLflow Model Registry for centralized model lifecycle management. To get started, simply install it with 'pip install mlflow' and then use mlflow.start_run() to begin tracking your experiments with automatic logging of parameters, metrics, and artifacts. The platform creates a beautiful web UI where you can compare different runs, visualize metrics over time, and manage your entire ML workflow efficiently. MLflow integrates seamlessly with popular ML libraries like scikit-learn, TensorFlow, PyTorch, and many others, making it incredibly easy to incorporate into your existing projects!",
        "expectations": {
            "expected_facts": [
                "MLflow is a tool for managing and tracking machine learning experiments."
            ],
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Correctness(),
        RelevanceToQuery(),
        # Guidelines is a special scorer that takes user-defined criteria for evaluation.
        # See the "Customizing LLM Judges" section below for more details.
        Guidelines(
            name="is_concise",
            guidelines="The answer must be concise and straight to the point.",
        ),
    ],
)
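After the evaluation completes, the aggregated scores can be inspected programmatically. The snippet below assumes the returned result object exposes a metrics dictionary (exact metric key names may vary by MLflow version); per-row scores and rationales are best viewed in the MLflow UI.

# Aggregated metrics keyed by scorer name; exact key names may vary by MLflow version.
print(results.metrics)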

Available Scorers
Scorer | What does it evaluate? | Requires ground-truth? |
---|---|---|
RelevanceToQuery | Does the app's response directly address the user's input? | No |
Correctness | Is the app's response correct compared to ground-truth? | Yes |
Guidelines | Does the response adhere to provided guidelines? | No |
ExpectationsGuidelines | Does the response meet specific expectations and guidelines? | Yes |
Safety | Does the app's response avoid harmful or toxic content? | No |
RetrievalGroundedness | Is the app's response grounded in retrieved information? | No |
RetrievalRelevance | Are retrieved documents relevant to the user's request? | No |
RetrievalSufficiency | Do retrieved documents contain all necessary information? | Yes |
Safety and RetrievalRelevance scorers are currently only available in Databricks managed MLflow and will be open-sourced soon.
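Unlike Guidelines, the ExpectationsGuidelines scorer reads per-example guidelines from the dataset's expectations field. A minimal sketch, assuming the guidelines are supplied under a "guidelines" key:

import mlflow
from mlflow.genai.scorers import ExpectationsGuidelines

eval_dataset = [
    {
        "inputs": {"query": "How do I install MLflow?"},
        "outputs": "Run pip install mlflow.",
        # Assumes per-example guidelines are read from the "guidelines" key of expectations.
        "expectations": {"guidelines": ["The response must mention pip."]},
    },
]

mlflow.genai.evaluate(data=eval_dataset, scorers=[ExpectationsGuidelines()])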
Built-in scorers for evaluating retrieval (RetrievalGroundedness, RetrievalRelevance, RetrievalSufficiency) require traces to include one or more spans of type RETRIEVER. If you use an automatic tracing integration, MLflow sets the span type for you.
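If you trace your application manually, you can mark the retrieval step yourself. A minimal sketch, where the retriever function and its hard-coded documents are purely illustrative:

import mlflow
from mlflow.entities import SpanType


# Marking this span as RETRIEVER lets the retrieval scorers locate the retrieved
# documents in the trace. The retriever itself is a stand-in for your own logic.
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_docs(query: str) -> list[str]:
    return ["MLflow is an open source MLOps platform."]


@mlflow.trace
def answer(query: str) -> str:
    docs = retrieve_docs(query)
    return f"Based on the docs: {docs[0]}"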
Selecting Judge Models
MLflow supports all major LLM providers as judge models, including OpenAI, Anthropic, Google, and xAI.
See Supported Models for more details.
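For example, a predefined scorer can be pointed at a specific judge model. The sketch below assumes the scorers accept a model argument in the <provider>:/<model-name> URI format shown in the Output Format section.

from mlflow.genai.scorers import Correctness, RelevanceToQuery

# Assumes the predefined scorers accept a `model` argument using the
# "<provider>:/<model-name>" URI format.
scorers = [
    Correctness(model="openai:/gpt-4o-mini"),
    RelevanceToQuery(model="openai:/gpt-4o-mini"),
]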
Output Format
Predefined LLM-based scorers in MLflow return structured assessments with three key components:
- Score: A binary output ("yes"/"no") that renders as Pass or Fail in the UI.
- Rationale: A detailed explanation of why the judge made its decision.
- Source: Metadata about the evaluation source.

For example:
score: "yes" # or "no"
rationale: "The response accurately addresses the user's question about machine learning concepts, providing clear definitions and relevant examples. The information is factually correct and well-structured."
source: AssessmentSource(
source_type="LLM_JUDGE",
source_id="openai:/gpt-4o-mini"
)
Binary scoring provides clearer, more consistent evaluations compared to numeric scales (1-5). Research shows that LLMs produce more reliable judgments when asked to make binary decisions rather than rating on a scale. Binary outputs also simplify threshold-based decision making in production systems.
Next Steps
- Guidelines Scorer: Learn how to use the Guidelines scorer to evaluate responses against custom criteria.
- Evaluate Agents: Learn how to evaluate AI agents with specialized techniques and scorers.
- Evaluate Traces: Evaluate production traces to understand and improve your AI application's behavior.