Use Predefined LLM Scorers

MLflow provides several pre-configured LLM judge scorers optimized for common evaluation scenarios.

tip

Predefined scorers are the quickest way to get started with evaluation. However, every AI application is unique and has domain-specific quality criteria, so at some point you will likely need to create your own custom LLM scorers, for example when:

  • Your application has complex inputs/outputs that predefined scorers can't parse
  • You need to evaluate specific business logic or domain-specific criteria
  • You want to combine multiple evaluation aspects into a single scorer

See the custom LLM scorers guide for detailed examples, or start from the minimal sketch below.
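
A custom scorer can be as small as a decorated Python function. The following is a minimal sketch, assuming the @scorer decorator from mlflow.genai.scorers; the matching logic is purely illustrative:

import mlflow
from mlflow.genai.scorers import scorer

# Illustrative custom scorer: passes when the response mentions MLflow by name.
@scorer
def mentions_mlflow(outputs: str) -> bool:
    return "mlflow" in outputs.lower()

# Custom scorers are passed to evaluate() alongside the predefined ones, e.g.:
# results = mlflow.genai.evaluate(data=eval_dataset, scorers=[mentions_mlflow])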

Example Usage

To use the predefined LLM scorers, select the scorer class from the available scorers and pass it to the scorers argument of the evaluate function.

import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Guidelines

eval_dataset = [
    {
        "inputs": {"query": "What is the most common aggregate function in SQL?"},
        "outputs": "The most common aggregate function in SQL is SUM().",
        # The Correctness scorer requires an "expected_facts" field.
        "expectations": {
            "expected_facts": ["Most common aggregate function in SQL is COUNT()."],
        },
    },
    {
        "inputs": {"query": "How do I use MLflow?"},
        # Intentionally verbose answer
        "outputs": "Hi, I'm a chatbot that answers questions about MLflow. Thank you for asking a great question! I know MLflow well and I'm glad to help you with that. You will love it! MLflow is a Python-based platform that provides a comprehensive set of tools for logging, tracking, and visualizing machine learning models and experiments throughout their entire lifecycle. It consists of four main components: MLflow Tracking for experiment management, MLflow Projects for reproducible runs, MLflow Models for standardized model packaging, and MLflow Model Registry for centralized model lifecycle management. To get started, simply install it with 'pip install mlflow' and then use mlflow.start_run() to begin tracking your experiments with automatic logging of parameters, metrics, and artifacts. The platform creates a beautiful web UI where you can compare different runs, visualize metrics over time, and manage your entire ML workflow efficiently. MLflow integrates seamlessly with popular ML libraries like scikit-learn, TensorFlow, PyTorch, and many others, making it incredibly easy to incorporate into your existing projects!",
        "expectations": {
            "expected_facts": [
                "MLflow is a tool for managing and tracking machine learning experiments."
            ],
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Correctness(),
        RelevanceToQuery(),
        # Guidelines is a special scorer that takes user-defined criteria for evaluation.
        # See the "Customizing LLM Judges" section below for more details.
        Guidelines(
            name="is_concise",
            guidelines="The answer must be concise and straight to the point.",
        ),
    ],
)
(Screenshot: predefined LLM scorer results in the MLflow UI)
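
The call logs each scorer's assessments to an MLflow run that you can browse in the MLflow UI. A minimal sketch for inspecting the result programmatically, assuming the returned object exposes run_id and metrics attributes (attribute names may vary across MLflow versions):

# Assumptions: `results` exposes `run_id` (the MLflow run holding the evaluation)
# and `metrics` (aggregated scorer results); verify against your MLflow version.
print(results.run_id)
print(results.metrics)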

Available Scorers

Scorer                 | What does it evaluate?                                        | Requires ground-truth?
RelevanceToQuery       | Does the app's response directly address the user's input?   | No
Correctness            | Is the app's response correct compared to ground-truth?      | Yes
Guidelines             | Does the response adhere to provided guidelines?             | No
ExpectationsGuidelines | Does the response meet specific expectations and guidelines? | Yes
Safety                 | Does the app's response avoid harmful or toxic content?      | No
RetrievalGroundedness  | Is the app's response grounded in retrieved information?     | No
RetrievalRelevance     | Are retrieved documents relevant to the user's request?      | No
RetrievalSufficiency   | Do retrieved documents contain all necessary information?    | Yes
Availability

Safety and RetrievalRelevance scorers are currently only available in Databricks managed MLflow and will be open-sourced soon.

Evaluating Retrieval

The built-in scorers for evaluating retrieval (RetrievalGroundedness, RetrievalRelevance, RetrievalSufficiency) require traces to include one or more spans of type RETRIEVER. If you use an automatic tracing integration, MLflow sets the span type for you; otherwise, mark the retrieval step yourself, as in the sketch below.
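
For manually traced applications, a minimal sketch using the mlflow.trace decorator and SpanType from mlflow.entities; the retriever body and the document shape are placeholders:

import mlflow
from mlflow.entities import SpanType

# Mark the retrieval step as a RETRIEVER span so the retrieval scorers can find it.
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_docs(query: str) -> list[dict]:
    # Placeholder: replace with your vector-store or search lookup.
    return [{"page_content": "MLflow is an open source MLOps platform."}]

@mlflow.trace(span_type=SpanType.CHAIN)
def answer(query: str) -> str:
    docs = retrieve_docs(query)
    context = " ".join(d["page_content"] for d in docs)
    return f"Answer based on retrieved context: {context}"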

Selecting Judge Models

MLflow supports judge models from all major LLM providers, including OpenAI, Anthropic, Google, and xAI.

See Supported Models for more details.
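
To use a specific judge model instead of the default, the predefined scorers can be pointed at one. A minimal sketch, assuming they accept a model argument in the "<provider>:/<model-name>" URI format (the same format that appears in the assessment source below); check your MLflow version's API reference for the exact parameter:

from mlflow.genai.scorers import Correctness, Guidelines

# Assumption: predefined scorers accept an optional `model` argument taking a
# "<provider>:/<model-name>" URI.
scorers = [
    Correctness(model="openai:/gpt-4o-mini"),
    Guidelines(
        name="is_concise",
        guidelines="The answer must be concise and straight to the point.",
        model="openai:/gpt-4o",
    ),
]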

Output Format

Predefined LLM-based scorers in MLflow return structured assessments with three key components:

  • Score: Binary output ("yes" or "no"), rendered as Pass or Fail in the UI.
  • Rationale: Detailed explanation of why the judge made its decision.
  • Source: Metadata about the evaluation source, including the judge model used.
score: "yes"  # or "no"
rationale: "The response accurately addresses the user's question about machine learning concepts, providing clear definitions and relevant examples. The information is factually correct and well-structured."
source: AssessmentSource(
source_type="LLM_JUDGE",
source_id="openai:/gpt-4o-mini"
)
Why Binary Scores?

Binary scoring provides clearer, more consistent evaluations compared to numeric scales (1-5). Research shows that LLMs produce more reliable judgments when asked to make binary decisions rather than rating on a scale. Binary outputs also simplify threshold-based decision making in production systems.

Next Steps