
Predefined LLM Scorers

MLflow provides several pre-configured LLM judge scorers optimized for common evaluation scenarios.

tip

Typically, you can get started with evaluation using predefined scorers. However, every AI application is unique and has domain-specific quality criteria, so at some point you'll need to create your own custom LLM scorers, for example when:

  • Your application has complex inputs/outputs that predefined scorers can't parse
  • You need to evaluate specific business logic or domain-specific criteria
  • You want to combine multiple evaluation aspects into a single scorer

See the custom LLM scorers guide for detailed examples.

Example Usage

To use a predefined LLM scorer, select a scorer class from the available scorers below and pass an instance of it to the scorers argument of mlflow.genai.evaluate().

python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Guidelines

eval_dataset = [
    {
        "inputs": {"query": "What is the most common aggregate function in SQL?"},
        "outputs": "The most common aggregate function in SQL is SUM().",
        # The Correctness scorer requires an "expected_facts" field.
        "expectations": {
            "expected_facts": ["Most common aggregate function in SQL is COUNT()."],
        },
    },
    {
        "inputs": {"query": "How do I use MLflow?"},
        # Verbose answer
        "outputs": "Hi, I'm a chatbot that answers questions about MLflow. Thank you for asking a great question! I know MLflow well and I'm glad to help you with that. You will love it! MLflow is a Python-based platform that provides a comprehensive set of tools for logging, tracking, and visualizing machine learning models and experiments throughout their entire lifecycle. It consists of four main components: MLflow Tracking for experiment management, MLflow Projects for reproducible runs, MLflow Models for standardized model packaging, and MLflow Model Registry for centralized model lifecycle management. To get started, simply install it with 'pip install mlflow' and then use mlflow.start_run() to begin tracking your experiments with automatic logging of parameters, metrics, and artifacts. The platform creates a beautiful web UI where you can compare different runs, visualize metrics over time, and manage your entire ML workflow efficiently. MLflow integrates seamlessly with popular ML libraries like scikit-learn, TensorFlow, PyTorch, and many others, making it incredibly easy to incorporate into your existing projects!",
        "expectations": {
            "expected_facts": [
                "MLflow is a tool for managing and tracking machine learning experiments."
            ],
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Correctness(),
        RelevanceToQuery(),
        # Guidelines is a special scorer that takes user-defined criteria for evaluation.
        # See the "Customizing LLM Judges" section below for more details.
        Guidelines(
            name="is_concise",
            guidelines="The answer must be concise and straight to the point.",
        ),
    ],
)
(Screenshot: predefined LLM scorers result in the evaluation UI)

Available Scorers

Single-Turn Scorers

| Scorer | What does it evaluate? | Requires ground-truth? | Requires traces? |
|---|---|---|---|
| RelevanceToQuery | Does the app's response directly address the user's input? | No | No |
| Correctness | Are the expected facts supported by the app's response? | Yes* | No |
| Completeness** | Does the agent address all questions in a single user prompt? | No | No |
| Fluency | Is the response grammatically correct and naturally flowing? | No | No |
| Guidelines | Does the response adhere to provided guidelines? | Yes* | No |
| ExpectationsGuidelines | Does the response meet specific expectations and guidelines? | Yes* | No |
| Safety | Does the app's response avoid harmful or toxic content? | No | No |
| Equivalence | Is the app's response equivalent to the expected output? | Yes | No |
| RetrievalGroundedness | Is the app's response grounded in retrieved information? | No | ⚠️ Trace Required |
| RetrievalRelevance | Are retrieved documents relevant to the user's request? | No | ⚠️ Trace Required |
| RetrievalSufficiency | Do retrieved documents contain all necessary information? | Yes | ⚠️ Trace Required |

*Can extract expectations from trace assessments if available.

**Indicates experimental features that may change in future releases.

Multi-Turn Scorers

Multi-turn scorers evaluate entire conversation sessions rather than individual turns. They require traces with session IDs and are experimental as of MLflow 3.7.0.

| Scorer | What does it evaluate? | Requires Session? |
|---|---|---|
| ConversationCompleteness** | Does the agent address all user questions throughout the conversation? | Yes |
| ConversationalRoleAdherence** | Does the assistant maintain its assigned role throughout the conversation? | Yes |
| ConversationalSafety** | Are the assistant's responses safe and free of harmful content? | Yes |
| ConversationalToolCallEfficiency** | Was tool usage across the conversation efficient and appropriate? | Yes |
| KnowledgeRetention** | Does the assistant correctly retain information from earlier user inputs? | Yes |
| UserFrustration** | Is the user frustrated? Was the frustration resolved? | Yes |
Multi-Turn Evaluation Requirements

Multi-turn scorers require:

  1. Session IDs: Traces must have mlflow.trace.session metadata
  2. List or DataFrame input: Currently only supports pre-collected traces (no predict_fn support yet)

See the Evaluate Conversations section below for detailed usage examples.
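
For orientation, here is a minimal sketch of a multi-turn evaluation. It assumes your application already logs traces with the mlflow.trace.session metadata key and that the experiment ID placeholder is filled in; treat the trace-collection details as illustrative rather than prescriptive.

python
import mlflow
from mlflow.genai.scorers import ConversationCompleteness, UserFrustration

# Collect traces that carry a session ID. Inside your traced application you
# would set it with something like (illustrative; see the tracing docs):
#   mlflow.update_current_trace(metadata={"mlflow.trace.session": "session-123"})
traces = mlflow.search_traces(experiment_ids=["<your-experiment-id>"])

# Multi-turn scorers consume pre-collected traces directly;
# predict_fn is not supported yet.
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[ConversationCompleteness(), UserFrustration()],
)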

Availability

Safety and RetrievalRelevance scorers are currently only available in Databricks managed MLflow and will be open-sourced soon.

Using Traces with Built-in Scorers

All built-in scorers, such as Guidelines, RelevanceToQuery, Safety, Correctness, and ExpectationsGuidelines, can extract inputs and outputs directly from traces:

python
import mlflow
from mlflow.genai.scorers import Correctness

trace = mlflow.get_trace("<your-trace-id>")
scorer = Correctness()

# Extracts inputs/outputs from trace automatically
result = scorer(trace=trace)

# Override specific fields as needed
result = scorer(trace=trace, expectations={"expected_facts": ["Custom fact"]})

Automatic Fallback for Complex Traces

For complex traces, or traces that do not contain inputs and outputs in the root span, the scorer falls back to tool calling so the LLM judge can retrieve the trace information it needs.

Retrieval Scorers Require Traces

Retrieval scorers will NOT work with static pandas DataFrames that only contain inputs/outputs/expectations fields.

These scorers require:

  1. Active traces with spans of type RETRIEVER
  2. Either a predict_fn that generates traces during evaluation, OR pre-collected traces in your dataset

Common Error: If you're trying to use retrieval scorers with a static dataset and getting errors about missing traces or RETRIEVER spans, you need to either:

  • Switch to scorers that work with static data (those marked "No" in the "Requires traces?" column above)
  • Modify your evaluation to use a predict_fn that generates traces (a minimal sketch follows this list)
  • Use automatic tracing integration with your application
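
As an illustration of the predict_fn approach, here is a minimal sketch. The retrieve_docs helper and its hard-coded document are hypothetical; the important part is that the retrieval step runs inside a span of type RETRIEVER, which is what the retrieval scorers look for.

python
import mlflow
from mlflow.entities import SpanType
from mlflow.genai.scorers import RetrievalGroundedness

# Hypothetical retriever; the RETRIEVER span type is what the scorer inspects.
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_docs(query: str) -> list[dict]:
    return [{"page_content": "MLflow Tracking logs parameters, metrics, and artifacts."}]

@mlflow.trace
def predict_fn(query: str) -> str:
    docs = retrieve_docs(query)
    # A real app would generate the answer from the retrieved docs.
    return "MLflow Tracking logs parameters, metrics, and artifacts."

results = mlflow.genai.evaluate(
    data=[{"inputs": {"query": "What does MLflow Tracking log?"}}],
    predict_fn=predict_fn,
    scorers=[RetrievalGroundedness()],
)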

Selecting Judge Models

MLflow supports judge models from all major LLM providers, including OpenAI, Anthropic, Google, and xAI.

See Supported Models for more details.
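
Assuming the predefined scorer classes accept a model argument in the provider-prefixed URI format shown under Output Format below (the parameter name and URI strings here are assumptions; confirm them on the Supported Models page), selecting a judge model might look like this:

python
from mlflow.genai.scorers import Correctness, RelevanceToQuery

scorers = [
    # Same provider-prefixed URI format as the source_id shown in Output Format.
    Correctness(model="openai:/gpt-4o-mini"),
    # Hypothetical placeholder; substitute a model from your provider.
    RelevanceToQuery(model="anthropic:/<model-name>"),
]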

Output Format

Predefined LLM-based scorers in MLflow return structured assessments with three key components:

  • Score: Binary output ("yes"/"no"), rendered as Pass or Fail in the UI
  • Rationale: Detailed explanation of why the judge made its decision
  • Source: Metadata about the evaluation source
text
score: "yes"  # or "no"
rationale: "The response accurately addresses the user's question about machine learning concepts, providing clear definitions and relevant examples. The information is factually correct and well-structured."
source: AssessmentSource(
    source_type="LLM_JUDGE",
    source_id="openai:/gpt-4o-mini"
)
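
When you call a scorer directly (as in the trace example above), these components are available on the returned assessment object. A minimal sketch, assuming the value, rationale, and source attribute names mirror the fields shown above:

python
import mlflow
from mlflow.genai.scorers import Correctness

feedback = Correctness()(trace=mlflow.get_trace("<your-trace-id>"))

# Attribute names assumed from the structure shown above.
print(feedback.value)      # "yes" or "no"
print(feedback.rationale)  # the judge's explanation
print(feedback.source)     # e.g. AssessmentSource(source_type="LLM_JUDGE", ...)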
Why Binary Scores?

Binary scoring provides clearer, more consistent evaluations compared to numeric scales (1-5). Research shows that LLMs produce more reliable judgments when asked to make binary decisions rather than rating on a scale. Binary outputs also simplify threshold-based decision making in production systems.

Evaluate Conversations

Multi-turn scorers evaluate entire conversation sessions rather than individual turns. For detailed information on how to use conversation evaluation, including setup, examples, and best practices, see the Evaluate Conversations guide.

Next Steps