What are Scorers?
Scorers are key components of the MLflow GenAI evaluation framework. They provide a unified interface to define evaluation criteria for your models, agents, and applications.
Scorers can be thought of as metrics in the traditional ML sense. However, they are more flexible: they can return structured quality feedback, not just the scalar values that traditional metrics typically produce.
How scorers work
Scorers analyze inputs, outputs, and traces from your GenAI application and produce quality assessments. Here's the flow:
- You provide a dataset of `inputs` (and optionally other columns such as `expectations`).
- MLflow runs your `predict_fn` to generate `outputs` and `traces` for each row in the dataset. Alternatively, you can provide outputs and traces directly in the dataset and omit the predict function.
- Scorers receive the `inputs`, `outputs`, `expectations`, and `traces` (or a subset of them) and produce scores and metadata such as explanations and source information.
- MLflow aggregates the scorer results and saves them. You can analyze the results in the UI.
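The flow above can be sketched in plain Python. This is a simplified, dependency-free illustration of what the evaluation loop does conceptually, not MLflow's actual implementation or API; all names here are hypothetical.

```python
# Illustrative sketch of the evaluation flow (not MLflow's actual code).

def run_evaluation(dataset, predict_fn, scorers):
    """Run predict_fn over each row, apply every scorer, and aggregate."""
    per_row_scores = []
    for row in dataset:
        # 1. Generate outputs (MLflow would also capture a trace here).
        outputs = predict_fn(row["inputs"])
        # 2. Each scorer sees inputs/outputs/expectations and returns a score.
        per_row_scores.append({
            name: fn(inputs=row["inputs"], outputs=outputs,
                     expectations=row.get("expectations"))
            for name, fn in scorers.items()
        })
    # 3. Aggregate per-scorer averages across the dataset.
    return {
        name: sum(r[name] for r in per_row_scores) / len(per_row_scores)
        for name in scorers
    }

# Toy example: an echo "model" scored by exact match.
dataset = [
    {"inputs": "hi", "expectations": "hi"},
    {"inputs": "bye", "expectations": "ciao"},
]
scorers = {
    "exact_match": lambda inputs, outputs, expectations: float(outputs == expectations)
}
print(run_evaluation(dataset, lambda x: x, scorers))  # {'exact_match': 0.5}
```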
Evaluation Approaches Comparison
You can evaluate the quality of your GenAI application in several ways. Here's a comparison of the approaches and how to use them in MLflow. Click the guide links to learn more about each approach:
| Type | Heuristic | LLM-based | Human |
|---|---|---|---|
| Description | Deterministic metrics such as `exact_match`, `BLEU`. | Let LLMs judge subjective qualities such as Correctness. | Ask domain experts or users to provide feedback. |
| Cost | Minimal | API costs for LLM calls | High (human time) |
| Scalability | Highly scalable | Scalable | Limited |
| Consistency | Perfect consistency | Somewhat consistent (if prompted well) | Variable (inter-annotator agreement) |
| Flexibility | Limited to predefined patterns | Highly flexible with custom prompts | Maximum flexibility |
| MLflow Guide | Custom Scorers | LLM-based Scorers | Collecting Human Feedback |
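As a concrete instance of the heuristic column, here is an exact-match scorer written as a plain function. In MLflow you would typically register such logic as a custom scorer; the decorator is omitted here so the sketch stays dependency-free, and the normalization choice is an illustrative assumption.

```python
def exact_match(outputs: str, expectations: str) -> bool:
    """Deterministic heuristic scorer: does the output match the expected answer?

    Normalizing case and surrounding whitespace is an illustrative choice;
    a stricter scorer could compare the raw strings directly.
    """
    return outputs.strip().lower() == expectations.strip().lower()

print(exact_match("  MLflow ", "mlflow"))  # True
print(exact_match("MLflow 3", "mlflow"))   # False
```

Because the check is deterministic, it costs nothing per call and scores the same pair identically every time, which is exactly the cost/consistency trade-off the table describes.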
Scorer Data Requirements
Not all scorers work with static datasets. Some scorers are designed specifically to analyze execution traces and will not function with static pandas DataFrames that only contain inputs/outputs/expectations.
Scorer Compatibility Matrix
| Scorer Category | Static Data Support | Trace Required | Example Scorers |
|---|---|---|---|
| Field-based Scorers | ✅ Yes | ❌ No | `Correctness`, `RelevanceToQuery`, `Safety`, custom scorers using `@scorer` |
| Trace-based Scorers | ❌ No | ✅ Yes | Scorers analyzing execution flow, tool usage patterns, span attributes |
| Retrieval Scorers | ❌ No | ✅ Yes (with RETRIEVER spans) | `RetrievalGroundedness`, `RetrievalRelevance`, `RetrievalSufficiency` |
| Agent Behavior Scorers | ❌ No | ✅ Yes | Custom scorers analyzing multi-step workflows, tool trajectories |
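To make the trace-based rows concrete, here is a sketch of a scorer that inspects retrieval spans. The dict-shaped trace is a stand-in for a real MLflow `Trace` object, used so the example runs without dependencies; a real scorer would read spans through the Trace API.

```python
def retrieved_doc_count(trace: dict) -> int:
    """Trace-based scorer sketch: count documents returned by RETRIEVER spans.

    The trace shape (a dict holding a list of span dicts) is an illustrative
    stand-in for a real MLflow Trace object.
    """
    docs = 0
    for span in trace["spans"]:
        if span["span_type"] == "RETRIEVER":
            docs += len(span["outputs"])
    return docs

trace = {
    "spans": [
        {"span_type": "RETRIEVER", "outputs": ["doc_a", "doc_b"]},
        {"span_type": "LLM", "outputs": ["final answer"]},
    ]
}
print(retrieved_doc_count(trace))  # 2
```

Note why static datasets cannot feed this scorer: nothing in an inputs/outputs/expectations row records which spans ran or what they returned.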
Key Points:
- Static datasets (pandas DataFrames with inputs/outputs/expectations) work only with field-based scorers
- Trace-based scorers require actual execution traces from your application
- If you see errors about missing traces when using retrieval or agent scorers, you need to:
  - Use a `predict_fn` that generates traces, OR
  - Include pre-collected traces in your dataset
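The second option, supplying pre-collected traces, amounts to adding a trace column to your dataset so no predict function is needed at evaluation time. A minimal sketch, where the trace dicts are illustrative stand-ins for real MLflow `Trace` objects:

```python
# Sketch: a dataset row carrying a pre-collected trace alongside its
# inputs and outputs. The trace dict is an illustrative stand-in for a
# real MLflow Trace object.
dataset = [
    {
        "inputs": "What is MLflow?",
        "outputs": "MLflow is an open source MLOps platform.",
        "trace": {"spans": [{"span_type": "RETRIEVER", "outputs": ["doc_a"]}]},
    },
]

def has_retriever_span(trace: dict) -> bool:
    """A trace-based scorer can now run against each row's trace directly."""
    return any(s["span_type"] == "RETRIEVER" for s in trace["spans"])

print([has_retriever_span(row["trace"]) for row in dataset])  # [True]
```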