
What are Scorers?

Scorers are key components of the MLflow GenAI evaluation framework. They provide a unified interface to define evaluation criteria for your models, agents, and applications.

Scorers can be thought of as metrics in the traditional ML sense. However, they are more flexible: they can return structured quality feedback rather than only the scalar values that traditional metrics typically produce.

Key Features of MLflow Scorers

MLflow scorers have two powerful capabilities that distinguish them from traditional metrics:

  1. Agent-as-a-Judge Evaluation: Scorers can act as autonomous agents with tool-calling capabilities, enabling deep analysis of execution traces and multi-step workflows
  2. Human Preference Alignment: Scorers can be automatically aligned with human feedback to improve their accuracy and match your domain-specific quality standards

How Scorers Work

Scorers analyze inputs, outputs, and traces from your GenAI application and produce quality assessments. Here's the flow, with a minimal code sketch after the list:

  1. You provide a dataset of inputs (and optionally other columns such as expectations).
  2. MLflow runs your predict_fn to generate outputs and traces for each row in the dataset. Alternatively, you can provide outputs and traces directly in the dataset and omit the predict function.
  3. Scorers receive the inputs, outputs, expectations, and traces (or a subset of them) and produce scores and metadata such as explanations and source information.
  4. MLflow aggregates the scorer results and saves them. You can analyze the results in the UI.
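
For illustration, here is a minimal sketch of this flow using mlflow.genai.evaluate with the built-in Correctness and RelevanceToQuery scorers. The dataset contents and predict_fn below are placeholder assumptions; substitute your own application and fields:

import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery

# 1. A small evaluation dataset: each row provides `inputs`
#    (and optionally `expectations` for ground-truth-based scorers).
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {
            "expected_response": "MLflow is an open-source platform for the ML lifecycle."
        },
    },
]

# 2. The predict function MLflow calls for each row to generate outputs and traces.
def predict_fn(question: str) -> str:
    # Placeholder: replace with a call to your model, agent, or application.
    return "MLflow is an open-source platform for managing the ML lifecycle."

# 3-4. Scorers receive the inputs, outputs, expectations, and traces, and MLflow
#      aggregates and stores the results for analysis in the UI.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[Correctness(), RelevanceToQuery()],
)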

Evaluation Approaches Comparison

You can evaluate the quality of your GenAI application in several ways. Here's a comparison of the different approaches and how you can use them in MLflow. Click on the guide links to learn more about each approach; a sketch of a heuristic custom scorer follows the table:

| Type | Heuristic | LLM-based | Human |
| --- | --- | --- | --- |
| Description | Deterministic metrics such as exact_match, BLEU. | Let LLMs judge subjective qualities such as Correctness. | Ask domain experts or users to provide feedback. |
| Cost | Minimal | API costs for LLM calls | High (human time) |
| Scalability | Highly scalable | Scalable | Limited |
| Consistency | Perfect consistency | Somewhat consistent (if prompted well) | Variable (inter-annotator agreement) |
| Flexibility | Limited to predefined patterns | Highly flexible with custom prompts | Maximum flexibility |
| MLflow Guide | Custom Scorers | LLM-based Scorers | Collecting Human Feedback |
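
As a sketch of the heuristic approach, the @scorer decorator lets you register a plain deterministic function as a custom scorer. The exact_match logic below is an illustrative assumption, not a built-in scorer:

from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def exact_match(outputs: str, expectations: dict) -> Feedback:
    # Deterministic, heuristic check: the output must equal the expected response.
    expected = expectations.get("expected_response", "")
    matched = outputs.strip() == expected.strip()
    return Feedback(
        value=matched,
        rationale=(
            "Output matches the expected response exactly."
            if matched
            else "Output differs from the expected response."
        ),
    )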

Scorer Data Requirements

Important: Agent-as-a-Judge vs Static Data Compatibility

Not all scorers work with static datasets. Some scorers are designed specifically to analyze execution traces and will not function with static pandas DataFrames that only contain inputs/outputs/expectations.

Scorer Compatibility Matrix

| Scorer Category | Trace Required | Example Scorers |
| --- | --- | --- |
| Field-based Scorers | ❌ No | Correctness, RelevanceToQuery, Safety, custom scorers using @scorer |
| Agent-as-a-Judge Scorers | ✅ Yes | Custom judges analyzing execution flow, tool usage patterns, span attributes, and multi-step agent reasoning |
| Retrieval Scorers | ✅ Yes (with RETRIEVER spans) | RetrievalGroundedness, RetrievalRelevance, RetrievalSufficiency |
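
For the trace-required rows above, a custom scorer can declare a trace parameter and inspect the recorded spans. The following is a minimal sketch; the RETRIEVER-span check is an illustrative example of trace analysis, not a built-in scorer:

from mlflow.entities import Feedback, SpanType, Trace
from mlflow.genai.scorers import scorer

@scorer
def used_retriever(trace: Trace) -> Feedback:
    # Trace-based check: did the application call a retriever during execution?
    retriever_spans = [
        span for span in trace.data.spans if span.span_type == SpanType.RETRIEVER
    ]
    return Feedback(
        value=len(retriever_spans) > 0,
        rationale=f"Found {len(retriever_spans)} RETRIEVER span(s) in the trace.",
    )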

Working with Traces

If you have static data but need to use trace-based scorers, one option is to replay that data through a function instrumented with MLflow tracing (for example, the @mlflow.trace decorator) so that each row produces a trace:

import mlflow

# Wrap a lightweight "replay" function with MLflow tracing so that a static
# question/answer pair is recorded as a trace.
@mlflow.trace
def replay(question: str) -> dict:
    # Return the pre-computed answer from your static dataset.
    return {
        "answer": "MLflow is an open-source platform for ML lifecycle management."
    }

replay("What is MLflow?")

# The call above is logged as a trace in the active MLflow experiment,
# so trace-based scorers can now analyze this data.

Alternatively, if you're using retrieval or Agent-as-a-Judge scorers, ensure your predict_fn generates proper traces during execution.
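
A minimal sketch of such a predict_fn, instrumented with the @mlflow.trace decorator and an explicit RETRIEVER span; the retrieval and generation steps below are placeholders for your own logic:

import mlflow
from mlflow.entities import SpanType

@mlflow.trace  # records a trace for every call to predict_fn
def predict_fn(question: str) -> dict:
    with mlflow.start_span(name="retrieve_docs", span_type=SpanType.RETRIEVER) as span:
        span.set_inputs({"query": question})
        # Placeholder retrieval step; replace with your vector store lookup.
        docs = ["MLflow is an open-source platform for the ML lifecycle."]
        span.set_outputs(docs)

    # Placeholder generation step; replace with your LLM call.
    return {"answer": docs[0]}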

Next Steps