What are Scorers?
Scorers are key components of the MLflow GenAI evaluation framework. They provide a unified interface to define evaluation criteria for your models, agents, and applications.
Scorers can be thought of as metrics in the traditional ML sense. However, they are more flexible: they can return structured quality feedback, not just the scalar values that traditional metrics typically produce.
How scorers work
Scorers analyze inputs, outputs, and traces from your GenAI application and produce quality assessments. Here's the flow:
- You provide a dataset of `inputs` (and optionally other columns such as `expectations`).
- MLflow runs your `predict_fn` to generate `outputs` and `traces` for each row in the dataset. Alternatively, you can provide outputs and traces directly in the dataset and omit the predict function.
- Scorers receive the `inputs`, `outputs`, `expectations`, and `traces` (or a subset of them) and produce scores and metadata such as explanations and source information.
- MLflow aggregates the scorer results and saves them. You can analyze the results in the UI.
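The flow above can be sketched in plain Python. This is a simplified, dependency-free illustration of what the evaluation loop does conceptually, not MLflow's actual implementation or API; all names here are hypothetical.

```python
# Illustrative sketch of the evaluation flow (not MLflow's actual code).

def run_evaluation(dataset, predict_fn, scorers):
    """Run predict_fn over each row, apply every scorer, and aggregate."""
    per_row_scores = []
    for row in dataset:
        # 1. Generate outputs (MLflow would also capture a trace here).
        outputs = predict_fn(row["inputs"])
        # 2. Each scorer sees inputs/outputs/expectations and returns a score.
        per_row_scores.append({
            name: fn(inputs=row["inputs"], outputs=outputs,
                     expectations=row.get("expectations"))
            for name, fn in scorers.items()
        })
    # 3. Aggregate per-scorer averages across the dataset.
    return {
        name: sum(r[name] for r in per_row_scores) / len(per_row_scores)
        for name in scorers
    }

# Toy example: an echo "model" scored by exact match.
dataset = [
    {"inputs": "hi", "expectations": "hi"},
    {"inputs": "bye", "expectations": "ciao"},
]
scorers = {
    "exact_match": lambda inputs, outputs, expectations: float(outputs == expectations)
}
print(run_evaluation(dataset, lambda x: x, scorers))  # {'exact_match': 0.5}
```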
Evaluation Approaches Comparison
You can evaluate the quality of your GenAI application in several ways. Here's a comparison of the approaches and how to use them in MLflow. Click the guide links to learn more about each approach:
| Type | Heuristic | LLM-based | Human |
|---|---|---|---|
| Description | Deterministic metrics such as `exact_match`, `BLEU`. | Let LLMs judge subjective qualities such as Correctness. | Ask domain experts or users to provide feedback. |
| Cost | Minimal | API costs for LLM calls | High (human time) |
| Scalability | Highly scalable | Scalable | Limited |
| Consistency | Perfect consistency | Somewhat consistent (if prompted well) | Variable (inter-annotator agreement) |
| Flexibility | Limited to predefined patterns | Highly flexible with custom prompts | Maximum flexibility |
| MLflow Guide | Custom Scorers | LLM-based Scorers | Collecting Human Feedback |
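As a concrete instance of the heuristic column, here is an exact-match scorer written as a plain function. In MLflow you would typically register such logic as a custom scorer; the decorator is omitted here so the sketch stays dependency-free, and the normalization choice is an illustrative assumption.

```python
def exact_match(outputs: str, expectations: str) -> bool:
    """Deterministic heuristic scorer: does the output match the expected answer?

    Normalizing case and surrounding whitespace is an illustrative choice;
    a stricter scorer could compare the raw strings directly.
    """
    return outputs.strip().lower() == expectations.strip().lower()

print(exact_match("  MLflow ", "mlflow"))  # True
print(exact_match("MLflow 3", "mlflow"))   # False
```

Because the check is deterministic, it costs nothing per call and scores the same pair identically every time, which is exactly the cost/consistency trade-off the table describes.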
Scorer Data Requirements
Not all scorers work with static datasets. Some scorers are designed specifically to analyze execution traces and will not function with static pandas DataFrames that only contain inputs/outputs/expectations.
Scorer Compatibility Matrix
| Scorer Category | Static Data Support | Trace Required | Example Scorers |
|---|---|---|---|
| Field-based Scorers | ✅ Yes | ❌ No | `Correctness`, `RelevanceToQuery`, `Safety`, custom scorers using `@scorer` |
| Trace-based Scorers | ❌ No | ✅ Yes | Scorers analyzing execution flow, tool usage patterns, span attributes |
| Retrieval Scorers | ❌ No | ✅ Yes (with RETRIEVER spans) | `RetrievalGroundedness`, `RetrievalRelevance`, `RetrievalSufficiency` |
| Agent Behavior Scorers | ❌ No | ✅ Yes | Custom scorers analyzing multi-step workflows, tool trajectories |
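To make the trace-based rows concrete, here is a sketch of a scorer that inspects retrieval spans. The dict-shaped trace is a stand-in for a real MLflow `Trace` object, used so the example runs without dependencies; a real scorer would read spans through the Trace API.

```python
def retrieved_doc_count(trace: dict) -> int:
    """Trace-based scorer sketch: count documents returned by RETRIEVER spans.

    The trace shape (a dict holding a list of span dicts) is an illustrative
    stand-in for a real MLflow Trace object.
    """
    docs = 0
    for span in trace["spans"]:
        if span["span_type"] == "RETRIEVER":
            docs += len(span["outputs"])
    return docs

trace = {
    "spans": [
        {"span_type": "RETRIEVER", "outputs": ["doc_a", "doc_b"]},
        {"span_type": "LLM", "outputs": ["final answer"]},
    ]
}
print(retrieved_doc_count(trace))  # 2
```

Note why static datasets cannot feed this scorer: nothing in an inputs/outputs/expectations row records which spans ran or what they returned.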
Key Points:
- Static datasets (pandas DataFrames with inputs/outputs/expectations) work only with field-based scorers
- Trace-based scorers require actual execution traces from your application
- If you see errors about missing traces when using retrieval or agent scorers, you need to:
  - Use a `predict_fn` that generates traces, OR
  - Include pre-collected traces in your dataset
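The second option, supplying pre-collected traces, amounts to adding a trace column to your dataset so no predict function is needed at evaluation time. A minimal sketch, where the trace dicts are illustrative stand-ins for real MLflow `Trace` objects:

```python
# Sketch: a dataset row carrying a pre-collected trace alongside its
# inputs and outputs. The trace dict is an illustrative stand-in for a
# real MLflow Trace object.
dataset = [
    {
        "inputs": "What is MLflow?",
        "outputs": "MLflow is an open source MLOps platform.",
        "trace": {"spans": [{"span_type": "RETRIEVER", "outputs": ["doc_a"]}]},
    },
]

def has_retriever_span(trace: dict) -> bool:
    """A trace-based scorer can now run against each row's trace directly."""
    return any(s["span_type"] == "RETRIEVER" for s in trace["spans"])

print([has_retriever_span(row["trace"]) for row in dataset])  # [True]
```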