Custom Code-based Scorers
Custom scorers give you full control over how your GenAI application's quality is measured. You can define evaluation metrics tailored to your specific business use case, whether they are based on simple heuristics, advanced logic, or programmatic evaluations.
Example Usage
To define a custom scorer, write a function that takes the evaluation inputs you need and add the @scorer decorator to it.
from mlflow.genai import scorer
@scorer
def exact_match(outputs: str, expectations: dict) -> bool:
return outputs == expectations["expected_response"]
To return richer information beyond primitive values, you can return a Feedback object.
from mlflow.entities import Feedback
@scorer
def is_short(outputs: str) -> Feedback:
score = len(outputs.split()) <= 5
rationale = (
"The response is short enough."
if score
else f"The response is not short enough because it has ({len(outputs.split())} words)."
)
return Feedback(value=score, rationale=rationale)
Then pass these functions directly to mlflow.genai.evaluate, just like the predefined or LLM-based scorers.
import mlflow
eval_dataset = [
{
"inputs": {"question": "How many countries are there in the world?"},
"outputs": "195",
"expectations": {"expected_response": "195"},
},
{
"inputs": {"question": "What is the capital of France?"},
"outputs": "The capital of France is Paris.",
"expectations": {"expected_response": "Paris"},
},
]
mlflow.genai.evaluate(
data=eval_dataset,
scorers=[exact_match, is_short],
)

Input Format
As input, custom scorers have access to:
- The inputs dictionary, derived from either the input dataset or MLflow post-processing of your trace.
- The outputs value, derived from either the input dataset or the trace. If predict_fn is provided, the outputs value will be the return value of predict_fn.
- The expectations dictionary, derived from the expectations field in the input dataset, or associated with the trace.
- The complete MLflow trace, including spans, attributes, and outputs.
@scorer
def my_scorer(
*,
inputs: dict[str, Any],
outputs: Any,
expectations: dict[str, Any],
trace: Trace,
) -> float | bool | str | Feedback | list[Feedback]:
# Your evaluation logic here
...
All parameters are optional; declare only what your scorer needs:
# ✔️ All of these signatures are valid for scorers
def my_scorer(inputs, outputs, expectations, trace) -> bool:
def my_scorer(inputs, outputs) -> str:
def my_scorer(outputs, expectations) -> Feedback:
def my_scorer(trace) -> list[Feedback]:
# 🔴 Additional parameters are not allowed
def my_scorer(inputs, outputs, expectations, trace, additional_param) -> float:
When running mlflow.genai.evaluate(), the inputs, outputs, and expectations parameters can be specified in the data argument, or parsed from the trace. See How Scorers Work for more details.
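For example, the sketch below runs a scorer against outputs produced by a predict_fn. It assumes predict_fn receives each row's inputs dictionary as keyword arguments; the qa_agent function is a made-up stand-in for your application.

import mlflow
from mlflow.genai import scorer


@scorer
def mentions_paris(outputs: str) -> bool:
    return "Paris" in outputs


# Hypothetical stand-in for your application. Its parameters are expected to
# match the keys of each row's "inputs" dictionary.
def qa_agent(question: str) -> str:
    return f"The capital of France is Paris. (You asked: {question})"


mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is the capital of France?"}}],
    predict_fn=qa_agent,  # its return value becomes the outputs seen by scorers
    scorers=[mentions_paris],
)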
Return Types
Scorers can return different types depending on your evaluation needs:
Simple values
Return primitive values for straightforward pass/fail or numeric assessments.
- Pass/fail strings: "yes" or "no" render as Pass or Fail in the UI
- Boolean values: True or False for binary evaluations
- Numeric values: Integers or floats for scores, counts, or measurements
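For example, here is a minimal sketch of scorers that return primitive values directly (both scorer names and the checks are illustrative):

from mlflow.genai import scorer


@scorer
def cites_source(outputs: str) -> str:
    # Pass/fail string: rendered as Pass or Fail in the UI
    return "yes" if "source:" in outputs.lower() else "no"


@scorer
def word_count(outputs: str) -> int:
    # Numeric value: displayed as a metric in the evaluation results
    return len(outputs.split())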
Rich feedback
Return Feedback objects for detailed assessments that include additional metadata such as a rationale, the assessment source, and error information.
from mlflow.entities import Feedback, AssessmentSource
from mlflow.genai import scorer
@scorer
def content_quality(outputs):
return Feedback(
value=0.85, # Can be numeric, boolean, or string
rationale="Clear and accurate, minor grammar issues",
# Optional: source of the assessment. Several source types are supported,
# such as "HUMAN", "CODE", "LLM_JUDGE".
source=AssessmentSource(source_type="CODE", source_id="grammar_checker_v1"),
# Optional: additional metadata about the assessment.
metadata={
"annotator": "me@example.com",
},
)
Multiple feedback objects can be returned as a list. Each feedback object will be displayed as a separate metric in the evaluation results.
from mlflow.entities import Feedback
from mlflow.genai import scorer


@scorer
def comprehensive_check(inputs, outputs):
return [
Feedback(name="relevance", value=True, rationale="Directly addresses query"),
Feedback(name="tone", value="professional", rationale="Appropriate for audience"),
Feedback(name="length", value=150, rationale="Word count within limits")
]
Parsing Traces for Scoring
Scorers have access to the complete MLflow trace, including spans, attributes, and outputs, allowing you to evaluate the agent's behavior precisely, not just its final output.
The Trace.search_spans API is a powerful way to retrieve such intermediate information from the trace.
The examples below show custom scorers that evaluate the detailed behavior of agents by parsing the trace:
- Retrieved Document Recall
- Tool Call Trajectory
- Sub-Agent Routing
Example 1: Evaluating Retrieved Documents Recall
from mlflow.entities import Feedback, SpanType, Trace
from mlflow.genai import scorer
@scorer
def retrieved_document_recall(trace: Trace, expectations: dict) -> Feedback:
# Search for retriever spans in the trace
retriever_spans = trace.search_spans(span_type=SpanType.RETRIEVER)
# If there are no retriever spans
if not retriever_spans:
return Feedback(
value=0,
rationale="No retriever span found in the trace.",
)
# Gather all retrieved document URLs from the retriever spans
all_document_urls = []
for span in retriever_spans:
all_document_urls.extend([document["doc_uri"] for document in span.outputs])
# Compute the recall
true_positives = len(
set(all_document_urls) & set(expectations["relevant_document_urls"])
)
expected_positives = len(expectations["relevant_document_urls"])
recall = true_positives / expected_positives
return Feedback(
value=recall,
rationale=f"Retrieved {true_positives} relevant documents out of {expected_positives} expected.",
)
Example 2: Evaluating Tool Call Trajectory
from mlflow.entities import Feedback, SpanType, Trace
from mlflow.genai import scorer
@scorer
def tool_call_trajectory(trace: Trace, expectations: dict) -> Feedback:
# Search for tool call spans in the trace
tool_call_spans = trace.search_spans(span_type=SpanType.TOOL)
# Compare the tool trajectory with expectations
actual_trajectory = [span.name for span in tool_call_spans]
expected_trajectory = expectations["tool_call_trajectory"]
if actual_trajectory == expected_trajectory:
return Feedback(value=1, rationale="The tool call trajectory is correct.")
else:
return Feedback(
value=0,
rationale=(
"The tool call trajectory is incorrect.\n"
f"Expected: {expected_trajectory}.\n"
f"Actual: {actual_trajectory}."
),
)
Example 3: Evaluating Sub-Agent Routing
from mlflow.entities import Feedback, SpanType, Trace
from mlflow.genai import scorer
@scorer
def is_routing_correct(trace: Trace, expectations: dict) -> Feedback:
# Search for sub-agent spans in the trace
sub_agent_spans = trace.search_spans(span_type=SpanType.AGENT)
invoked_agents = [span.name for span in sub_agent_spans]
expected_agents = expectations["expected_agents"]
if invoked_agents == expected_agents:
        return Feedback(value=True, rationale="The sub-agent routing is correct.")
else:
return Feedback(
value=False,
rationale=(
"The sub-agents routing is incorrect.\n"
f"Expected: {expected_agents}.\n"
f"Actual: {invoked_agents}."
),
)
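Trace-based scorers like these can also be run over previously logged traces, for example traces fetched with mlflow.search_traces. A rough sketch, assuming the traces live in an existing experiment and carry the expectations the scorers read (the experiment ID is a placeholder):

import mlflow

# Retrieve logged traces as a pandas DataFrame; the experiment ID is a placeholder.
traces = mlflow.search_traces(experiment_ids=["<experiment_id>"])

# Each row's trace, along with any expectations associated with it, is passed to the scorers.
mlflow.genai.evaluate(
    data=traces,
    scorers=[retrieved_document_recall, tool_call_trajectory, is_routing_correct],
)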
Error handling
When a scorer encounters an error, MLflow provides two approaches:
Let exceptions propagate (recommended)
The simplest approach is to let exceptions propagate naturally. MLflow automatically captures the exception and creates a Feedback object with the error details:
import json
import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer
@scorer
def is_valid_response(outputs: str) -> Feedback:
# Let json.JSONDecodeError propagate if response isn't valid JSON
data = json.loads(outputs)
# Let KeyError propagate if required fields are missing
summary = data["summary"]
confidence = data["confidence"]
return Feedback(value=True, rationale=f"Valid JSON with confidence: {confidence}")
# Run the scorer on invalid data that triggers exceptions
invalid_data = [
{
# Valid JSON
"outputs": '{"summary": "this is a summary", "confidence": 0.95}'
},
{
# Invalid JSON
"outputs": "invalid json",
},
{
# Missing required fields
"outputs": '{"summary": "this is a summary"}'
},
]
mlflow.genai.evaluate(
data=invalid_data,
scorers=[is_valid_response],
)
When an exception occurs, MLflow creates a Feedback with:
- value: None
- error: The exception details, such as the exception object, error message, and stack trace
The error information will be displayed in the evaluation results. Open the corresponding row to see the error details.

Handle exceptions explicitly
For custom error handling or to provide specific error messages, catch the exception and return a Feedback with explicit error details (its value will be None):
import json
from mlflow.entities import AssessmentError, Feedback
from mlflow.genai import scorer
@scorer
def is_valid_response(outputs):
try:
data = json.loads(outputs)
required_fields = ["summary", "confidence", "sources"]
missing = [f for f in required_fields if f not in data]
if missing:
# Specify the AssessmentError object explicitly
return Feedback(
error=AssessmentError(
error_code="MISSING_REQUIRED_FIELDS",
error_message=f"Missing required fields: {missing}",
),
)
return Feedback(value=True, rationale="Valid JSON with all required fields")
except json.JSONDecodeError as e:
# Can pass exception object directly to the error parameter as well
return Feedback(error=e)
Next Steps
Evaluate Agents
Learn how to evaluate AI agents with specialized techniques and scorers
Evaluate Traces
Evaluate production traces to understand and improve your AI application's behavior
Ground Truth Expectations
Learn how to define and manage ground truth data for accurate evaluations