Scorer Concepts

What are Scorers?

Scorers in MLflow are evaluation functions that assess the quality of your GenAI application outputs. They provide a systematic way to measure performance across different dimensions like correctness, relevance, safety, and adherence to guidelines.

Scorers transform subjective quality assessments into measurable metrics, enabling you to track performance, compare models, and ensure your applications meet quality standards. They range from simple rule-based checks to sophisticated LLM judges that can evaluate nuanced aspects of language generation.

Use Cases

Automated Quality Assessment

Replace manual review processes with automated scoring that can evaluate thousands of outputs consistently and at scale, using either deterministic rules or LLM-based evaluation.

Safety & Compliance Validation

Systematically check for harmful content, bias, PII leakage, and regulatory compliance. Ensure your applications meet organizational and legal standards before deployment.
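
A minimal sketch of what such a check might look like as a custom scorer is shown below; the pii_check name and the regex patterns are illustrative placeholders, not a complete compliance solution.

import re
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback


@scorer
def pii_check(outputs: str) -> Feedback:
    """Flag outputs that match simple email or phone-number patterns (illustrative only)."""
    patterns = {
        "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
        "phone": r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b",
    }
    found = [name for name, pattern in patterns.items() if re.search(pattern, outputs)]

    return Feedback(
        value="fail" if found else "pass",
        rationale=(
            f"Detected possible PII: {', '.join(found)}" if found else "No PII patterns detected"
        ),
        metadata={"matched_patterns": found},
    )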

A/B Testing & Model Comparison

Compare different models, prompts, or configurations using consistent evaluation criteria. Make data-driven decisions about which approach performs best for your use case.

Continuous Quality Monitoring

Track quality metrics over time in production, detect degradations early, and maintain high standards as your application evolves and scales.

Types of Scorers

MLflow provides several types of scorers to address different evaluation needs:

Agent-as-a-Judge

Autonomous agents that analyze execution traces to evaluate not just outputs, but the entire process. They can assess tool usage, reasoning chains, and error handling.

Human-Aligned Judges

LLM judges that have been fine-tuned with human feedback to match your specific quality standards. These provide the consistency of automation with the nuance of human judgment.

LLM-based Scorers (LLM-as-a-Judge)

Use large language models to evaluate subjective qualities like helpfulness, coherence, and style. These scorers can understand context and nuance that rule-based systems miss.

Code-based Scorers

Custom Python functions for deterministic evaluation. Ideal for metrics that can be computed algorithmically, such as ROUGE scores, exact match, or custom business logic.
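
For instance, an exact-match check can be a plain Python function decorated with @scorer. This is a minimal sketch that assumes the scorer receives an expectations dictionary carrying a ground_truth entry, as in the examples later on this page:

from mlflow.genai.scorers import scorer


@scorer
def exact_match(outputs: str, expectations: dict) -> bool:
    """Return True when the output matches the ground truth after basic normalization."""
    expected = expectations.get("ground_truth", "")
    return outputs.strip().lower() == expected.strip().lower()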

Scorer Output Structure

All scorers in MLflow produce standardized output that integrates seamlessly with the evaluation framework. Scorers return an mlflow.entities.Feedback object containing:

Field     | Type           | Description
--------- | -------------- | -----------
name      | str            | Unique identifier for the scorer (e.g., "correctness", "safety")
value     | Any            | The evaluation result - can be numeric, boolean, or categorical
rationale | Optional[str]  | Explanation of why this score was given (especially useful for LLM judges)
metadata  | Optional[dict] | Additional information about the evaluation (confidence, sub-scores, etc.)
error     | Optional[str]  | Error message if the scorer failed to evaluate
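
For illustration, a custom scorer could construct this object directly; the field values below are purely illustrative:

from mlflow.entities import Feedback

feedback = Feedback(
    name="correctness",
    value=0.9,
    rationale="The response covers the key facts from the reference answer.",
    metadata={"confidence": 0.85},
)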

Common Scorer Patterns

Agent-as-a-Judge (Trace-Based)

from mlflow.genai.judges import make_judge
import mlflow

# Create an Agent-as-a-Judge that analyzes execution patterns
efficiency_judge = make_judge(
    name="efficiency_analyzer",
    instructions=(
        "Analyze the {{ trace }} for inefficiencies.\n\n"
        "Check for:\n"
        "- Redundant API calls or database queries\n"
        "- Sequential operations that could be parallelized\n"
        "- Unnecessary data processing\n\n"
        "Rate as: 'efficient', 'acceptable', or 'inefficient'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

# Example: RAG application with retrieval and generation
from mlflow.entities import SpanType
import time


@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_context(query: str):
    # Simulate vector database retrieval
    time.sleep(0.5)  # Retrieval latency
    return [
        {"doc": "MLflow is an open-source platform", "score": 0.95},
        {"doc": "It manages the ML lifecycle", "score": 0.89},
        {"doc": "Includes tracking and deployment", "score": 0.87},
    ]


@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_user_history(user_id: str):
    # Another retrieval that could be parallelized
    time.sleep(0.5)  # Could run parallel with above
    return {"previous_queries": ["What is MLflow?", "How to log models?"]}


@mlflow.trace(span_type=SpanType.LLM)
def generate_response(query: str, context: list, history: dict):
    # Simulate LLM generation
    return f"Based on context about '{query}': MLflow is a platform for ML lifecycle management."


@mlflow.trace(span_type=SpanType.AGENT)
def rag_agent(query: str, user_id: str):
    # Sequential operations that could be optimized
    context = retrieve_context(query)
    history = retrieve_user_history(user_id)  # Could be parallel with above
    response = generate_response(query, context, history)
    return response


# Run the RAG agent
result = rag_agent("What is MLflow?", "user123")
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)

# Judge analyzes the trace to identify inefficiencies
feedback = efficiency_judge(trace=trace)
print(f"Efficiency: {feedback.value}")
print(f"Analysis: {feedback.rationale}")

LLM Judge (Field-Based)

from mlflow.genai.judges import make_judge

correctness_judge = make_judge(
    name="correctness",
    instructions=(
        "Evaluate if the response in {{ outputs }} "
        "correctly answers the question in {{ inputs }}."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

# Example usage
feedback = correctness_judge(
    inputs={"question": "What is MLflow?"},
    outputs={
        "response": "MLflow is an open-source platform for ML lifecycle management."
    },
)
print(f"Correctness: {feedback.value}")

Reading Level Assessment

import textstat
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback


@scorer
def reading_level(outputs: str) -> Feedback:
    """Evaluate text complexity using Flesch Reading Ease."""
    score = textstat.flesch_reading_ease(outputs)

    if score >= 60:
        level = "easy"
        rationale = f"Reading ease score of {score:.1f} - accessible to most readers"
    elif score >= 30:
        level = "moderate"
        rationale = f"Reading ease score of {score:.1f} - college level complexity"
    else:
        level = "difficult"
        rationale = f"Reading ease score of {score:.1f} - expert level required"

    return Feedback(value=level, rationale=rationale, metadata={"score": score})

Language Perplexity Scoring

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from mlflow.genai.scorers import scorer


@scorer
def perplexity_score(outputs: str) -> float:
    """Calculate perplexity to measure text quality and coherence."""
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    inputs = tokenizer(outputs, return_tensors="pt")
    with torch.no_grad():
        # Use a distinct name so the string parameter `outputs` is not shadowed
        model_outputs = model(**inputs, labels=inputs["input_ids"])

    perplexity = torch.exp(model_outputs.loss).item()
    return perplexity  # Lower is better - indicates more natural text

Response Latency Tracking

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace


@scorer
def response_time(trace: Trace) -> Feedback:
    """Evaluate response time from trace spans."""
    root_span = trace.data.spans[0]
    latency_ms = (root_span.end_time - root_span.start_time) / 1e6

    if latency_ms < 100:
        value = "fast"
    elif latency_ms < 500:
        value = "acceptable"
    else:
        value = "slow"

    return Feedback(
        value=value,
        rationale=f"Response took {latency_ms:.0f}ms",
        metadata={"latency_ms": latency_ms},
    )

Integration with MLflow Evaluation

Scorers are the building blocks of MLflow's evaluation framework. They integrate seamlessly with mlflow.genai.evaluate():

import mlflow
import pandas as pd

# Your test data
test_data = pd.DataFrame(
    [
        {
            "inputs": {"question": "What is MLflow?"},
            "outputs": {
                "response": "MLflow is an open-source platform for ML lifecycle management."
            },
            "expectations": {
                "ground_truth": "MLflow is an open-source platform for managing the ML lifecycle"
            },
        },
        {
            "inputs": {"question": "How do I track experiments?"},
            "outputs": {
                "response": "Use mlflow.start_run() to track experiments in MLflow."
            },
            "expectations": {
                "ground_truth": "Use mlflow.start_run() to track experiments"
            },
        },
    ]
)


# Your application (optional if data already has outputs)
def my_app(inputs):
    # Your model logic here
    return {"response": f"Answer to: {inputs['question']}"}


# Evaluate with multiple scorers
results = mlflow.genai.evaluate(
    data=test_data,
    # predict_fn is optional if data already has outputs
    scorers=[
        correctness_judge,  # LLM judge from above
        reading_level,  # Custom scorer from above
    ],
)

# Access evaluation metrics
print(f"Correctness: {results.metrics.get('correctness/mean', 'N/A')}")
print(f"Reading Level: {results.metrics.get('reading_level/mode', 'N/A')}")

Best Practices

  1. Choose the Right Scorer Type

    • Use code-based scorers for objective, deterministic metrics
    • Use LLM judges for subjective qualities requiring understanding
    • Use Agent-as-a-Judge for evaluating complex multi-step processes
  2. Combine Multiple Scorers

    • No single metric captures all aspects of quality
    • Use a portfolio of scorers to get comprehensive evaluation
    • Balance efficiency (fast code-based) with depth (LLM and Agent judges)
  3. Align with Human Judgment

    • Validate that your scorers correlate with human quality assessments
    • Use human feedback to improve LLM and Agent judge instructions
    • Consider using human-aligned judges for critical evaluations
  4. Monitor Scorer Performance

    • Track scorer execution time and costs
    • Monitor for scorer failures and handle them gracefully (see the sketch after this list)
    • Regularly review scorer outputs for consistency
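
To illustrate graceful failure handling, a scorer can catch exceptions and report them through the error field of a Feedback object rather than raising. This is a sketch, and external_quality_check stands in for any dependency of your own that might fail:

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback


def external_quality_check(text: str) -> float:
    # Hypothetical stand-in for a call that may fail (e.g., a remote quality service)
    if not text:
        raise ValueError("empty output")
    return min(len(text) / 100.0, 1.0)


@scorer
def guarded_quality(outputs: str) -> Feedback:
    """Wrap a potentially failing check so one failure does not break the whole evaluation run."""
    try:
        score = external_quality_check(outputs)
        return Feedback(value=score, rationale=f"Quality proxy score: {score:.2f}")
    except Exception as e:
        # Surface the failure through the error field instead of raising
        return Feedback(value=None, rationale="Quality check failed", error=e)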

Next Steps