Judge Dataset Integration
Evaluation datasets enable systematic testing and improvement of your custom LLM judges. By building datasets from traces and adding ground-truth labels, you can measure judge accuracy and pinpoint where a judge falls short.
Why Integrate Judges with Datasets?
Consistent Test Data
Evaluation datasets provide reproducible test cases, ensuring consistent judge performance measurement across iterations.
Ground Truth Comparison
Expectations in datasets serve as ground truth, enabling automatic accuracy measurement of judge evaluations.
Systematic Improvement
Track judge performance over time, identify weaknesses, and systematically improve through alignment.
Version Control
Track dataset versions so every evaluation run is fully reproducible.
Evaluation datasets require a SQL-based tracking backend, so configure one before creating your first dataset:
mlflow.set_tracking_uri("sqlite:///mlflow.db") # Or PostgreSQL/MySQL
Complete Example: Build → Evaluate → Improve
Here's how to build a dataset from traces, evaluate your judge, and improve accuracy:
import mlflow
from mlflow.genai.judges import make_judge
from mlflow.genai.datasets import create_dataset
# Set up MLflow with SQL backend (required for datasets)
mlflow.set_tracking_uri("sqlite:///mlflow.db")
# Step 1: Build dataset from traces
dataset = create_dataset(
    name="judge_accuracy_test",
    experiment_id="0",
    tags={"purpose": "judge_validation", "version": "1.0"},
)
# Get traces from an experiment and add to dataset
traces_df = mlflow.search_traces(
    experiment_ids=["0"], max_results=100
)  # Returns DataFrame with trace data
# Add traces directly - MLflow extracts inputs automatically
dataset.merge_records(traces_df)
print(f"Added {len(traces_df)} records to evaluation dataset")
# Step 2: Add ground truth expectations for accuracy measurement
edge_cases = [
    {
        "inputs": {"question": ""},  # Empty input
        "expectations": {"quality": "poor", "reason": "empty_input"},
    },
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {"quality": "good", "helpful": "yes"},
    },
    {
        "inputs": {"question": "URGENT!!! HELP!!!"},
        "expectations": {"quality": "poor", "reason": "no_clear_question"},
    },
]
dataset.merge_records(edge_cases)
# Step 3: Create judge and evaluate
quality_judge = make_judge(
    name="answer_quality",
    instructions=(
        "Evaluate if {{ outputs }} properly addresses {{ inputs }}. "
        "Rate as 'good', 'fair', or 'poor'."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
def my_app(question):
    # Your application logic
    return {"answer": f"Response to: {question}"}
# Evaluate judge performance
result = mlflow.genai.evaluate(data=dataset, scorers=[quality_judge], predict_fn=my_app)
# Step 4: Iterate and improve
# - Review results in MLflow UI
# - Add more test cases based on errors
# - Collect human feedback on judge outputs
# - Use alignment to improve judge accuracy
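One way to close the loop on Step 4 is to measure how often the judge agrees with the ground-truth expectations you added. The snippet below is a minimal sketch that assumes a judge created with make_judge can be called directly with inputs and outputs and returns a feedback object exposing the rating via .value; the exact invocation details may vary by MLflow version, so treat it as illustrative rather than canonical.

# Sketch: check judge agreement against ground-truth expectations.
# Assumes direct judge invocation returns a feedback object with a .value rating.
labeled_cases = [r for r in edge_cases if "quality" in r.get("expectations", {})]

correct = 0
for case in labeled_cases:
    outputs = my_app(**case["inputs"])  # generate the app's answer for this input
    feedback = quality_judge(inputs=case["inputs"], outputs=outputs)
    if feedback.value == case["expectations"]["quality"]:
        correct += 1

print(f"Judge agreement with ground truth: {correct / len(labeled_cases):.0%}")

Cases where the judge and the ground truth disagree are the ones worth adding to the dataset and reviewing with human annotators.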
Key Integration Points
Dataset Operations
# Create dataset
dataset = create_dataset(name="my_dataset", experiment_id="0")
# Add traces
traces_df = mlflow.search_traces(experiment_ids=["0"])
dataset.merge_records(traces_df)
# Add manual test cases
test_cases = [
    {"inputs": {...}, "expectations": {...}},
    {"inputs": {...}, "expectations": {...}},
]
dataset.merge_records(test_cases)
# Access dataset records
df = dataset.to_df()
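Since to_df() returns a pandas DataFrame, you can inspect the dataset like any other tabular data. The sketch below assumes the DataFrame exposes an expectations column; the exact schema may differ across MLflow versions, so adjust the column names to what your version returns.

# Sketch: summarize dataset contents (column names are assumptions).
df = dataset.to_df()
print(f"Total records: {len(df)}")
if "expectations" in df.columns:
    labeled = df["expectations"].notna().sum()
    print(f"Records with ground-truth expectations: {labeled}")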
Evaluation with Datasets
# Datasets work seamlessly with mlflow.genai.evaluate
result = mlflow.genai.evaluate(
    data=dataset,  # Pass dataset directly
    scorers=[judge1, judge2],
    predict_fn=my_app,  # Generate outputs at evaluation time
)
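The returned result bundles the aggregate scores with a pointer back to the MLflow run. The sketch below assumes the result object exposes metrics and run_id attributes, as in recent MLflow 3.x releases; adjust if your version differs.

# Sketch: inspect aggregate judge scores and locate the backing run.
# Attribute names are assumptions based on MLflow 3.x evaluation results.
print(result.metrics)  # per-scorer aggregate values
print(result.run_id)   # open this run in the MLflow UI for row-level detail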
Best Practices
- Start with Traces: Bootstrap datasets using traces from development or QA testing
- Add Edge Cases: Include problematic inputs to test judge robustness
- Label Strategically: Focus ground truth labels on critical or ambiguous cases
- Iterate Regularly: Continuously expand datasets as your application evolves
- Track Metrics: Log judge accuracy metrics to monitor improvement over time (see the sketch below)
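For the last point, a lightweight pattern is to log the judge's agreement rate as a standard MLflow metric each time you re-evaluate, so you can chart it across iterations. In this sketch, judge_accuracy is a placeholder for whatever agreement value you computed against your ground-truth expectations.

import mlflow

# Log the judge's agreement with ground truth so it can be charted over time.
# `judge_accuracy` is a placeholder for the value computed in your own check.
with mlflow.start_run(run_name="judge_accuracy_tracking"):
    mlflow.log_metric("judge_accuracy", judge_accuracy)
    mlflow.log_param("judge_name", "answer_quality")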
Learn More
Judge Alignment
Learn how to align judges with human feedback for improved accuracy.
Workflow Examples
See complete production patterns for judge development and deployment.
Custom LLM Judges
Return to the overview to explore more judge features and capabilities.