Judge Dataset Integration
Evaluation datasets enable systematic testing and improvement of your custom LLM judges. By building datasets from traces and adding ground-truth labels, you can measure judge accuracy and pinpoint where a judge falls short.
Why Integrate Judges with Datasets?
Consistent Test Data
Evaluation datasets provide reproducible test cases, ensuring consistent judge performance measurement across iterations.
Ground Truth Comparison
Expectations in datasets serve as ground truth, enabling automatic accuracy measurement of judge evaluations.
Systematic Improvement
Track judge performance over time, identify weaknesses, and systematically improve through alignment.
Version Control
Track dataset versions so every evaluation run is fully reproducible.
Evaluation datasets require a SQL-based tracking backend, so configure one before creating your first dataset:
mlflow.set_tracking_uri("sqlite:///mlflow.db") # Or PostgreSQL/MySQL
Complete Example: Build → Evaluate → Improve
Here's how to build a dataset from traces, evaluate your judge, and improve accuracy:
import mlflow
from mlflow.genai.judges import make_judge
from mlflow.genai.datasets import create_dataset
# Set up MLflow with SQL backend (required for datasets)
mlflow.set_tracking_uri("sqlite:///mlflow.db")
# Step 1: Build dataset from traces
dataset = create_dataset(
    name="judge_accuracy_test",
    experiment_id="0",
    tags={"purpose": "judge_validation", "version": "1.0"},
)
# Get traces from an experiment and add to dataset
traces_df = mlflow.search_traces(
    experiment_ids=["0"], max_results=100
)  # Returns DataFrame with trace data
# Add traces directly - MLflow extracts inputs automatically
dataset.merge_records(traces_df)
print(f"Added {len(traces_df)} records to evaluation dataset")
# Step 2: Add ground truth expectations for accuracy measurement
edge_cases = [
    {
        "inputs": {"question": ""},  # Empty input
        "expectations": {"quality": "poor", "reason": "empty_input"},
    },
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {"quality": "good", "helpful": "yes"},
    },
    {
        "inputs": {"question": "URGENT!!! HELP!!!"},
        "expectations": {"quality": "poor", "reason": "no_clear_question"},
    },
]
dataset.merge_records(edge_cases)
# Step 3: Create judge and evaluate
quality_judge = make_judge(
    name="answer_quality",
    instructions=(
        "Evaluate if {{ outputs }} properly addresses {{ inputs }}. "
        "Rate as 'good', 'fair', or 'poor'."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
def my_app(question):
    # Your application logic
    return {"answer": f"Response to: {question}"}
# Evaluate judge performance
result = mlflow.genai.evaluate(data=dataset, scorers=[quality_judge], predict_fn=my_app)
# Step 4: Iterate and improve
# - Review results in MLflow UI
# - Add more test cases based on errors
# - Collect human feedback on judge outputs
# - Use alignment to improve judge accuracy
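One way to close the loop on Step 4 is to measure how often the judge agrees with the ground-truth expectations you added. The snippet below is a minimal sketch that assumes a judge created with make_judge can be called directly with inputs and outputs and returns a feedback object exposing the rating via .value; the exact invocation details may vary by MLflow version, so treat it as illustrative rather than canonical.

# Sketch: check judge agreement against ground-truth expectations.
# Assumes direct judge invocation returns a feedback object with a .value rating.
labeled_cases = [r for r in edge_cases if "quality" in r.get("expectations", {})]

correct = 0
for case in labeled_cases:
    outputs = my_app(**case["inputs"])  # generate the app's answer for this input
    feedback = quality_judge(inputs=case["inputs"], outputs=outputs)
    if feedback.value == case["expectations"]["quality"]:
        correct += 1

print(f"Judge agreement with ground truth: {correct / len(labeled_cases):.0%}")

Cases where the judge and the ground truth disagree are the ones worth adding to the dataset and reviewing with human annotators.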
Key Integration Points
Dataset Operations
# Create dataset
dataset = create_dataset(name="my_dataset", experiment_id="0")
# Add traces
traces_df = mlflow.search_traces(experiment_ids=["0"])
dataset.merge_records(traces_df)
# Add manual test cases
test_cases = [
    {"inputs": {...}, "expectations": {...}},
    {"inputs": {...}, "expectations": {...}},
]
dataset.merge_records(test_cases)
# Access dataset records
df = dataset.to_df()
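Since to_df() returns a pandas DataFrame, you can inspect the dataset like any other tabular data. The sketch below assumes the DataFrame exposes an expectations column; the exact schema may differ across MLflow versions, so adjust the column names to what your version returns.

# Sketch: summarize dataset contents (column names are assumptions).
df = dataset.to_df()
print(f"Total records: {len(df)}")
if "expectations" in df.columns:
    labeled = df["expectations"].notna().sum()
    print(f"Records with ground-truth expectations: {labeled}")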
Evaluation with Datasets
# Datasets work seamlessly with mlflow.genai.evaluate
result = mlflow.genai.evaluate(
    data=dataset,  # Pass dataset directly
    scorers=[judge1, judge2],
    predict_fn=my_app,  # Generate outputs at evaluation time
)
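The returned result bundles the aggregate scores with a pointer back to the MLflow run. The sketch below assumes the result object exposes metrics and run_id attributes, as in recent MLflow 3.x releases; adjust if your version differs.

# Sketch: inspect aggregate judge scores and locate the backing run.
# Attribute names are assumptions based on MLflow 3.x evaluation results.
print(result.metrics)  # per-scorer aggregate values
print(result.run_id)   # open this run in the MLflow UI for row-level detail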
Best Practices
- Start with Traces: Bootstrap datasets using traces from development or QA testing
- Add Edge Cases: Include problematic inputs to test judge robustness
- Label Strategically: Focus ground truth labels on critical or ambiguous cases
- Iterate Regularly: Continuously expand datasets as your application evolves
- Track Metrics: Log judge accuracy metrics to monitor improvement over time (see the sketch below)
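For the last point, a lightweight pattern is to log the judge's agreement rate as a standard MLflow metric each time you re-evaluate, so you can chart it across iterations. In this sketch, judge_accuracy is a placeholder for whatever agreement value you computed against your ground-truth expectations.

import mlflow

# Log the judge's agreement with ground truth so it can be charted over time.
# `judge_accuracy` is a placeholder for the value computed in your own check.
with mlflow.start_run(run_name="judge_accuracy_tracking"):
    mlflow.log_metric("judge_accuracy", judge_accuracy)
    mlflow.log_param("judge_name", "answer_quality")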
Learn More
Judge Alignment
Learn how to align judges with human feedback for improved accuracy.
Workflow Examples
See complete production patterns for judge development and deployment.
Custom LLM Judges
Return to the overview to explore more judge features and capabilities.