Evaluation Datasets

Transform Your GenAI Testing with Structured Evaluation Data

Evaluation datasets are the foundation of systematic GenAI application testing. They provide a centralized way to manage test data, ground truth expectations, and evaluation results—enabling you to measure and improve the quality of your AI applications with confidence.

Quickstart: Build Your First Evaluation Dataset

There are several ways to create evaluation datasets, each suited to a different stage of your GenAI development process. The quickstart below builds a dataset from traces annotated with expectations. Expectations are the cornerstone of effective evaluation: they define the ground truth against which your AI's outputs are measured, enabling systematic quality assessment across iterations.

import mlflow
from mlflow.genai.datasets import create_dataset

# Create your evaluation dataset
dataset = create_dataset(
    name="production_validation_set",
    experiment_id=["0"],  # "0" is the default experiment
    tags={"team": "ml-platform", "stage": "validation"},
)

# First, retrieve traces that will become the basis of the dataset
# Request list format to work with individual Trace objects
traces = mlflow.search_traces(
    experiment_ids=["0"],
    max_results=50,
    filter_string="attributes.trace_name = 'chat_completion'",
    return_type="list",  # Returns list[Trace] for direct manipulation
)

# Add expectations to the traces
for trace in traces[:20]:
    # Expectations can be structured metrics
    mlflow.log_expectation(
        trace_id=trace.info.trace_id,
        name="output_quality",
        value={"relevance": 0.95, "accuracy": 1.0, "contains_citation": True},
    )

    # They can also be specific expected text
    mlflow.log_expectation(
        trace_id=trace.info.trace_id,
        name="expected_answer",
        value="The correct answer should include step-by-step instructions for password reset with email verification",
    )

# Retrieve the traces with added expectations
annotated_traces = mlflow.search_traces(
    experiment_ids=["0"],
    max_results=100,
    return_type="list",  # Get list[Trace] objects
)

# Merge the list of Trace objects directly into your dataset
dataset.merge_records(annotated_traces)
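
Once the dataset is materialized, you can run an evaluation against it. The following is a minimal sketch: my_app and has_citation are hypothetical stand-ins (it assumes each record's inputs contain a single question field); substitute your own application entry point and the scorers you actually use.

from mlflow.genai.scorers import scorer

# A simple illustrative scorer: passes when the output contains a link.
@scorer
def has_citation(outputs) -> bool:
    return "http" in str(outputs)

# Hypothetical stand-in for your application's entry point.
def my_app(question: str) -> str:
    return "See https://example.com/reset for step-by-step reset instructions."

# Run every record in the dataset through the app and score the results.
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=my_app,
    scorers=[has_citation],
)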

Why Evaluation Datasets?

Centralized Test Management

Store all your test cases, expected outputs, and evaluation criteria in one place. No more scattered CSV files or hardcoded test data.

Consistent Evaluation Source

Maintain a concrete representation of test data that can be reused as your project evolves. Eliminate ad-hoc manual testing and avoid reassembling evaluation data for every iteration.

Systematic Testing

Move beyond ad-hoc testing to systematic evaluation. Define clear expectations and measure performance consistently across deployments.

Collaborative Improvement

Enable your entire team to contribute test cases and expectations. Share evaluation datasets across projects and teams.

The Evaluation Loop

Evaluation datasets bridge the critical gap between trace generation and evaluation execution in the GenAI development lifecycle. As you test your application and capture traces with expectations, evaluation datasets transform these individual test cases into a materialized, reusable evaluation suite. This creates a consistent and evolving collection of evaluation records that grows with your application—each iteration adds new test cases while preserving the historical test coverage. Rather than losing valuable test scenarios after each development cycle, you build a comprehensive evaluation asset that can immediately assess the quality of changes and improvements to your implementation.

The loop: Iterate on Code → Test App → Collect Traces → Add Expectations → Create Dataset → Run Evaluation → Analyze Results → Iterate & Improve, then back to the code.

Key Features

Ground Truth Management

Define and maintain expected outputs for your test cases. Capture expert knowledge about what constitutes correct behavior for your AI system.

Schema Evolution

Automatically track the structure of your test data as it evolves. Add new fields and test dimensions without breaking existing evaluations.
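
For example, new records can introduce fields that earlier records did not have. A minimal sketch, assuming merge_records accepts plain dictionaries with inputs and expectations keys (the field names are illustrative):

# Records that add a new "locale" input field to the existing dataset.
new_records = [
    {
        "inputs": {"question": "How do I reset my password?", "locale": "en-US"},
        "expectations": {
            "expected_answer": "Step-by-step reset instructions with email verification",
        },
    },
]

# The dataset schema expands to include the new field;
# previously merged records remain valid and simply lack a value for it.
dataset.merge_records(new_records)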

Incremental Updates

Continuously improve your test suite by adding new cases from production. Update expectations as your understanding of correct behavior evolves.
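
One common pattern, sketched here under the assumption of a timestamp-based trace filter, is to periodically pull recent production traces and fold them into the existing dataset:

# Pull traces logged after a cutoff (the value is a placeholder epoch in milliseconds).
recent_traces = mlflow.search_traces(
    experiment_ids=["0"],
    filter_string="attributes.timestamp_ms > 1735689600000",
    return_type="list",
)

# Fold the new cases into the existing dataset; existing records are preserved.
dataset.merge_records(recent_traces)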

Flexible Tagging

Organize datasets with tags for easy discovery and filtering. Track metadata like data sources, annotation guidelines, and quality levels.

Performance Tracking

Monitor how your application performs against the same test data over time. Identify regressions and improvements across deployments.
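
For example, evaluating each new application version against the same dataset produces directly comparable runs. A sketch, reusing the hypothetical has_citation scorer from the quickstart follow-up and an updated entry point my_app_v2:

# Same dataset, same scorers, new application version.
results_v2 = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=my_app_v2,  # hypothetical: the updated entry point
    scorers=[has_citation],
)
# Compare this evaluation run with earlier ones in the experiment to spot regressions.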

Experiment Integration

Link datasets to MLflow experiments for complete traceability. Understand which test data was used for each model evaluation.

Next Steps

Ready to improve your GenAI testing? Start with these resources: