
Evaluation Datasets SDK Guide

Master the APIs for creating, evolving, and managing evaluation datasets through practical workflows and real-world patterns.

Getting Started

MLflow provides a fluent API for working with evaluation datasets that makes common workflows simple and intuitive:

from mlflow.genai.datasets import (
    create_dataset,
    get_dataset,
    search_datasets,
    set_dataset_tags,
    delete_dataset_tag,
)

Your Dataset Journey

Follow this typical workflow to build and evolve your evaluation datasets:

Complete Development Workflow (continuous improvement loop):

Create/Get Dataset → Add Test Cases → Run Evaluation → Improve Code → Test & Trace → Update Dataset → Update Tags → repeat
Step 1: Create Your Dataset

Start by creating a new evaluation dataset with meaningful metadata using the mlflow.genai.datasets.create_dataset() API:

from mlflow.genai.datasets import create_dataset

# Create a new dataset with tags for organization
dataset = create_dataset(
    name="customer_support_qa_v1",
    experiment_id=["0"],  # Link to experiments ("0" is the default experiment)
    tags={
        "version": "1.0",
        "purpose": "regression_testing",
        "model": "gpt-4",
        "team": "ml-platform",
        "status": "development",
    },
)

Step 2: Add Your First Test Cases

Build your dataset by adding test cases from production traces and manual curation. Expectations are typically defined by subject matter experts (SMEs) who understand the domain and can establish ground truth for what constitutes correct behavior.

Expectations are the ground truth values that define what your AI should produce. They're added by SMEs who review outputs and establish quality standards. To learn how to define expectations, see the expectations guide.

import mlflow

# Search for production traces to build your dataset
# Request list format to work with individual Trace objects
production_traces = mlflow.search_traces(
    experiment_ids=["0"],  # Your production experiment
    filter_string="attributes.user_feedback = 'positive'",
    max_results=100,
    return_type="list",  # Returns list[Trace] for direct manipulation
)

# Subject matter experts add expectations to define correct behavior
for trace in production_traces:
    # SMEs review each trace and define what the output should satisfy
    mlflow.log_expectation(
        trace_id=trace.info.trace_id,
        name="quality_assessment",
        value={
            "should_match_production": True,
            "minimum_quality": 0.8,
            "response_time_ms": 2000,
            "contains_citation": True,
        },
    )

    # Textual expectations are also supported
    mlflow.log_expectation(
        trace_id=trace.info.trace_id,
        name="expected_behavior",
        value="Response should provide step-by-step instructions with security considerations",
    )

# Add annotated traces to the dataset (expectations are automatically included)
dataset.merge_records(production_traces)

Step 3: Evolve Your Dataset

As you discover edge cases and improve your understanding, continuously update your dataset. The mlflow.entities.EvaluationDataset.merge_records() method intelligently handles both new records and updates to existing ones:

# Capture a production failure
failure_case = {
    "inputs": {"question": "'; DROP TABLE users; --", "user_type": "malicious"},
    "expectations": {
        "handles_sql_injection": True,
        "returns_safe_response": True,
        "logs_security_event": True,
    },
    "source": {
        "source_type": "HUMAN",
        "source_data": {"discovered_by": "security_team"},
    },
    "tags": {"category": "security", "severity": "critical"},
}

# Add the new edge case
dataset.merge_records([failure_case])

# Update expectations for existing records
updated_records = []
for record in dataset.records:
    if "accuracy" in record.get("expectations", {}):
        # Raise the quality bar to at least 0.9
        record["expectations"]["accuracy"] = max(
            0.9, record["expectations"]["accuracy"]
        )
        updated_records.append(record)

# Merge updates (existing records are matched and updated, not duplicated)
dataset.merge_records(updated_records)

Step 4: Organize with Tags

Use tags to track dataset evolution and enable powerful searches. To build datasets from production traces, see mlflow.search_traces().

from mlflow.genai.datasets import set_dataset_tags

# Update dataset metadata
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={
        "status": "validated",
        "coverage": "comprehensive",
        "last_review": "2024-11-01",
    },
)

# Remove outdated tags
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={"development_only": None},  # Setting a tag to None removes it
)

Step 5: Search and Discover

Find datasets using powerful search capabilities with mlflow.genai.datasets.search_datasets():

from mlflow.genai.datasets import search_datasets

# Find datasets by experiment
datasets = search_datasets(experiment_ids=["0", "1"])  # Search across multiple experiments

# Search by name pattern
regression_datasets = search_datasets(filter_string="name LIKE '%regression%'")

# Complex search with tags
production_ready = search_datasets(
    filter_string="tags.status = 'validated' AND tags.coverage = 'comprehensive'",
    order_by=["last_update_time DESC"],
    max_results=10,
)

# The PagedList automatically handles pagination when iterating
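
For example, you can loop over the results directly and let pagination happen behind the scenes. A minimal sketch (the tag filter and printed fields below are illustrative):

# Additional pages are fetched transparently as you iterate
for ds in search_datasets(filter_string="tags.team = 'ml-platform'"):
    print(ds.dataset_id, ds.name)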

Common Filter String Examples

Here are practical examples of filter strings to help you find the right datasets:

| Filter Expression | Description | Use Case |
| --- | --- | --- |
| name = 'production_qa' | Exact name match | Find a specific dataset |
| name LIKE '%test%' | Pattern matching | Find all test datasets |
| tags.status = 'validated' | Tag equality | Find production-ready datasets |
| tags.version = '2.0' AND tags.team = 'ml' | Multiple tag conditions | Find team-specific versions |
| created_by = 'alice@company.com' | Creator filter | Find datasets by author |
| created_time > 1698800000000 | Time-based filter | Find recent datasets |
| tags.model = 'gpt-4' AND name LIKE '%eval%' | Combined conditions | Model-specific evaluation sets |
| last_updated_by != 'bot@system' | Exclusion filter | Exclude automated updates |

Step 6: Manage Experiment Associations

Datasets can be dynamically associated with experiments after creation using mlflow.genai.datasets.add_dataset_to_experiments() and mlflow.genai.datasets.remove_dataset_from_experiments().

This functionality enables several important use cases:

  • Cross-team collaboration: Share datasets across teams by adding their experiment IDs
  • Lifecycle management: Remove outdated experiment associations as projects mature
  • Project reorganization: Dynamically reorganize datasets as your project structure evolves

from mlflow.genai.datasets import (
    add_dataset_to_experiments,
    remove_dataset_from_experiments,
)

# Add dataset to additional experiments
dataset = add_dataset_to_experiments(
    dataset_id="d-1a2b3c4d5e6f7890abcdef1234567890",
    experiment_ids=["3", "4", "5"],
)
print(f"Dataset now linked to experiments: {dataset.experiment_ids}")

# Remove dataset from specific experiments
dataset = remove_dataset_from_experiments(
    dataset_id="d-1a2b3c4d5e6f7890abcdef1234567890",
    experiment_ids=["3"],
)
print(f"Updated experiment associations: {dataset.experiment_ids}")

The Active Record Pattern

The EvaluationDataset object follows an active record pattern: it is both a data container and an interface for interacting with the backend:

# Get a dataset
dataset = get_dataset(dataset_id="d-1a2b3c4d5e6f7890abcdef1234567890")

# The dataset object is "live" - it can fetch and update data
current_record_count = len(dataset.records) # Lazy loads if needed

# Add new records directly on the object
new_records = [
    {
        "inputs": {"question": "What are your business hours?"},
        "expectations": {"mentions_hours": True, "includes_timezone": True},
    }
]
dataset.merge_records(new_records)  # Updates the backend immediately

# Convert to DataFrame for analysis
df = dataset.to_df()
# Access auto-computed properties
schema = dataset.schema # Field structure
profile = dataset.profile # Dataset statistics

How Record Merging Works

The merge_records() method intelligently handles both new records and updates to existing ones. Records are matched by a hash of their inputs: if a record with identical inputs already exists, its expectations and tags are updated rather than a duplicate record being created.

When you add records for the first time, they're stored with their inputs, expectations, and metadata:

# Initial record
record_v1 = {
    "inputs": {"question": "What is MLflow?", "context": "ML platform overview"},
    "expectations": {"accuracy": 0.8, "mentions_tracking": True},
}

dataset.merge_records([record_v1])
# Creates a new record in the dataset
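
Merging a record with the same inputs later updates the existing entry instead of adding a duplicate. A minimal sketch of the update path (the revised expectation values below are illustrative):

# Later revision: identical inputs, refined expectations
record_v2 = {
    "inputs": {"question": "What is MLflow?", "context": "ML platform overview"},
    "expectations": {"accuracy": 0.9, "mentions_tracking": True, "mentions_registry": True},
}

dataset.merge_records([record_v2])
# Matched to the existing record by its input hash; the expectations are
# updated in place rather than a second record being created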

Understanding Source Types

MLflow tracks the provenance of each record in your evaluation dataset through source types. This helps you understand where your test data came from and analyze performance by data source.

Source Type Behavior

  • Automatic Inference: MLflow automatically infers source types based on record characteristics when no explicit source is provided.
  • Manual Override: You can always specify explicit source information to override automatic inference.
  • Provenance Tracking: Source types enable filtering and analysis of performance by data origin.

Automatic Source Assignment

MLflow automatically assigns source types based on the characteristics of your records:

Records created from MLflow traces are automatically assigned the TRACE source type:

# When adding traces directly (automatic TRACE source)
traces = mlflow.search_traces(experiment_ids=["0"], return_type="list")
dataset.merge_records(traces) # All records get TRACE source type

# Or when using DataFrame from search_traces
traces_df = mlflow.search_traces(experiment_ids=["0"]) # Returns DataFrame
dataset.merge_records(traces_df)  # Automatically detects traces and assigns the TRACE source

Manual Source Specification

You can explicitly specify the source type and metadata for any record. When no explicit source is provided, MLflow automatically infers the source type before sending records to the backend using these rules:

  • Records with expectations → Inferred as HUMAN source (indicates manual annotation or ground truth)
  • Records with only inputs (no expectations) → Inferred as CODE source (indicates programmatic generation)
  • Records from traces → Always marked as TRACE source (regardless of expectations)

This inference happens client-side in the merge_records() method before records are sent to the tracking backend. You can override this automatic inference by providing explicit source information:

# Specify HUMAN source for manually curated test cases
human_curated = {
    "inputs": {"question": "What are your business hours?"},
    "expectations": {"accuracy": 1.0, "includes_timezone": True},
    "source": {
        "source_type": "HUMAN",
        "source_data": {"curator": "support_team", "date": "2024-11-01"},
    },
}

# Specify DOCUMENT source for data from documentation
from_docs = {
    "inputs": {"question": "How to install MLflow?"},
    "expectations": {"mentions_pip": True, "mentions_conda": True},
    "source": {
        "source_type": "DOCUMENT",
        "source_data": {"document_id": "install_guide", "page": 1},
    },
}

# Specify CODE source for programmatically generated data
generated = [
    {
        "inputs": {"question": f"Test question {i}"},
        "source": {
            "source_type": "CODE",
            "source_data": {"generator": "test_suite_v2", "seed": 42},
        },
    }
    for i in range(100)
]

dataset.merge_records([human_curated, from_docs] + generated)

Available Source Types

Source types enable powerful filtering and analysis of your evaluation results. You can analyze performance by data origin to understand if your model performs differently on human-curated vs. generated test cases, or production traces vs. documentation examples.

TRACE

Production data captured via MLflow tracing - automatically assigned when adding traces

HUMAN

Subject matter expert annotations - inferred for records with expectations

CODE

Programmatically generated tests - inferred for records without expectations

DOCUMENT

Test cases from documentation or specs - must be explicitly specified

UNSPECIFIED

Source unknown or not provided - for legacy or imported data

Analyzing Data by Source

# Convert dataset to DataFrame for analysis
df = dataset.to_df()

# Check source type distribution
source_distribution = df["source_type"].value_counts()
print("Data sources in dataset:")
for source_type, count in source_distribution.items():
    print(f"  {source_type}: {count} records")

Search Filter Reference

Use these fields in your filter strings. Note: The fluent API returns a PagedList that can be iterated directly - pagination is handled automatically when you iterate over the results.

| Field | Type | Example |
| --- | --- | --- |
| name | string | name = 'production_tests' |
| tags.<key> | string | tags.status = 'validated' |
| created_by | string | created_by = 'alice@company.com' |
| last_updated_by | string | last_updated_by = 'bob@company.com' |
| created_time | timestamp | created_time > 1698800000000 |
| last_update_time | timestamp | last_update_time > 1698800000000 |

Filter Operators

  • =, !=: Exact match
  • LIKE, ILIKE: Pattern matching with % wildcard (ILIKE is case-insensitive)
  • >, <, >=, <=: Numeric/timestamp comparison
  • AND: Combine conditions (OR is not currently supported for evaluation datasets)
# Complex filter example
datasets = search_datasets(
    filter_string="""
        tags.status = 'production'
        AND name LIKE '%customer%'
        AND created_time > 1698800000000
    """,
    order_by=["last_update_time DESC"],
)
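
ILIKE works the same way as LIKE but ignores case, which is handy when naming conventions vary; for example:

# Case-insensitive pattern match (matches "QA", "qa", "Qa", ...)
qa_datasets = search_datasets(filter_string="name ILIKE '%qa%'")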

Using the Client API

For applications and advanced use cases, you can also use the MlflowClient API, which provides the same functionality through an object-oriented interface:

from mlflow import MlflowClient

client = MlflowClient()

# Create a dataset
dataset = client.create_dataset(
    name="customer_support_qa",
    experiment_id=["0"],
    tags={"version": "1.0", "team": "ml-platform"},
)

The client API provides the same capabilities as the fluent API but is better suited for:

  • Production applications that need explicit client management
  • Scenarios requiring custom tracking URIs or authentication (see the sketch below)
  • Integration with existing MLflow client-based workflows
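
For instance, pointing the client at a dedicated tracking server only requires a constructor argument. A minimal sketch (the URI below is a placeholder for your own server):

from mlflow import MlflowClient

# Placeholder URI - replace with your tracking server or database URI
client = MlflowClient(tracking_uri="http://mlflow.internal.example.com:5000")

dataset = client.create_dataset(
    name="customer_support_qa",
    experiment_id=["0"],
    tags={"version": "1.0", "team": "ml-platform"},
)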

Next Steps