
Evaluation Datasets SDK Reference

Complete API reference for creating, managing, and querying evaluation datasets programmatically.

For general information and examples of how to use evaluation datasets, see the section on running evaluations.

SQL Backend Required

Evaluation Datasets require an MLflow Tracking Server with a SQL backend (PostgreSQL, MySQL, SQLite, or MSSQL). This feature is not available with FileStore (local file system-based tracking).
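
For local experimentation, a tracking server with a SQLite backend can be started like this (a minimal sketch; the database path, host, and port are illustrative):

```shell
# Start a tracking server backed by a local SQLite database
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --host 127.0.0.1 \
  --port 5000
```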

Creating a Dataset

Use mlflow.genai.datasets.create_dataset() to create a new evaluation dataset:

python
from mlflow.genai.datasets import create_dataset

# Create a new dataset
dataset = create_dataset(
    name="customer_support_qa",
    experiment_id=["0"],  # Link to experiments
    tags={"version": "1.0", "team": "ml-platform", "status": "active"},
)

print(f"Created dataset: {dataset.dataset_id}")

You can also use the mlflow.tracking.MlflowClient() API:

python
from mlflow import MlflowClient

client = MlflowClient()
dataset = client.create_dataset(
    name="customer_support_qa",
    experiment_id=["0"],
    tags={"version": "1.0"},
)

Adding Records to a Dataset

Use the mlflow.entities.EvaluationDataset.merge_records() method to add new records to your dataset. Records can be added from dictionaries, DataFrames, or traces:

Add records directly from Python dictionaries:

python
# Add records with inputs and expectations (ground truth)
new_records = [
    {
        "inputs": {"question": "What are your business hours?"},
        "expectations": {
            "expected_answer": "We're open Monday-Friday 9am-5pm EST",
            "must_mention_hours": True,
            "must_include_timezone": True,
        },
    },
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {
            "expected_answer": "Click 'Forgot Password' and follow the email instructions",
            "must_include_steps": True,
        },
    },
]

dataset.merge_records(new_records)
print(f"Dataset now has {len(dataset.records)} records")

Evaluation dataset schema

Evaluation datasets must use the schema described in this section.

Core fields

The following fields are used both by the evaluation dataset abstraction and when you pass data directly.

| Column | Data Type | Description | Required |
| --- | --- | --- | --- |
| inputs | dict[Any, Any] | Inputs for your app (e.g., user question, context), stored as a JSON-serializable dict. | Yes |
| expectations | dict[str, Any] | Ground truth labels, stored as a JSON-serializable dict. | Optional |

expectations reserved keys

expectations has several reserved keys that are used by built-in LLM judges and scorers: guidelines, expected_facts, expected_response, and expected_retrieved_context.

| Field | Used by | Description |
| --- | --- | --- |
| expected_facts | Correctness judge | List of facts that should appear |
| expected_response | Correctness judge | Exact or similar expected output |
| guidelines | Guidelines judge | Natural language rules to follow |
| expected_retrieved_context | document_recall scorer | Documents that should be retrieved |
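
For illustration, a hypothetical record that populates the reserved keys above might look like this (the question and labels are invented):

```python
record = {
    "inputs": {"question": "What is MLflow?"},
    "expectations": {
        # Reserved keys consumed by the built-in judges:
        "expected_facts": [
            "MLflow is open source",
            "MLflow manages the ML lifecycle",
        ],
        "expected_response": "MLflow is an open-source platform for the ML lifecycle.",
        "guidelines": "Answer in one sentence and avoid marketing language.",
    },
}
```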

Additional fields

The following fields are used by the evaluation dataset abstraction to track lineage and version history.

| Column | Data Type | Description | Required |
| --- | --- | --- | --- |
| dataset_record_id | string | The unique identifier for the record. | Automatically set if not provided. |
| create_time | timestamp | The time when the record was created. | Automatically set when inserting or updating. |
| created_by | string | The user who created the record. | Automatically set when inserting or updating. |
| last_update_time | timestamp | The time when the record was last updated. | Automatically set when inserting or updating. |
| last_updated_by | string | The user who last updated the record. | Automatically set when inserting or updating. |
| source | struct | The source of the dataset record. See Source field. | Optional |
| tags | dict[str, Any] | Key-value tags for the dataset record. | Optional |

Source field

The source field tracks where a dataset record came from. Each record can have only one source type.

Human source: Record created manually by a person

python
{
    "source": {
        "human": {"user_name": "jane.doe@company.com"}  # user who created the record
    }
}

Document source: Record synthesized from a document

python
{
    "source": {
        "document": {
            "doc_uri": "s3://bucket/docs/product-manual.pdf",  # URI or path to the source document
            "content": "The first 500 chars of the document...",  # optional excerpt or full content
        }
    }
}

Trace source: Record created from a production trace

python
{
    "source": {
        "trace": {
            "trace_id": "tr-abc123def456",  # unique identifier of the source trace
        }
    }
}

Updating Existing Records

The mlflow.entities.EvaluationDataset.merge_records() method handles updates intelligently. Records are matched on a hash of their inputs: if a record with identical inputs already exists, its expectations and tags are merged rather than duplicated:

python
# Initial record
dataset.merge_records(
    [
        {
            "inputs": {"question": "What is MLflow?"},
            "expectations": {
                "expected_answer": "MLflow is a platform for ML",
                "must_mention_tracking": True,
            },
        }
    ]
)

# Update with same inputs but enhanced expectations
dataset.merge_records(
    [
        {
            "inputs": {"question": "What is MLflow?"},  # Same inputs = update
            "expectations": {
                # Updates existing value
                "expected_answer": "MLflow is an open-source platform for managing the ML lifecycle",
                "must_mention_models": True,  # Adds new expectation
                # Note: "must_mention_tracking": True is preserved
            },
        }
    ]
)

# Result: One record with merged expectations
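
Conceptually, the expectation merge behaves like a per-key dictionary update; a minimal sketch of that semantics (not the actual MLflow implementation):

```python
def merge_expectations(existing: dict, incoming: dict) -> dict:
    # Keys present in the incoming record overwrite matching keys;
    # keys absent from it (e.g., must_mention_tracking) are preserved.
    merged = dict(existing)
    merged.update(incoming)
    return merged


merged = merge_expectations(
    {"expected_answer": "MLflow is a platform for ML", "must_mention_tracking": True},
    {"expected_answer": "MLflow is an open-source platform", "must_mention_models": True},
)
# merged keeps must_mention_tracking, updates expected_answer,
# and adds must_mention_models
```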

Retrieving Datasets

Retrieve existing datasets by ID or search for them:

python
from mlflow.genai.datasets import get_dataset

# Get a specific dataset by ID
dataset = get_dataset(dataset_id="d-7f2e3a9b8c1d4e5f")

# Access dataset properties
print(f"Name: {dataset.name}")
print(f"Records: {len(dataset.records)}")
print(f"Schema: {dataset.schema}")
print(f"Tags: {dataset.tags}")

Managing Tags

Add, update, or remove tags from datasets:

python
from mlflow.genai.datasets import set_dataset_tags, delete_dataset_tag

# Set or update tags
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={"status": "production", "validated": "true", "version": "2.0"},
)

# Delete a specific tag
delete_dataset_tag(dataset_id=dataset.dataset_id, key="deprecated")

Deleting a Dataset

Permanently delete a dataset and all its records:

python
from mlflow.genai.datasets import delete_dataset

# Delete the entire dataset
delete_dataset(dataset_id="d-1a2b3c4d5e6f7890")

Warning

Dataset deletion is permanent and cannot be undone. All records will be deleted.

Working with Dataset Records

The mlflow.entities.EvaluationDataset() object provides several ways to access and analyze records:

python
# Access all records
all_records = dataset.records

# Convert to DataFrame for analysis
df = dataset.to_df()
print(df.head())

# Serialize to a dictionary (e.g., for JSON export)
dataset_dict = dataset.to_dict()

# View dataset schema (inferred from the records)
print(dataset.schema)

# View dataset profile (statistics)
print(dataset.profile)

# Get record count
print(f"Total number of records: {len(dataset.records)}")

To recreate a dataset from a serialized dictionary:

python
from mlflow.genai.datasets import EvaluationDataset

dataset = EvaluationDataset.from_dict(dataset_dict)

Advanced Topics

Understanding Input Uniqueness

Records are considered unique based on their entire inputs dictionary. Even small differences create separate records:

python
# These are treated as different records due to different inputs
record_a = {
    "inputs": {"question": "What is MLflow?", "temperature": 0.7},
    "expectations": {"expected_answer": "MLflow is an ML platform"},
}

record_b = {
    "inputs": {"question": "What is MLflow?", "temperature": 0.8},  # Different temperature
    "expectations": {"expected_answer": "MLflow is an ML platform"},
}

dataset.merge_records([record_a, record_b])
# Results in 2 separate records due to different temperature values
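
One way to picture this identity rule is a canonical serialization of the inputs dict; a sketch (MLflow's actual hashing may differ):

```python
import json


def inputs_key(record: dict) -> str:
    # Serialize inputs with sorted keys so key order does not matter,
    # while any value difference (like temperature) changes the key.
    return json.dumps(record["inputs"], sort_keys=True)


key_a = inputs_key({"inputs": {"question": "What is MLflow?", "temperature": 0.7}})
key_b = inputs_key({"inputs": {"temperature": 0.7, "question": "What is MLflow?"}})
key_c = inputs_key({"inputs": {"question": "What is MLflow?", "temperature": 0.8}})
# key_a == key_b (same inputs, different key order); key_c differs
```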

Source Type Inference

MLflow assigns source types client-side, before records are sent to the tracking backend:

  • Automatic inference: when no explicit source is provided, MLflow infers the source type from the record's characteristics.
  • Client-side processing: inference happens in merge_records(), before records reach the tracking backend.
  • Manual override: you can always specify explicit source information to override automatic inference.

Inference Rules

Records from MLflow traces are automatically assigned the TRACE source type:

python
# When adding traces directly (automatic TRACE source)
traces = mlflow.search_traces(locations=["0"], return_type="list")
dataset.merge_records(traces)

# Or when using DataFrame from search_traces
traces_df = mlflow.search_traces(locations=["0"]) # Returns DataFrame
# Automatically detects traces and assigns TRACE source
dataset.merge_records(traces_df)

Manual Source Override

You can explicitly specify the source type and metadata for any record:

python
# Specify HUMAN source with metadata
human_curated = {
    "inputs": {"question": "What are your business hours?"},
    "expectations": {
        "expected_answer": "We're open Monday-Friday 9am-5pm EST",
        "must_include_timezone": True,
    },
    "source": {
        "source_type": "HUMAN",
        "source_data": {"curator": "support_team", "date": "2024-11-01"},
    },
}

# Specify DOCUMENT source
from_docs = {
    "inputs": {"question": "How to install MLflow?"},
    "expectations": {
        "expected_answer": "pip install mlflow",
        "must_mention_pip": True,
    },
    "source": {
        "source_type": "DOCUMENT",
        "source_data": {"document_id": "install_guide", "page": 1},
    },
}

dataset.merge_records([human_curated, from_docs])

Available Source Types

  • TRACE: Production data captured via MLflow tracing; automatically assigned when adding traces.
  • HUMAN: Subject matter expert annotations; inferred for records with expectations.
  • CODE: Programmatically generated tests; inferred for records without expectations.
  • DOCUMENT: Test cases from documentation or specs; must be explicitly specified.
  • UNSPECIFIED: Source unknown or not provided; for legacy or imported data.
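
The inference rules above can be summarized in a short sketch (a simplification for illustration, not MLflow's actual code):

```python
def infer_source_type(record: dict) -> str:
    # Explicit source information always wins.
    if "source" in record:
        return record["source"]["source_type"]
    # Records derived from traces carry a trace reference.
    if "trace_id" in record:
        return "TRACE"
    # Records with expectations look human-curated; bare records look generated.
    return "HUMAN" if record.get("expectations") else "CODE"
```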

Search Filter Reference

Searchable Fields

| Field | Type | Example |
| --- | --- | --- |
| name | string | name = 'production_tests' |
| tags.&lt;key&gt; | string | tags.status = 'validated' |
| created_by | string | created_by = 'alice@company.com' |
| last_updated_by | string | last_updated_by = 'bob@company.com' |
| created_time | timestamp | created_time > 1698800000000 |
| last_update_time | timestamp | last_update_time > 1698800000000 |

Filter Operators

  • =, !=: Exact match
  • LIKE, ILIKE: Pattern matching with % wildcard (ILIKE is case-insensitive)
  • >, <, >=, <=: Numeric/timestamp comparison
  • AND: Combine conditions (OR is not currently supported)
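
Since only AND is supported, filter strings can be assembled mechanically; a hypothetical helper (build_filter is not part of the MLflow API) for simple equality conditions:

```python
def build_filter(conditions: dict) -> str:
    # AND-join simple equality conditions into a filter string.
    # Values are single-quoted; comparison operators would need separate handling.
    return " AND ".join(f"{field} = '{value}'" for field, value in conditions.items())


filter_string = build_filter({"tags.status": "validated", "created_by": "alice@company.com"})
# "tags.status = 'validated' AND created_by = 'alice@company.com'"
```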

Common Filter Examples

| Filter Expression | Description | Use Case |
| --- | --- | --- |
| name = 'production_qa' | Exact name match | Find a specific dataset |
| name LIKE '%test%' | Pattern matching | Find all test datasets |
| tags.status = 'validated' | Tag equality | Find production-ready datasets |
| tags.version = '2.0' AND tags.team = 'ml' | Multiple tag conditions | Find team-specific versions |
| created_by = 'alice@company.com' | Creator filter | Find datasets by author |
| created_time > 1698800000000 | Time-based filter | Find recent datasets |

python
from mlflow.genai.datasets import search_datasets

# Complex filter example
datasets = search_datasets(
    filter_string="""
        tags.status = 'production'
        AND name LIKE '%customer%'
        AND created_time > 1698800000000
    """,
    order_by=["last_update_time DESC"],
)

Next Steps