Evaluating LLMs/Agents with MLflow
MLflow's evaluation and monitoring capabilities help you systematically measure, improve, and maintain the quality of your GenAI applications throughout their lifecycle, from development through production.

A core tenet of MLflow's evaluation capabilities is Evaluation-Driven Development, an emerging practice for tackling the challenge of building high-quality LLM and agentic applications. MLflow is an end-to-end platform designed to support this practice and help you deploy AI applications with confidence.

Key Capabilities
Evaluation Datasets
Your Test Data Foundation
Before you can evaluate your GenAI application, you need test data. Evaluation Datasets provide a centralized repository for managing test cases, ground truth expectations, and evaluation data at scale.
Think of Evaluation Datasets as your "test database" - a single source of truth for all the data needed to evaluate your AI systems. They transform ad-hoc testing into systematic quality assurance.
Key Capabilities
- Centralized Management: Store all test data in one place
- Production Data: Build from real traces
- Ground Truth: Track expectations alongside inputs
- Team Collaboration: Share datasets across projects
Build Your First Dataset in Minutes
Create evaluation datasets from production traces, manual test cases, or existing data. MLflow makes it simple to organize your test data and expectations for systematic evaluation.
Start with traces from production, add ground truth expectations, and transform them into reusable evaluation datasets that power your testing workflow.
from mlflow.genai.datasets import create_dataset

# Create a dataset from production traces
dataset = create_dataset(
    name="customer_support_qa", experiment_id=["0"], tags={"stage": "validation"}
)

# Add test cases with expectations
dataset.merge_records(
    [
        {
            "inputs": {"question": "How to reset password?"},
            "expectations": {"contains_steps": True, "mentions_security": True},
        }
    ]
)
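Once created, a dataset can be passed directly to MLflow's evaluation harness, which is covered in "Running an Evaluation" below. The following is a minimal sketch, assuming the dataset object created above, an OpenAI API key in the environment, and an illustrative answer_fn standing in for your real application.
import openai
import mlflow
from mlflow.genai.scorers import Guidelines

client = openai.OpenAI()

# Illustrative prediction function; replace with your application's entry point
def answer_fn(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

# Evaluate the app against the stored test cases
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=answer_fn,
    scorers=[
        Guidelines(
            name="includes_steps",
            guidelines="How-to answers must include step-by-step instructions.",
        )
    ],
)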
Feedback & Expectations
- Systematic Evaluation
- Human Feedback
- LLM-as-a-Judge
- Production Monitoring
- Dataset Collection
Evaluate and Enhance Quality
Systematically assessing and improving the quality of GenAI applications is a challenge. MLflow provides a comprehensive set of tools to help you evaluate and enhance the quality of your applications.
As the industry's most trusted experiment tracking platform, MLflow provides a strong foundation for tracking your evaluation results and collaborating effectively with your team.
Track Annotations and Human Feedback
Human feedback is essential for building high-quality GenAI applications that meet user expectations. MLflow supports collecting, managing, and utilizing feedback from end-users and domain experts.
Feedback is attached to traces and recorded with metadata, including the user, timestamp, and revisions.
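For illustration, the following is a minimal sketch of attaching human feedback to an existing trace with mlflow.log_feedback; the trace ID, feedback name, and reviewer ID are placeholders.
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Attach a piece of human feedback to an existing trace (placeholder values)
mlflow.log_feedback(
    trace_id="<your-trace-id>",
    name="helpfulness",
    value=True,
    rationale="The answer fully resolved the user's question.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="reviewer@example.com",
    ),
)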
Scale Quality Assessment with Automation
Quality assessment is a critical part of building high-quality GenAI applications; however, it is often time-consuming and requires human expertise. LLMs are powerful tools for automating quality assessment.
MLflow offers various built-in LLM-as-a-Judge scorers to help automate the process, as well as a flexible toolset to build your own LLM judges with ease.
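For example, a built-in judge such as RelevanceToQuery can be combined with a Guidelines judge that encodes your own criteria in plain language. The following is a minimal sketch, assuming an LLM judge provider (such as an OpenAI API key) is configured.
from mlflow.genai.scorers import Guidelines, RelevanceToQuery

# Built-in LLM judge: does the response actually address the question?
relevance = RelevanceToQuery()

# Custom LLM judge: natural-language criteria specific to your application
tone = Guidelines(
    name="professional_tone",
    guidelines="The answer must be polite and written in a professional tone.",
)

# Pass both to mlflow.genai.evaluate(..., scorers=[relevance, tone])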
Monitor Applications in Production
Understanding and optimizing GenAI application performance is crucial for efficient operations. MLflow Tracing captures key metrics like latency and token usage at each step, as well as various quality metrics, helping you identify bottlenecks, monitor efficiency, and find optimization opportunities.
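To capture these metrics, instrument your application with MLflow Tracing. The following is a minimal sketch assuming an OpenAI-based application; autologging records each LLM call, and the mlflow.trace decorator groups the steps of your own code into spans.
import openai
import mlflow

# Automatically trace OpenAI calls, including latency and token usage
mlflow.openai.autolog()

client = openai.OpenAI()

@mlflow.trace  # record this function as a span in the trace
def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content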
Create a High-Quality Dataset from Real World Traffic
Evaluating the performance of your GenAI application is crucial, but creating a reliable evaluation dataset is challenging.
Traces from production systems capture real user inputs along with precise details from internal components such as retrievers and tools, making them an excellent source for high-quality evaluation datasets.
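The following is a minimal sketch of that workflow, assuming traces have already been logged to experiment "0" and that merge_records accepts the trace records returned by mlflow.search_traces.
import mlflow
from mlflow.genai.datasets import create_dataset

# Pull recent production traces from the experiment (filtering criteria are illustrative)
traces = mlflow.search_traces(experiment_ids=["0"], max_results=100)

# Create a dataset and merge the traces as evaluation records;
# ground truth expectations can be added later with merge_records
dataset = create_dataset(name="prod_traffic_eval", experiment_id=["0"])
dataset.merge_records(traces)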
Running an Evaluation
Each evaluation is defined by three components:
Component | Description
---|---
Dataset | Inputs and expectations (and optionally pre-generated outputs and traces)
Scorer | Evaluation criteria
Predict Function | Generates outputs for the dataset
The following example shows a simple evaluation of a dataset of questions and expected answers.
import os

import openai
import mlflow
from mlflow.genai.scorers import Correctness, Guidelines

client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# 1. Define a simple QA dataset
dataset = [
    {
        "inputs": {"question": "Can MLflow manage prompts?"},
        "expectations": {"expected_response": "Yes!"},
    },
    {
        "inputs": {"question": "Can MLflow create a taco for my lunch?"},
        "expectations": {
            "expected_response": "No, unfortunately, MLflow is not a taco maker."
        },
    },
]

# 2. Define a prediction function to generate responses
def predict_fn(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

# 3. Run the evaluation
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=predict_fn,
    scorers=[
        # Built-in LLM judge
        Correctness(),
        # Custom criteria using an LLM judge
        Guidelines(name="is_english", guidelines="The answer must be in English"),
    ],
)
Review the Results
Open the MLflow UI to review the evaluation results. If you are using OSS MLflow, you can use the following command to start the UI:
mlflow ui --port 5000
If you are using a cloud-based MLflow service, open the experiment page in the platform. You should see a new evaluation run created under the "Runs" tab. Click the run name to view the evaluation results.
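The results can also be inspected programmatically. As a sketch (attribute names may differ across MLflow versions), the object returned by mlflow.genai.evaluate exposes the evaluation run ID and the aggregate scorer metrics:
# Inspect evaluation results programmatically (attribute names may vary by version)
print(results.run_id)   # ID of the evaluation run logged to the experiment
print(results.metrics)  # aggregate scores computed by the scorers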
