Evaluating Agents
AI agents are an emerging pattern of GenAI applications that can use tools, make decisions, and execute multi-step workflows. Evaluating the performance of such complex agents is challenging, however. MLflow provides a powerful toolkit for evaluating agent behavior systematically and precisely using traces and scorers.

Workflow
1. Build your agent: Create an AI agent with tools, instructions, and capabilities for your specific use case.
2. Create an evaluation dataset: Design test cases with inputs and expectations for both outputs and agent behaviors such as tool usage.
3. Define agent-specific scorers: Create scorers that evaluate multi-step agent behaviors using traces.
4. Run the evaluation: Execute the evaluation and analyze both final outputs and intermediate agent behaviors in the MLflow UI.
Example: Evaluating a Tool-Calling Agent
Prerequisites
First, install the required packages by running the following command:
pip install --upgrade "mlflow>=3.3" openai openai-agents
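The example agent below calls the OpenAI API, so make sure your OpenAI API key is available in the environment before running it:

export OPENAI_API_KEY=<your-api-key>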
MLflow stores evaluation results in a tracking server. Connect your local environment to a tracking server using one of the following methods.
- Local (pip)
- Local (docker)
- Remote MLflow Server
- Databricks
For the fastest setup, you can install the mlflow Python package and run MLflow locally:
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000
This will start the server at port 5000 on your local machine. Connect your notebook/IDE to the server by setting the tracking URI:
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
You can also browse the MLflow UI at http://localhost:5000.
MLflow provides a Docker Compose file to start a local MLflow server with a PostgreSQL database and a MinIO server.
git clone https://github.com/mlflow/mlflow.git
cd mlflow/docker-compose
cp .env.dev.example .env
docker compose up -d
This will start the server at port 5000 on your local machine. Connect your notebook/IDE to the server by setting the tracking URI. You can also browse the MLflow UI at http://localhost:5000.
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
Refer to the instructions in the repository for more details, e.g., overriding the default environment variables.
If you have a remote MLflow tracking server, configure the connection:
import os
import mlflow
# Set your MLflow tracking URI
os.environ["MLFLOW_TRACKING_URI"] = "http://your-mlflow-server:5000"
# Or directly in code
mlflow.set_tracking_uri("http://your-mlflow-server:5000")
If you have a Databricks account, configure the connection:
import mlflow
mlflow.login()
This will prompt you for your configuration details (the Databricks host URL and a personal access token).
If you are unsure about how to set up an MLflow tracking server, you can start with the cloud-based MLflow powered by Databricks: Sign up for free →
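Whichever connection method you choose, it can also help to group the evaluation runs under a dedicated experiment. The experiment name below is only an example:

import mlflow

# Create (or switch to) an experiment to keep the evaluation runs organized.
# "agent-evaluation" is just an example name; on Databricks, use a workspace path instead.
mlflow.set_experiment("agent-evaluation")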
Step 1: Build an agent
Create a math agent that can use tools to answer questions. We use the OpenAI Agents SDK to build the tool-calling agent in a few lines of code.
from agents import Agent, Runner, function_tool
@function_tool
def add(a: float, b: float) -> float:
    """Adds two numbers."""
    return a + b

@function_tool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

@function_tool
def modular(a: int, b: int) -> int:
    """Modular arithmetic"""
    return a % b

agent = Agent(
    name="Math Agent",
    instructions=(
        "You will be given a math question. Calculate the answer using the given calculator tools. "
        "Return the final number only as an integer."
    ),
    tools=[add, multiply, modular],
)
Make sure you can run the agent locally.
from agents import Runner
result = await Runner.run(agent, "What is 15% of 240?")
print(result.final_output)
# 36
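Depending on your setup, you may also want to enable MLflow's automatic tracing for OpenAI so that the evaluation traces contain the individual tool-call spans used by the scorer in Step 3. In recent MLflow versions, OpenAI autologging (which also covers the OpenAI Agents SDK) is enabled with a single call; a minimal sketch:

import mlflow

# Capture agent runs as traces, including TOOL spans for each tool call.
mlflow.openai.autolog()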
Lastly, let's wrap it in a function that MLflow can call. Note that MLflow runs each prediction in a threadpool, so using a synchronous function does not slow down the evaluation.
# If you are using a Jupyter notebook, you may need to apply nest_asyncio:
# import nest_asyncio
# nest_asyncio.apply()

def predict_fn(task: str) -> str:
    return Runner.run_sync(agent, task).final_output
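As a quick sanity check, you can call the wrapper directly before running the full evaluation:

# Optional: verify the wrapper works end-to-end
print(predict_fn("What is 15% of 240?"))  # expected: 36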
Step 2: Create evaluation dataset
Design test cases as a list of dictionaries, each with an inputs field and an expectations field. We want to evaluate not only the correctness of the final output but also the tool calls made by the agent.
eval_dataset = [
    {
        "inputs": {"task": "What is 15% of 240?"},
        "expectations": {"answer": 36, "tool_calls": ["multiply"]},
    },
    {
        "inputs": {
            "task": "I have 8 cookies and 3 friends. How many more cookies should I buy to share equally?"
        },
        "expectations": {"answer": 1, "tool_calls": ["modular", "add"]},
    },
    {
        "inputs": {
            "task": "I bought 2 shares of stock at $100 each. It's now worth $150. How much profit did I make?"
        },
        "expectations": {"answer": 100, "tool_calls": ["add", "multiply"]},
    },
]
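Note that the GenAI evaluation API passes each inputs dictionary to your prediction function as keyword arguments, so the keys must match the parameter names of predict_fn (here, task). For the first test case, the call is equivalent to:

# How MLflow invokes the prediction function for the first test case
predict_fn(task="What is 15% of 240?")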
Step 3: Define agent-specific scorers
Create scorers that evaluate agent-specific behaviors.
MLflow scorers can accept the Trace from the agent execution as an argument. The trace is a powerful way to evaluate the agent's behavior precisely, not just its final output. For example, here we use the Trace.search_spans method to extract the sequence of tool calls and compare it with the expected tool calls.
For more details, see the Evaluate Traces guide.
from mlflow.entities import Feedback, SpanType, Trace
from mlflow.genai import scorer
@scorer
def exact_match(outputs, expectations) -> bool:
    return int(outputs) == expectations["answer"]

@scorer
def uses_correct_tools(trace: Trace, expectations: dict) -> Feedback:
    """Evaluate whether the agent called the expected tools in the expected order."""
    expected_tools = expectations["tool_calls"]

    # Parse the trace to get the actual tool calls
    tool_spans = trace.search_spans(span_type=SpanType.TOOL)
    tool_names = [span.name for span in tool_spans]

    score = "yes" if tool_names == expected_tools else "no"
    rationale = (
        "The agent used the correct tools."
        if tool_names == expected_tools
        else f"The agent used the incorrect tools: {tool_names}"
    )
    # Return a Feedback object with the score and rationale
    return Feedback(value=score, rationale=rationale)
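The same trace-based pattern extends to other agent behaviors. As a sketch (not part of the example above), here is a scorer that flags runs requiring an unusually large number of LLM calls; the threshold is arbitrary, and depending on the tracing integration the model calls may be recorded under a different span type (e.g., CHAT_MODEL):

@scorer
def efficient_execution(trace: Trace) -> Feedback:
    """Flag runs that needed an unusually large number of LLM calls."""
    llm_spans = trace.search_spans(span_type=SpanType.LLM)
    num_llm_calls = len(llm_spans)

    # Arbitrary threshold for illustration; tune it for your agent.
    passed = num_llm_calls <= 3
    return Feedback(
        value="yes" if passed else "no",
        rationale=f"The agent made {num_llm_calls} LLM call(s).",
    )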
Step 4: Run the evaluation
Now we are ready to run the evaluation!
results = mlflow.genai.evaluate(
    data=eval_dataset, predict_fn=predict_fn, scorers=[exact_match, uses_correct_tools]
)
Once the evaluation is done, open the MLflow UI in your browser and navigate to the experiment page. You should see that MLflow has created a new run and logged the evaluation results.

It seems the agent did not call the tools in the expected order for the stock-profit test case. Let's click on the row to open the trace and inspect what happened under the hood.

By looking at the trace, we can see that the agent computes the answer in three steps: (1) compute 100 * 2, (2) compute 150 * 2, (3) subtract the two results. A more efficient approach would be to (1) subtract 100 from 150 and (2) multiply the result by 2. In the next version, we can update the system instructions so the agent uses the tools more efficiently.
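For example, a revised agent definition might nudge the model toward planning the calculation and minimizing tool calls; the wording below is just one possible prompt:

improved_agent = Agent(
    name="Math Agent",
    instructions=(
        "You will be given a math question. Plan the calculation first, then compute the answer "
        "using the given calculator tools with as few tool calls as possible. "
        "Return the final number only as an integer."
    ),
    tools=[add, multiply, modular],
)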
Configure parallelization
Running a complex agent can take a long time. By default, MLflow uses a background thread pool to speed up the evaluation. You can configure the number of workers by setting the MLFLOW_GENAI_EVAL_MAX_WORKERS environment variable.
export MLFLOW_GENAI_EVAL_MAX_WORKERS=10
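If you prefer to configure this from Python (for example, in a notebook), setting the environment variable before calling mlflow.genai.evaluate() works as well:

import os

# Must be set before the evaluation starts.
os.environ["MLFLOW_GENAI_EVAL_MAX_WORKERS"] = "10"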
Evaluating MLflow Models
In MLflow 2.x, you can pass a model URI directly to the model argument of the legacy mlflow.evaluate() API (now deprecated). The new GenAI evaluation API in MLflow 3.x still supports evaluating MLflow Models, but the workflow is slightly different.
import mlflow
# Load the model **outside** the prediction function.
model = mlflow.pyfunc.load_model("models:/math_agent/1")
# Wrap the model in a function that MLflow can call.
def predict_fn(task: str) -> str:
    return model.predict(task)

# Run the evaluation as usual.
mlflow.genai.evaluate(
    data=eval_dataset, predict_fn=predict_fn, scorers=[exact_match, uses_correct_tools]
)
Next steps
Customize Scorers
Build advanced evaluation criteria and metrics specifically designed for agent behaviors and tool usage patterns.
Evaluate Production Traces
Analyze real agent executions in production environments to understand performance and identify improvement opportunities.
Collect User Feedback
Gather human feedback on agent performance to create training data and improve evaluation accuracy.