Evaluating Agents
AI agents are an emerging pattern of GenAI applications that can use tools, make decisions, and execute multi-step workflows. Evaluating the performance of such complex agents is challenging, however. MLflow provides a powerful toolkit for evaluating agent behavior systematically and precisely using traces and scorers.

Workflow
1. Build your agent: Create an AI agent with tools, instructions, and capabilities for your specific use case.
2. Create an evaluation dataset: Design test cases with inputs and expectations for both outputs and agent behaviors such as tool usage.
3. Define agent-specific scorers: Create scorers that evaluate multi-step agent behaviors using traces.
4. Run the evaluation: Execute the evaluation and analyze both final outputs and intermediate agent behaviors in the MLflow UI.
Example: Evaluating a Tool-Calling Agent
Prerequisites
First, install the required packages by running the following command:
pip install --upgrade "mlflow>=3.3" openai openai-agents
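The example agent below calls the OpenAI API, so make sure your OpenAI API key is available in the environment before running it:

export OPENAI_API_KEY=<your-api-key>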
MLflow stores evaluation results in a tracking server. Connect your local environment to a tracking server using one of the following methods.
- Local (pip)
- Local (docker)
- Remote MLflow Server
- Databricks
For the fastest setup, you can install the mlflow Python package and run MLflow locally:
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000
This will start the server at port 5000 on your local machine. Connect your notebook/IDE to the server by setting the tracking URI:
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
You can also browse the MLflow UI at http://localhost:5000.
MLflow provides a Docker Compose file to start a local MLflow server with a PostgreSQL database and a MinIO server.
git clone https://github.com/mlflow/mlflow.git
cd mlflow/docker-compose
cp .env.dev.example .env
docker compose up -d
This will start the server at port 5000 on your local machine. Connect your notebook/IDE to the server by setting the tracking URI. You can also browse the MLflow UI at http://localhost:5000.
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
Refer to the instructions in the repository for more details, e.g., overriding the default environment variables.
If you have a remote MLflow tracking server, configure the connection:
import os
import mlflow
# Set your MLflow tracking URI
os.environ["MLFLOW_TRACKING_URI"] = "http://your-mlflow-server:5000"
# Or directly in code
mlflow.set_tracking_uri("http://your-mlflow-server:5000")
If you have a Databricks account, configure the connection:
import mlflow
mlflow.login()
This will prompt you for your configuration details (the Databricks host URL and a personal access token).
If you are unsure about how to set up an MLflow tracking server, you can start with the cloud-based MLflow powered by Databricks: Sign up for free →
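Whichever connection method you choose, it can also help to group the evaluation runs under a dedicated experiment. The experiment name below is only an example:

import mlflow

# Create (or switch to) an experiment to keep the evaluation runs organized.
# "agent-evaluation" is just an example name; on Databricks, use a workspace path instead.
mlflow.set_experiment("agent-evaluation")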
Step 1: Build an agent
Create a math agent that can use tools to answer questions. We use the OpenAI Agents SDK to build the tool-calling agent in a few lines of code.
from agents import Agent, Runner, function_tool
@function_tool
def add(a: float, b: float) -> float:
    """Adds two numbers."""
    return a + b

@function_tool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

@function_tool
def modular(a: int, b: int) -> int:
    """Modular arithmetic"""
    return a % b

agent = Agent(
    name="Math Agent",
    instructions=(
        "You will be given a math question. Calculate the answer using the given calculator tools. "
        "Return the final number only as an integer."
    ),
    tools=[add, multiply, modular],
)
Make sure you can run the agent locally.
from agents import Runner
result = await Runner.run(agent, "What is 15% of 240?")
print(result.final_output)
# 36
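Depending on your setup, you may also want to enable MLflow's automatic tracing for OpenAI so that the evaluation traces contain the individual tool-call spans used by the scorer in Step 3. In recent MLflow versions, OpenAI autologging (which also covers the OpenAI Agents SDK) is enabled with a single call; a minimal sketch:

import mlflow

# Capture agent runs as traces, including TOOL spans for each tool call.
mlflow.openai.autolog()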
Lastly, let's wrap it in a function that MLflow can call. Note that MLflow runs each prediction in a threadpool, so using a synchronous function does not slow down the evaluation.
# If you are using a Jupyter notebook, you may need to apply nest_asyncio:
# import nest_asyncio
# nest_asyncio.apply()

def predict_fn(task: str) -> str:
    return Runner.run_sync(agent, task).final_output
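As a quick sanity check, you can call the wrapper directly before running the full evaluation:

# Optional: verify the wrapper works end-to-end
print(predict_fn("What is 15% of 240?"))  # expected: 36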
Step 2: Create evaluation dataset
Design test cases as a list of dictionaries, each with an inputs field and an expectations field. We want to evaluate not only the correctness of the final output but also the tool calls made by the agent.
eval_dataset = [
    {
        "inputs": {"task": "What is 15% of 240?"},
        "expectations": {"answer": 36, "tool_calls": ["multiply"]},
    },
    {
        "inputs": {
            "task": "I have 8 cookies and 3 friends. How many more cookies should I buy to share equally?"
        },
        "expectations": {"answer": 1, "tool_calls": ["modular", "add"]},
    },
    {
        "inputs": {
            "task": "I bought 2 shares of stock at $100 each. It's now worth $150. How much profit did I make?"
        },
        "expectations": {"answer": 100, "tool_calls": ["add", "multiply"]},
    },
]
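Note that the GenAI evaluation API passes each inputs dictionary to your prediction function as keyword arguments, so the keys must match the parameter names of predict_fn (here, task). For the first test case, the call is equivalent to:

# How MLflow invokes the prediction function for the first test case
predict_fn(task="What is 15% of 240?")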
Step 3: Define agent-specific scorers
Create scorers that evaluate agent-specific behaviors.
MLflow scorers can accept the Trace from the agent execution as an argument. The trace is a powerful way to evaluate the agent's behavior precisely, not just its final output. For example, here we use the Trace.search_spans method to extract the sequence of tool calls and compare it with the expected tool calls.
For more details, see the Evaluate Traces guide.
from mlflow.entities import Feedback, SpanType, Trace
from mlflow.genai import scorer
@scorer
def exact_match(outputs, expectations) -> bool:
    return int(outputs) == expectations["answer"]

@scorer
def uses_correct_tools(trace: Trace, expectations: dict) -> Feedback:
    """Evaluate whether the agent called the expected tools in the expected order."""
    expected_tools = expectations["tool_calls"]

    # Parse the trace to get the actual tool calls
    tool_spans = trace.search_spans(span_type=SpanType.TOOL)
    tool_names = [span.name for span in tool_spans]

    score = "yes" if tool_names == expected_tools else "no"
    rationale = (
        "The agent used the correct tools."
        if tool_names == expected_tools
        else f"The agent used the incorrect tools: {tool_names}"
    )
    # Return a Feedback object with the score and rationale
    return Feedback(value=score, rationale=rationale)
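The same trace-based pattern extends to other agent behaviors. As a sketch (not part of the example above), here is a scorer that flags runs requiring an unusually large number of LLM calls; the threshold is arbitrary, and depending on the tracing integration the model calls may be recorded under a different span type (e.g., CHAT_MODEL):

@scorer
def efficient_execution(trace: Trace) -> Feedback:
    """Flag runs that needed an unusually large number of LLM calls."""
    llm_spans = trace.search_spans(span_type=SpanType.LLM)
    num_llm_calls = len(llm_spans)

    # Arbitrary threshold for illustration; tune it for your agent.
    passed = num_llm_calls <= 3
    return Feedback(
        value="yes" if passed else "no",
        rationale=f"The agent made {num_llm_calls} LLM call(s).",
    )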
Step 4: Run the evaluation
Now we are ready to run the evaluation!
results = mlflow.genai.evaluate(
    data=eval_dataset, predict_fn=predict_fn, scorers=[exact_match, uses_correct_tools]
)
Once the evaluation is done, open the MLflow UI in your browser and navigate to the experiment page. You should see that MLflow has created a new run and logged the evaluation results.

It seems the agent did not call the tools in the expected order for the stock-profit test case. Let's click on the row to open the trace and inspect what happened under the hood.

By looking at the trace, we can see that the agent computes the answer in three steps: (1) compute 100 * 2, (2) compute 150 * 2, (3) subtract the two results. A more efficient approach would be to (1) subtract 100 from 150 and (2) multiply the result by 2. In the next version, we can update the system instructions so the agent uses the tools more efficiently.
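For example, a revised agent definition might nudge the model toward planning the calculation and minimizing tool calls; the wording below is just one possible prompt:

improved_agent = Agent(
    name="Math Agent",
    instructions=(
        "You will be given a math question. Plan the calculation first, then compute the answer "
        "using the given calculator tools with as few tool calls as possible. "
        "Return the final number only as an integer."
    ),
    tools=[add, multiply, modular],
)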
Configure parallelization
Running a complex agent can take a long time. By default, MLflow uses a background thread pool to speed up the evaluation. You can configure the number of workers by setting the MLFLOW_GENAI_EVAL_MAX_WORKERS environment variable.
export MLFLOW_GENAI_EVAL_MAX_WORKERS=10
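If you prefer to configure this from Python (for example, in a notebook), setting the environment variable before calling mlflow.genai.evaluate() works as well:

import os

# Must be set before the evaluation starts.
os.environ["MLFLOW_GENAI_EVAL_MAX_WORKERS"] = "10"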
Evaluating MLflow Models
In MLflow 2.x, you can pass a model URI directly to the model argument of the legacy mlflow.evaluate() API (now deprecated). The new GenAI evaluation API in MLflow 3.x still supports evaluating MLflow Models, but the workflow is slightly different.
import mlflow
# Load the model **outside** the prediction function.
model = mlflow.pyfunc.load_model("models:/math_agent/1")
# Wrap the model in a function that MLflow can call.
def predict_fn(task: str) -> str:
    return model.predict(task)

# Run the evaluation as usual.
mlflow.genai.evaluate(
    data=eval_dataset, predict_fn=predict_fn, scorers=[exact_match, uses_correct_tools]
)
Next steps
Customize Scorers
Build advanced evaluation criteria and metrics specifically designed for agent behaviors and tool usage patterns.
Evaluate Production Traces
Analyze real agent executions in production environments to understand performance and identify improvement opportunities.
Collect User Feedback
Gather human feedback on agent performance to create training data and improve evaluation accuracy.