
Evaluate Conversations

Conversation evaluation enables you to assess entire conversation sessions rather than individual turns. This is essential for evaluating conversational AI systems where quality emerges over multiple interactions, such as user frustration patterns, conversation completeness, or overall dialogue coherence.

[Screenshot: Multi-turn evaluation results]
Experimental Feature

Multi-turn evaluation is experimental in MLflow 3.10.0. The API and behavior may change in future releases.

Two Approaches

MLflow supports two approaches for evaluating conversations:

  • Evaluate existing conversations that have already been traced.
  • Simulate new conversations against your agent during evaluation and score them as they are generated.

Use the first approach when you have:

  • Production conversation data to analyze
  • Pre-recorded test conversations
  • Conversations from a previous agent version
python
import os
import mlflow
from mlflow.genai.scorers import ConversationCompleteness, UserFrustration

os.environ["OPENAI_API_KEY"] = "your-api-key-here" # Replace with your API key

# Get existing traces grouped by session
traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],
    return_type="list",
)

# Evaluate the conversations
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[ConversationCompleteness(), UserFrustration()],
)

Overview

Traditional single-turn evaluation assesses each agent response independently. However, many important qualities can only be evaluated by examining the full conversation:

  • User Frustration: Did the user become frustrated? Was it resolved?
  • Conversation Completeness: Were all user questions answered by the end of the conversation?
  • Knowledge Retention: Does the agent remember information from earlier in the conversation?
  • Dialogue Coherence: Does the conversation flow naturally?

Multi-turn evaluation addresses these needs by grouping traces into conversation sessions and applying judges that analyze the entire conversation history.

Prerequisites

First, install the required packages by running the following command:

bash
pip install --upgrade 'mlflow[genai]>=3.10'

MLflow stores evaluation results in a tracking server. For local development, you can run a tracking server on your own machine as follows.

Install the Python package manager uv, which also provides the uvx command for running Python tools without installing them.

Start an MLflow server locally:

shell
uvx mlflow server
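
The following is a minimal sketch for pointing your Python session at that local server. It assumes the server is running on the default port 5000 (adjust the URI if you started it elsewhere), and the experiment name is only illustrative.

python
import mlflow

# Point MLflow at the local tracking server started above
# (default host/port of `mlflow server`: http://127.0.0.1:5000).
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Optionally select (or create) an experiment for your conversation traces
mlflow.set_experiment("conversation-evaluation")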

Finally, configure your LLM provider credentials. The built-in multi-turn scorers default to using OpenAI:

python
import os

os.environ["OPENAI_API_KEY"] = "your-api-key-here" # Replace with your API key

Evaluating Pre-Generated Conversations

Evaluate conversations that have already been traced. This is useful for evaluating production data or pre-recorded test conversations.

Step 1: Tag traces with session IDs

When building your agent, set session IDs on traces to group them into conversations:

python
import mlflow


@mlflow.trace
def my_chatbot(question, session_id):
    mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id})
    return generate_response(question)
[Screenshot: Sessions view in the MLflow UI]
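
For example, calling the function several times with the same session ID groups those turns into one conversation session (the session ID and questions below are illustrative):

python
# Two turns of the same conversation, grouped by a shared session ID
my_chatbot("How do I enable tracing for my agent?", session_id="session-123")
my_chatbot("Can I also log custom metadata on each trace?", session_id="session-123")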

Step 2: Retrieve and evaluate sessions

Get traces from your experiment and pass them to mlflow.genai.evaluate. MLflow automatically groups traces by session ID:

python
from mlflow.genai.scorers import ConversationCompleteness, UserFrustration

# Get all traces
traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],
    return_type="list",
)

# Evaluate all sessions - MLflow automatically groups by session ID
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        ConversationCompleteness(),
        UserFrustration(),
    ],
)

You can also retrieve complete sessions directly using mlflow.search_sessions:

python
import mlflow

# Get complete sessions (each session is a list of traces)
sessions = mlflow.search_sessions(
    locations=["<your-experiment-id>"],
    max_results=50,
)

# Flatten for evaluation
all_traces = [trace for session in sessions for trace in session]

results = mlflow.genai.evaluate(
    data=all_traces,
    scorers=[ConversationCompleteness(), UserFrustration()],
)

Simulating Conversations during Evaluation

Generate new conversations by simulating user interactions. This enables testing different agent versions with consistent goals and personas.

python
import mlflow
from mlflow.genai.simulators import ConversationSimulator
from mlflow.genai.scorers import ConversationCompleteness, Safety

# Define test scenarios
simulator = ConversationSimulator(
    test_cases=[
        {"goal": "Get help setting up experiment tracking"},
        {"goal": "Troubleshoot a model deployment error"},
        {
            "goal": "Learn about model versioning",
            "persona": "You are a beginner who needs detailed explanations",
        },
    ],
    max_turns=5,
)


# Your agent's predict function
def predict_fn(input: list[dict], **kwargs) -> str:
    # input is the conversation history
    response = your_agent.chat(input)
    return response


# Simulate conversations and evaluate
results = mlflow.genai.evaluate(
    data=simulator,
    predict_fn=predict_fn,
    scorers=[
        ConversationCompleteness(),
        Safety(),
    ],
)
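
If your agent is a thin wrapper around a chat-completions API, the predict function might look like the sketch below. It assumes the conversation history arrives as OpenAI-style role/content messages and calls the openai client directly; the model name is only an example, and your_agent above remains a placeholder for your own application.

python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def predict_fn(input: list[dict], **kwargs) -> str:
    # `input` is the running conversation history: [{"role": ..., "content": ...}, ...]
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # example model choice
        messages=input,
    )
    return completion.choices[0].message.content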

For complete documentation on conversation simulation, including test case definition, predict function interfaces, and configuration options, see the Conversation Simulation guide.

Multi-Turn Judges

Built-in Judges

MLflow provides built-in multi-turn judges including ConversationCompleteness, UserFrustration, KnowledgeRetention, and more. See the Built-in Judges page for the full list, usage examples, and API documentation.
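
For example, several multi-turn judges can be combined in a single evaluation run. The sketch below assumes KnowledgeRetention is imported from mlflow.genai.scorers alongside the other built-in scorers:

python
import mlflow
from mlflow.genai.scorers import (
    ConversationCompleteness,
    KnowledgeRetention,  # assumed to live alongside the other built-in scorers
    UserFrustration,
)

traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],
    return_type="list",
)

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[ConversationCompleteness(), KnowledgeRetention(), UserFrustration()],
)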

Custom Judges

You can create custom multi-turn judges using make_judge with the {{ conversation }} template variable:

python
from mlflow.genai.judges import make_judge
from typing import Literal

# Create a custom multi-turn judge
politeness_judge = make_judge(
name="conversation_politeness",
instructions=(
"Analyze the {{ conversation }} and determine if the agent maintains "
"a polite and professional tone throughout all interactions. "
"Rate as 'consistently_polite', 'mostly_polite', or 'impolite'."
),
feedback_value_type=Literal["consistently_polite", "mostly_polite", "impolite"],
model="openai:/gpt-5-mini",
)

# Use in evaluation
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[politeness_judge],
)
Conversation Template Variable

The {{ conversation }} variable injects the complete conversation history in a structured format.

In judge instructions, it can be combined only with {{ expectations }}; it cannot be used together with {{ inputs }}, {{ outputs }}, or {{ trace }}.
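
As a sketch, a judge that checks a conversation against stated expectations could combine the two variables as follows (the judge name and wording are illustrative):

python
from typing import Literal

from mlflow.genai.judges import make_judge

goal_completion_judge = make_judge(
    name="goal_completion",
    instructions=(
        "Review the {{ conversation }} and decide whether it satisfies the "
        "expected outcome described in {{ expectations }}. "
        "Answer 'met' or 'not_met'."
    ),
    feedback_value_type=Literal["met", "not_met"],
    model="openai:/gpt-5-mini",
)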

How Assessments Are Stored

Multi-turn assessments are stored on the first trace (chronologically) in each session. This design ensures:

  • Assessments remain stable even as new turns are added to a conversation
  • You can easily find conversation-level assessments by looking at session start traces
  • The Sessions UI can efficiently display conversation metrics

Assessments include metadata identifying them as conversation-level:

  • session_id: The session ID linking the assessment to the full conversation
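
As a rough sketch of how you might inspect these programmatically, the snippet below retrieves sessions and reads assessments from the first trace of each. It assumes traces within a session are ordered chronologically and that the trace info exposes its attached assessments; attribute names may differ across MLflow versions.

python
import mlflow

sessions = mlflow.search_sessions(locations=["<your-experiment-id>"])

for session in sessions:
    first_trace = session[0]  # assumption: chronological ordering within a session
    # Assumption: conversation-level assessments are attached to this trace's info.
    for assessment in first_trace.info.assessments:
        print(first_trace.info.trace_id, assessment.name)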

Working with Specific Sessions

To evaluate a specific session, use mlflow.search_traces with a filter string:

python
import mlflow
from mlflow.genai.scorers import ConversationCompleteness, UserFrustration

# Get traces for a specific session using filter
traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],
    filter_string="metadata.`mlflow.trace.session` = '<your-session-id>'",
    return_type="list",
)

# Evaluate the session
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[ConversationCompleteness(), UserFrustration()],
)

Next Steps