Evaluating a Multi-Turn Conversational Agent


Evaluate a customer support chat agent across full conversation sessions using MLflow's conversational scorers. Single-turn evaluation misses problems that only surface over multiple exchanges -- frustrated users, incomplete resolutions, and guideline drift.

Prerequisites
pip install mlflow openai

The agent maintains conversation history per session and responds to customer support inquiries about orders, returns, and account issues.

import mlflow
import openai

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("multi-turn-agent-eval")
mlflow.openai.autolog()

client = openai.OpenAI()

SYSTEM_PROMPT = (
    "You are a customer support agent for ShopFast, "
    "an online retail company. Follow these rules:\n"
    "1. Always greet the customer by name if provided.\n"
    "2. For order status questions, ask for the order "
    "number if not provided.\n"
    "3. Never promise specific refund timelines.\n"
    "4. Escalate to a human agent if the customer asks "
    "more than twice about the same unresolved issue.\n"
    "5. Always end with asking if there's anything else "
    "you can help with."
)

# Store conversation histories keyed by session_id
conversation_histories: dict[str, list[dict]] = {}


@mlflow.trace
def chat(message: str, session_id: str) -> str:
    # Tag the trace with the session ID so MLflow groups turns into sessions
    mlflow.update_current_trace(
        metadata={"mlflow.trace.session": session_id}
    )

    if session_id not in conversation_histories:
        conversation_histories[session_id] = [
            {"role": "system", "content": SYSTEM_PROMPT}
        ]

    conversation_histories[session_id].append(
        {"role": "user", "content": message}
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation_histories[session_id],
    )

    assistant_message = response.choices[0].message.content
    conversation_histories[session_id].append(
        {"role": "assistant", "content": assistant_message}
    )

    return assistant_message

Test the agent with a single message to verify tracing works:

reply = chat(
    "Hi, I'm Sarah. Where is my order?",
    session_id="test-session",
)
print(reply)
# Check the MLflow UI at http://127.0.0.1:5000 --
# you should see a trace with the OpenAI call.

Define realistic customer support conversations. Each conversation is a sequence of messages sharing the same session_id. The conversations cover a range of outcomes: a smooth resolution, a frustrated customer, and a case where the agent fails to address all questions.

conversations = [
    {
        "session_id": "session-order-status",
        "turns": [
            "Hi, my name is Sarah. Can you check on my order?",
            "It's order number 98765.",
            "Great, thanks! Can I also change the shipping "
            "address?",
            "The new address is 123 Oak St, Portland, OR.",
            "That's all, thank you!",
        ],
    },
    {
        "session_id": "session-frustrated-customer",
        "turns": [
            "I've been waiting 3 weeks for my refund! "
            "Order 44312.",
            "You said that last time. When exactly will I "
            "get my money back?",
            "This is ridiculous. I want to speak to a "
            "manager right now.",
            "I'm not going to wait any longer. Fix this "
            "or I'm filing a chargeback.",
        ],
    },
    {
        "session_id": "session-incomplete",
        "turns": [
            "Hey, I need help with two things. First, "
            "where's order 77210?",
            "Ok. Second, I want to return the shoes from "
            "order 77210. They don't fit.",
            "What's the return window? And do I get free "
            "return shipping?",
        ],
    },
]

Each call to chat() produces a traced turn. Because mlflow.openai.autolog() is enabled, every OpenAI call is automatically captured. The session_id metadata groups turns into sessions.

# Clear any prior test history
conversation_histories.clear()

for convo in conversations:
    session_id = convo["session_id"]
    for turn in convo["turns"]:
        reply = chat(turn, session_id=session_id)
        print(f"[{session_id}] User: {turn[:50]}...")
        print(f"[{session_id}] Agent: {reply[:80]}...")
        print()

Retrieve the traces and run three session-level scorers:

  • ConversationCompleteness -- did the agent address all user requests by the end? Returns "yes" or "no".
  • ConversationalGuidelines -- did the agent follow the support rules across the full conversation? Returns "yes" or "no".
  • UserFrustration -- did the user show frustration, and was it resolved? Returns "none", "resolved", or "unresolved".

from mlflow.genai.scorers import (
    ConversationCompleteness,
    ConversationalGuidelines,
    UserFrustration,
)

traces = mlflow.search_traces(
    experiment_ids=[
        mlflow.get_experiment_by_name(
            "multi-turn-agent-eval"
        ).experiment_id
    ],
    return_type="list",
)

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        ConversationCompleteness(),
        ConversationalGuidelines(
            guidelines=[
                "Always greet the customer by name if "
                "provided",
                "Ask for the order number if not provided",
                "Never promise specific refund timelines",
                "End each response by asking if there's "
                "anything else to help with",
            ],
        ),
        UserFrustration(),
    ],
)
print(results.metrics)
# Example output:
# {
#     'conversation_completeness/mean': 0.67,
#     'conversational_guidelines/mean': 0.67,
#     'user_frustration/mean': ...,
# }

The per-session breakdown shows which conversations had problems:

df = results.result_df

score_cols = [c for c in df.columns if c.endswith("/value")]
rationale_cols = [
    c for c in df.columns if c.endswith("/rationale")
]

print(df[score_cols])
# conversation_completeness/value: "yes", "yes", "no"
# conversational_guidelines/value: "yes", "no", "yes"
# user_frustration/value: "none", "unresolved", "none"
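Pivoting the /value columns into a per-session grid makes the comparison easier to scan. This sketch uses a hand-built DataFrame mirroring the column layout above, since the exact result_df schema can vary by MLflow version:

```python
import pandas as pd

# Illustrative frame with the same /value column naming as result_df
scores = pd.DataFrame({
    "session": ["order-status", "frustrated-customer", "incomplete"],
    "conversation_completeness/value": ["yes", "yes", "no"],
    "conversational_guidelines/value": ["yes", "no", "yes"],
    "user_frustration/value": ["none", "unresolved", "none"],
}).set_index("session")

# Strip the "/value" suffix for cleaner column headers
scores.columns = [c.replace("/value", "") for c in scores.columns]
print(scores)
```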

Inspect rationales for failed scores to understand what went wrong in specific sessions:

for _, row in df.iterrows():
    for col in score_cols:
        val = row[col]
        scorer_name = col.replace("/value", "")
        rationale_col = f"{scorer_name}/rationale"

        is_failure = (
            val == "no"
            or val == "unresolved"
            or val is False
        )
        if is_failure and rationale_col in df.columns:
            print(f"Scorer: {scorer_name}")
            print(f"  Value: {val}")
            print(f"  Rationale: {row[rationale_col]}")
            print()
# Example output:
# Scorer: conversation_completeness
# Value: no
# Rationale: The user asked about the return window
# and free return shipping, but the agent did not
# fully address both questions.
#
# Scorer: user_frustration
# Value: unresolved
# Rationale: The user expressed escalating frustration
# about a delayed refund and the agent was unable
# to resolve the situation.

Open the MLflow UI at http://127.0.0.1:5000 and navigate to the multi-turn-agent-eval experiment. The evaluation run shows per-session scores with linked traces. Click any session to walk through the full conversation and see where the agent fell short.

Results Interpretation

Session               Completeness  Guidelines  Frustration
order-status          yes           yes         none
frustrated-customer   yes           no          unresolved
incomplete            no            yes         none

  • order-status: Clean conversation. All questions answered, guidelines followed, no frustration.
  • frustrated-customer: The agent likely promised a refund timeline (violating guideline 3) or failed to escalate to a human (violating guideline 4). The user's frustration was never resolved.
  • incomplete: The agent didn't fully address the return window and shipping cost questions. No frustration detected because the user didn't express any, but the conversation ended with open questions.

These results point to specific improvements: better escalation handling for angry customers and more thorough follow-through on multi-part questions.
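Escalation in particular can be enforced in code rather than left to the prompt. A minimal sketch, using a hypothetical should_escalate helper (not part of the agent above) that counts how often a session re-raises the same unresolved issue, mirroring rule 4 of the system prompt:

```python
from collections import Counter

def should_escalate(issue_mentions: list[str], max_repeats: int = 2) -> bool:
    """Return True when any single issue has been raised more than
    max_repeats times, matching rule 4 of the system prompt."""
    counts = Counter(issue_mentions)
    return any(n > max_repeats for n in counts.values())

# The frustrated-customer session raised the refund issue four times
session_issues = ["refund-44312"] * 4
print(should_escalate(session_issues))  # True
```

A check like this could run before each chat() call and route the session to a human queue instead of the model when it fires.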

Next Steps

  • Evaluate Conversations -- Full reference for multi-turn evaluation with session tracing and conversation simulation
  • Built-in Scorers Reference -- All available conversational scorers including KnowledgeRetention, ConversationalSafety, and ConversationalRoleAdherence
  • Custom LLM Judges -- Build domain-specific judges for your use case