Evaluating Agent Behavior Patterns
Agent-as-a-Judge evaluation excels at assessing complex agent behaviors that cannot be judged from inputs and outputs alone. Because these judges analyze the full execution trace, they can examine reasoning chains, tool selection strategies, and decision-making processes.
Key Agent Behaviors to Evaluate
Reasoning Quality
Assess logical progression, assumption validity, and conclusion soundness in multi-step reasoning.
Tool Selection
Evaluate whether agents choose appropriate tools for each task and invoke them in an efficient sequence.
Error Recovery
Check how agents handle failures, implement retry logic, and fall back to alternatives (a judge sketch for this behavior appears under Common Agent Evaluation Patterns below).
Efficiency Patterns
Identify redundant operations, unnecessary loops, and opportunities for optimization.
Common Agent Evaluation Patterns
Reasoning Chain Validation
from mlflow.genai.judges import make_judge

reasoning_judge = make_judge(
    name="reasoning_validator",
    instructions=(
        "Evaluate the reasoning chain in {{ trace }}.\n\n"
        "Analysis criteria:\n"
        "1. Logical Progression: Does each step follow logically from the previous?\n"
        "2. Assumption Validity: Are assumptions reasonable and stated?\n"
        "3. Evidence Usage: Is evidence properly cited and used?\n"
        "4. Conclusion Soundness: Does the conclusion follow from the premises?\n\n"
        "Identify specific reasoning flaws with span IDs.\n"
        "Score 1-100 for reasoning quality."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
Tool Usage Optimization
tool_optimization_judge = make_judge(
    name="tool_optimizer",
    instructions=(
        "Analyze tool usage patterns in {{ trace }}.\n\n"
        "Check for:\n"
        "1. Unnecessary tool calls (could be answered without tools)\n"
        "2. Wrong tool selection (better tool available)\n"
        "3. Inefficient sequencing (could parallelize or reorder)\n"
        "4. Missing tool usage (should have used a tool)\n\n"
        "Provide specific optimization suggestions.\n"
        "Rate efficiency as: 'optimal', 'good', 'suboptimal', or 'poor'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
Loop and Recursion Detection
loop_detector_judge = make_judge(
    name="loop_detector",
    instructions=(
        "Detect problematic loops in {{ trace }}.\n\n"
        "Identify:\n"
        "1. Infinite loop risks\n"
        "2. Unnecessary iterations\n"
        "3. Circular reasoning patterns\n"
        "4. Recursive calls without proper termination\n\n"
        "Report specific span patterns that indicate issues.\n"
        "Classify as: 'clean', 'warning', or 'critical'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
Practical Examples
RAG Agent Evaluation
Evaluate retrieval-augmented generation agents:
rag_judge = make_judge(
    name="rag_evaluator",
    instructions=(
        "Evaluate the RAG agent's behavior in {{ trace }}.\n\n"
        "Check for:\n"
        "1. Were the right documents retrieved?\n"
        "2. Is the response grounded in the retrieved context?\n"
        "3. Are sources properly cited?\n\n"
        "Rate as: 'good', 'acceptable', or 'poor'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
import mlflow


# Use with your RAG pipeline
@mlflow.trace
def rag_pipeline(query):
    docs = retrieve_documents(query)  # your retrieval logic
    response = generate_with_context(query, docs)  # your generation logic
    return response


result = rag_pipeline("What is MLflow?")

# Retrieve the trace produced by the call above and run the judge on it
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())
evaluation = rag_judge(trace=trace)
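When invoked directly like this, the judge returns an MLflow Feedback object; the snippet below is a small sketch of inspecting it, assuming the standard value and rationale fields on Feedback.
# Inspect the judge's assessment
print(evaluation.value)      # e.g. 'good', 'acceptable', or 'poor'
print(evaluation.rationale)  # the judge's explanation for its rating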
Using with mlflow.genai.evaluate()
Integrate Agent-as-a-Judge into your evaluation workflow:
import pandas as pd
import mlflow

# Simple evaluation dataset; each `inputs` dict is passed to predict_fn as keyword arguments
test_data = pd.DataFrame(
    [
        {"inputs": {"question": "What is MLflow?"}},
        {"inputs": {"question": "How do I track experiments?"}},
    ]
)


# Your application function
@mlflow.trace
def my_app(question):
    # Your application logic here
    return generate_response(question)


# Run evaluation with Agent-as-a-Judge
results = mlflow.genai.evaluate(
    data=test_data,
    predict_fn=my_app,
    scorers=[reasoning_judge, tool_optimization_judge],
)
Best Practices for Agent Evaluation
- Define Clear Behavioral Criteria: Specify exactly what constitutes good vs. bad agent behavior
- Use Multiple Judges: Different judges can focus on different aspects (reasoning, efficiency, safety)
- Request Actionable Feedback: Ask judges to provide specific improvement suggestions
- Track Over Time: Monitor how agent behavior changes with updates
- Combine with Field-Based Evaluation: Use both Agent-as-a-Judge and field-based judges for complete coverage, as sketched below
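To illustrate the last point, a field-based judge built with the same make_judge API can score only the inputs and outputs, and it can be passed to mlflow.genai.evaluate alongside the trace-based judges defined above. A minimal sketch; the judge name and instruction wording are illustrative.
# A field-based judge that sees only inputs/outputs, not the trace
answer_quality_judge = make_judge(
    name="answer_quality",
    instructions=(
        "Evaluate whether the response in {{ outputs }} correctly and completely "
        "answers the question in {{ inputs }}.\n\n"
        "Rate as: 'good', 'acceptable', or 'poor'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

# Combine trace-based and field-based judges in a single evaluation run
results = mlflow.genai.evaluate(
    data=test_data,
    predict_fn=my_app,
    scorers=[reasoning_judge, tool_optimization_judge, answer_quality_judge],
)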