Trace Analysis with Tools
Agent-as-a-Judge uses MCP (Model Context Protocol) tools to investigate traces. These tools enable the judge to act like an experienced debugger, systematically exploring your application's execution.
Available Tools for Judges
When a judge receives a trace, it gains access to these tools:
GetTraceInfo
Retrieves high-level information about a trace including timing, status, and metadata.
ListSpans
Lists all spans in a trace with their hierarchy, timing, and basic attributes.
GetSpan
Fetches detailed information about a specific span including inputs, outputs, and custom attributes.
SearchTraceRegex
Searches for patterns across all span data using regular expressions.
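To see roughly what these tools surface, you can inspect a logged trace locally. The sketch below assumes a trace has already been recorded, and that the Trace.data.spans accessor and span timing fields behave as in recent MLflow versions:

import mlflow

# Assumes a trace was just produced by an @mlflow.trace-decorated function.
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())

# Roughly what ListSpans would show the judge: span names, IDs, and durations.
for span in trace.data.spans:
    duration_ms = (span.end_time_ns - span.start_time_ns) / 1e6
    print(span.name, span.span_id, f"{duration_ms:.1f} ms")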
Common Analysis Patterns
Performance Analysis
from mlflow.genai.judges import make_judge
latency_judge = make_judge(
    name="latency_analyzer",
    instructions=(
        "Analyze the {{ trace }} for latency issues.\n\n"
        "Use the available tools to:\n"
        "1. List all spans and their durations\n"
        "2. Identify the slowest operations\n"
        "3. Check for sequential operations that could be parallelized\n"
        "4. Look for repeated similar operations\n\n"
        "Provide specific span IDs and timings in your analysis.\n"
        "Rate as: 'fast' (<1s total), 'acceptable' (1-3s), or 'slow' (>3s)"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
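As a quick usage sketch (the same trace-retrieval calls appear in the full example later on this page), the judge is invoked directly on a logged trace:

import mlflow

# Fetch the most recently logged trace and let the judge investigate it.
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())
feedback = latency_judge(trace=trace)

print(feedback.value)      # e.g. "slow"
print(feedback.rationale)  # span-by-span timing evidence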
Tool Usage Validation
tool_usage_judge = make_judge(
    name="tool_validator",
    instructions=(
        "Examine the {{ trace }} for proper tool usage.\n\n"
        "Check:\n"
        "1. Are the right tools being selected for each task?\n"
        "2. Is the tool calling sequence logical?\n"
        "3. Are tool outputs being properly utilized?\n"
        "4. Are there unnecessary tool calls?\n\n"
        "List specific issues with span IDs.\n"
        "Rate as: 'optimal', 'suboptimal', or 'incorrect'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
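For this judge to have something to validate, the trace needs spans that identify tool calls. One way to produce them, sketched with hypothetical tools and assuming the span_type argument of mlflow.trace available in recent MLflow versions, is to tag each tool function with SpanType.TOOL:

import mlflow
from mlflow.entities import SpanType

# Hypothetical tools; tagging them as TOOL spans makes the call sequence
# visible when the judge lists the spans in the trace.
@mlflow.trace(span_type=SpanType.TOOL)
def search_docs(query: str) -> list:
    return ["doc-1", "doc-2"]

@mlflow.trace(span_type=SpanType.TOOL)
def summarize(docs: list) -> str:
    return "summary of " + ", ".join(docs)

@mlflow.trace
def agent(question: str) -> str:
    return summarize(search_docs(question))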
Error Handling Assessment
error_handling_judge = make_judge(
    name="error_handler_checker",
    instructions=(
        "Analyze error handling in the {{ trace }}.\n\n"
        "Look for:\n"
        "1. Spans with error status or exceptions\n"
        "2. Retry attempts and their patterns\n"
        "3. Fallback mechanisms\n"
        "4. Error propagation and recovery\n\n"
        "Identify specific error scenarios and how they were handled.\n"
        "Rate as: 'robust', 'adequate', or 'fragile'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
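To give the judge retry and fallback evidence to assess, the application can record attempt counts and errors on its spans. A minimal sketch, with unreliable_fetch standing in as a hypothetical helper:

import time
import mlflow

@mlflow.trace
def fetch_with_retry(url, max_attempts=3):
    with mlflow.start_span("fetch") as span:
        for attempt in range(1, max_attempts + 1):
            try:
                result = unreliable_fetch(url)  # hypothetical flaky call
                span.set_attributes({"attempts": attempt})
                return result
            except Exception as exc:
                # Record the retry pattern so the judge can see it.
                span.set_attributes({"attempts": attempt, "last_error": str(exc)})
                time.sleep(0.1 * attempt)
        raise RuntimeError(f"fetch failed after {max_attempts} attempts")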
Example: Complete Trace Analysis
Here's how an Agent-as-a-Judge analyzes a complex multi-step workflow:
The following example shows the judge definition, how to invoke it, and the kind of output it produces.
comprehensive_judge = make_judge(
    name="comprehensive_analyzer",
    instructions=(
        "Perform a comprehensive analysis of the {{ trace }}.\n\n"
        "Investigation steps:\n"
        "1. Get trace overview with GetTraceInfo\n"
        "2. List all spans to understand the flow\n"
        "3. Identify critical path operations\n"
        "4. Check for errors or warnings\n"
        "5. Analyze data flow between components\n"
        "6. Verify business logic execution\n\n"
        "Provide:\n"
        "- Executive summary\n"
        "- Key findings with specific span references\n"
        "- Improvement recommendations\n"
        "- Overall quality rating (1-10)"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
import mlflow

# Your application with tracing
# (parse_user_input, retrieve_relevant_context, and generate_with_llm are
# placeholders for your own logic)
@mlflow.trace
def complex_workflow(user_query):
    with mlflow.start_span("parse_query") as parse_span:
        parsed = parse_user_input(user_query)
        parse_span.set_attributes({"query_type": parsed.type})

    with mlflow.start_span("fetch_context") as fetch_span:
        context = retrieve_relevant_context(parsed)
        fetch_span.set_attributes({"docs_retrieved": len(context)})

    with mlflow.start_span("generate_response") as gen_span:
        response = generate_with_llm(parsed, context)
        gen_span.set_attributes({"model": "gpt-4", "tokens": 500})

    return response

# Execute and evaluate
result = complex_workflow("How do I use MLflow?")
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)

# Judge analyzes the entire execution
evaluation = comprehensive_judge(trace=trace)
print(f"Quality Score: {evaluation.value}/10")
print(f"Analysis:\n{evaluation.rationale}")
Quality Score: 7/10
Analysis:
Executive Summary:
The workflow completed successfully in 2.3s with proper error handling. However, there are optimization opportunities in the context retrieval phase.
Key Findings:
1. Parse Query (span_id: abc123): Efficient parsing in 50ms ✓
2. Fetch Context (span_id: def456): Retrieved 15 documents in 1.8s - this is the bottleneck
- Sequential database queries could be parallelized
- No caching mechanism detected for repeated queries
3. Generate Response (span_id: ghi789): Clean LLM generation in 450ms ✓
Recommendations:
- Implement parallel fetching for context retrieval
- Add caching layer for frequently accessed documents
- Consider implementing streaming for faster perceived response time
Business Logic: Correctly followed the parse→retrieve→generate pattern
Error Handling: Proper try-catch blocks in all critical sections
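To apply the same analysis across many queries at once, the judge can typically be passed as a scorer to mlflow.genai.evaluate. A sketch, assuming the traces produced by predict_fn are handed to trace-based judges as in recent MLflow releases:

import mlflow

eval_data = [
    {"inputs": {"user_query": "How do I use MLflow?"}},
    {"inputs": {"user_query": "How do I log a model?"}},
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=complex_workflow,  # traced above, so each row yields a trace
    scorers=[comprehensive_judge],
)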
Best Practices
- Be Specific in Instructions: Tell the judge exactly what patterns to look for
- Request Evidence: Ask for specific span IDs and data to support conclusions
- Define Clear Criteria: Specify what constitutes "good" vs "bad" behavior
- Use Structured Output: Request ratings and categorized findings for easier processing
- Leverage Search: Use regex patterns to find specific issues across large traces (see the sketch below)
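A sketch of the last point: the instructions can name concrete regex patterns for the judge to search for across span data (the judge name and patterns here are illustrative):

from mlflow.genai.judges import make_judge

secret_leak_judge = make_judge(
    name="secret_leak_checker",
    instructions=(
        "Search the {{ trace }} for leaked credentials or PII.\n"
        "Use regex search for patterns such as 'sk-[A-Za-z0-9]+' (API keys)\n"
        "and '\\b\\d{3}-\\d{2}-\\d{4}\\b' (SSN-like numbers).\n"
        "Report each match with its span ID.\n"
        "Rate as: 'clean' or 'leaking'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)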
Advanced Techniques
Comparative Analysis
Compare traces to identify regressions or improvements. The judge below checks a single trace against optimal execution patterns; run it over a baseline trace and a candidate trace to compare the results (see the sketch after the definition):
comparison_judge = make_judge(
    name="trace_comparator",
    instructions=(
        "Compare the patterns in {{ trace }} against best practices.\n"
        "Identify deviations from optimal execution patterns.\n"
        "Suggest specific improvements with examples."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
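Running the judge over a baseline trace and a candidate trace, then reading the two results side by side, gives a simple regression check. The trace IDs below are hypothetical:

import mlflow

# Hypothetical trace IDs for a baseline run and a candidate run.
baseline = mlflow.get_trace("tr-baseline-id")
candidate = mlflow.get_trace("tr-candidate-id")

for label, trace in [("baseline", baseline), ("candidate", candidate)]:
    feedback = comparison_judge(trace=trace)
    print(label, feedback.value)
    print(feedback.rationale)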
Security Auditing
Check for security concerns in execution patterns:
security_judge = make_judge(
    name="security_auditor",
    instructions=(
        "Audit {{ trace }} for security concerns:\n"
        "- Check for sensitive data in logs\n"
        "- Verify proper authentication flows\n"
        "- Identify potential injection points\n"
        "- Validate input sanitization"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
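The audit can be run across recent traces in an experiment. A sketch using the tracing client, with the experiment ID illustrative and the search_traces signature and trace_id field name assumed from recent MLflow versions:

from mlflow import MlflowClient

client = MlflowClient()

# Fetch recent traces from an experiment (ID is illustrative).
traces = client.search_traces(experiment_ids=["1"], max_results=20)

for trace in traces:
    feedback = security_judge(trace=trace)
    print(trace.info.trace_id, feedback.value)
    print(feedback.rationale)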