Template-based LLM Scorers

Template-based scorers let you create custom LLM judges using natural language instructions with template variables. You can create these judges using either the UI or the SDK.

Version Requirements
  • UI: The Judge Builder UI requires MLflow >= 3.9.0.
  • SDK: The make_judge API requires MLflow >= 3.4.0. For earlier versions, use the deprecated custom_prompt_judge instead.
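For reference, creating a judge with the SDK takes only a few lines. The sketch below is a minimal example; the judge name and model URI are illustrative assumptions, and the template variables it references are described under Template Format below.

```python
from mlflow.genai.judges import make_judge

# Minimal custom judge: the instructions reference the reserved
# template variables {{ inputs }} and {{ outputs }}.
relevance_judge = make_judge(
    name="relevance",  # illustrative name
    instructions=(
        "Evaluate whether the response in {{ outputs }} directly answers "
        "the question in {{ inputs }}. Answer 'yes' or 'no'."
    ),
    model="openai:/gpt-4o-mini",  # assumed model URI; see Supported Models
)
```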

The MLflow UI provides a visual Judge Builder that lets you create custom LLM judges without writing code.

  1. Install and start MLflow:
```bash
pip install 'mlflow[genai]'
mlflow server
```
  2. Navigate to your experiment and select the Judges tab, then click New LLM judge
Judges Tab
  3. Select scope: Choose what you want the judge to evaluate:

    • Traces: Evaluate individual traces for quality and correctness
    • Sessions: Evaluate entire multi-turn conversations for conversation quality and outcomes
  4. Configure the judge:

    • LLM judge: Select a built-in judge or "Custom judge" to create your own. Selecting a built-in judge pre-populates the instructions, which you can then modify to customize the evaluation criteria.
    • Name: A unique identifier for your judge
    • Instructions: Define your evaluation criteria using template variables. Use the Add variable button to insert variables into your prompt.
    • Output type: Select the return type
    • Model: Select an endpoint from the dropdown (recommended) or click "enter model manually" to use a model provider directly, without AI Gateway. Endpoints are configured through AI Gateway, which centralizes API key management; judges using direct model access require local API keys and cannot be run directly from the UI. See Supported Models for details.
Judge Builder Dialog
  5. Test your judge (optional): Click the trace selector dropdown and choose Select traces to pick specific traces, then click Run judge to preview the evaluation result
Test Judge Output
  6. Schedule automatic evaluation (optional):

    • Automatically evaluate future traces: Enable to run this judge on new traces automatically
    • Sample rate: Percentage of traces to evaluate (0-100%)
    • Filter string: Only evaluate traces matching this filter (see syntax); an example of previewing a filter with the SDK follows this list
  7. Click Create judge to save your new LLM judge
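As referenced in step 6, you can preview which traces a filter string would match before scheduling a judge. This is a sketch: the experiment name and filter fields (attributes.status, tags.environment) are assumptions, so adjust them to the fields present in your own traces.

```python
import mlflow

mlflow.set_experiment("my-genai-app")  # assumed experiment name

# Preview the traces a scheduled judge's filter string would match.
traces = mlflow.search_traces(
    filter_string="attributes.status = 'OK' AND tags.environment = 'production'",
    max_results=10,
)
print(traces)  # pandas DataFrame of matching traces
```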

Template Format

Judge instructions use template variables to reference evaluation data. These variables are automatically filled with your data at runtime. Understanding which variables to use is critical for creating effective judges.

  • inputs: The input data provided to your AI system. Contains questions, prompts, or any data your model processes.
  • outputs: The generated response from your AI system. The actual output that needs evaluation.
  • expectations: Ground truth or expected outcomes. Reference answers for comparison and accuracy assessment.
  • conversation: The conversation history between the user and assistant. Used for evaluating multi-turn conversations. Only compatible with the expectations variable.
  • trace: A special template variable that enables agent-as-a-judge evaluation; the judge has access to all parts of the trace.
Only Reserved Variables Allowed

You can only use the reserved template variables shown above (inputs, outputs, expectations, conversation, trace). Custom variables like {{ question }} will cause validation errors. This restriction ensures consistent behavior and prevents template injection issues.

Note on conversation variable: The {{ conversation }} template variable can be used together with {{ expectations }}, but it cannot be combined with {{ inputs }}, {{ outputs }}, or {{ trace }}. This is because the conversation history already provides the complete context, making individual turn data redundant.
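The sketch below shows how these reserved variables are combined in practice: a correctness judge that compares {{ outputs }} against {{ expectations }}, and an agent-as-a-judge that relies on {{ trace }} alone. The judge names and model URI are assumptions for illustration.

```python
from mlflow.genai.judges import make_judge

# Correctness judge: compares the generated answer with ground truth.
correctness_judge = make_judge(
    name="correctness",
    instructions=(
        "Compare the response in {{ outputs }} to the reference answer in "
        "{{ expectations }} for the question in {{ inputs }}. "
        "Answer 'correct' or 'incorrect'."
    ),
    model="openai:/gpt-4o-mini",  # assumed model URI
)

# Agent-as-a-judge: {{ trace }} lets the judge inspect the full trace,
# including intermediate spans such as tool calls.
tool_usage_judge = make_judge(
    name="tool_usage",
    instructions=(
        "Inspect {{ trace }} and determine whether the agent called "
        "appropriate tools to answer the user's request. Answer 'yes' or 'no'."
    ),
    model="openai:/gpt-4o-mini",  # assumed model URI
)
```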

Selecting Judge Models

MLflow supports two ways to configure judge models:

  • AI Gateway endpoints (Recommended for UI) - AI Gateway centralizes LLM access, which provides key benefits for evaluation workflows:

    • Run judges directly from the UI - Test and iterate on judges without configuring local API keys
    • Team collaboration - Share judges across your team without each member needing their own API keys
    • Centralized cost tracking - Monitor LLM usage across all judge executions in one place

    Select an endpoint from the UI dropdown or use the gateway:/ prefix in the SDK, e.g., gateway:/my-chat-endpoint.

  • Direct model providers - Call providers like OpenAI, Anthropic, Google, etc. directly. Requires API keys to be set locally via environment variables (e.g., OPENAI_API_KEY) and cannot be run from the UI.

See Supported Models for more details.
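In the SDK, the choice between the two comes down to the model argument passed to make_judge. The endpoint name and provider model below are assumptions for illustration.

```python
from mlflow.genai.judges import make_judge

instructions = "Is {{ outputs }} a helpful answer to {{ inputs }}? Answer 'yes' or 'no'."

# AI Gateway endpoint (recommended): no local API key required.
gateway_judge = make_judge(
    name="helpfulness_gateway",
    instructions=instructions,
    model="gateway:/my-chat-endpoint",  # assumed endpoint name
)

# Direct model provider: requires a local API key (e.g. OPENAI_API_KEY)
# and cannot be run from the UI.
direct_judge = make_judge(
    name="helpfulness_direct",
    instructions=instructions,
    model="openai:/gpt-4o-mini",  # assumed provider/model URI
)
```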

Specify Output Format

You can specify the type of value your judge returns. This ensures judge LLMs produce structured outputs, making results reliable and easy to use.

  • UI: Select from the Output type dropdown in the Judge Builder.
  • SDK: Use the feedback_value_type argument in make_judge. Supported types include bool, int, float, str, Literal for categorical outcomes, dict[str, <primitive>], and list[<primitive>].
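For example, a categorical outcome can be enforced with a Literal type. This is a sketch assuming an MLflow version that supports feedback_value_type; the categories and model URI are illustrative.

```python
from typing import Literal

from mlflow.genai.judges import make_judge

# Constrain the judge to one of three categorical labels.
tone_judge = make_judge(
    name="tone",
    instructions=(
        "Rate the tone of {{ outputs }} as 'professional', 'neutral', "
        "or 'unprofessional'."
    ),
    feedback_value_type=Literal["professional", "neutral", "unprofessional"],
    model="openai:/gpt-4o-mini",  # assumed model URI
)
```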

Versioning Scorers

Building reliable scorers requires iterative refinement. Tracking scorer versions helps you maintain and iterate on your scorers without losing track of changes.

Optimizing Instructions with Human Feedback

LLM judges can exhibit biases and errors, and relying on biased evaluations leads to incorrect decisions. Use Automatic Judge Alignment to optimize instructions from human feedback, powered by the state-of-the-art SIMBA algorithm from DSPy. For rapid iteration during development, the experimental MemAlign optimizer achieves competitive quality while aligning up to 100× faster and 10× cheaper, using just a handful of examples.

Next Steps