LLM-based Scorers (LLM-as-a-Judge)

LLM-as-a-Judge is an evaluation approach that uses Large Language Models to assess the quality of AI-generated responses. LLM judges can evaluate subjective qualities, such as helpfulness and safety, that are hard to measure with heuristic metrics. At the same time, LLM-as-a-Judge scorers are more scalable and cost-effective than human evaluation.

Try the Judge Builder UI

The fastest way to create LLM judges is through the Judge Builder UI - no code required. Navigate to your experiment's Judges tab to create and test judges visually. See the Template-based Scorers page for details. Requires MLflow >= 3.9.0.

Approaches for Creating LLM Scorers

MLflow offers several approaches to using LLM-as-a-Judge, each with a different balance of simplicity and control. See the detailed guide for each approach.
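
For instance, a built-in LLM scorer such as Correctness can be passed to mlflow.genai.evaluate. The sketch below assumes a tiny inline dataset; the record fields (inputs, outputs, expectations) follow mlflow.genai conventions, and the values are purely illustrative.

python
import mlflow
from mlflow.genai.scorers import Correctness

# Hypothetical evaluation data: each record carries inputs, outputs, and expectations
data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing the ML lifecycle.",
        "expectations": {"expected_response": "MLflow is an open-source MLOps platform."},
    }
]

# Correctness is an LLM-as-a-Judge scorer; it uses the default judge model unless overridden
mlflow.genai.evaluate(data=data, scorers=[Correctness()])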

Selecting Judge Models

By default, MLflow uses OpenAI's GPT-4o-mini as the judge model. You can change the judge model by passing an override to the model argument in the scorer definition. The model must be specified in the format <provider>:/<model-name>.

python
from mlflow.genai.scorers import Correctness

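# Override the judge model in the <provider>:/<model-name> format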
Correctness(model="openai:/gpt-4o-mini")
Correctness(model="anthropic:/claude-4-opus")
Correctness(model="google:/gemini-2.0-flash")

Supported Models

AI Gateway Endpoints

AI Gateway endpoints are the recommended way to configure judge models, especially when creating judges from the UI. Benefits include:

  • Run judges directly from the UI - Test and execute judges without leaving the browser
  • Centralized API key management - No need to configure API keys locally
  • Traffic routing and fallbacks - Configure load balancing and provider fallbacks

To use AI Gateway endpoints, select the endpoint from the UI dropdown or specify the endpoint name from the SDK with the gateway:/ prefix, e.g., gateway:/my-chat-endpoint.
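
For example, the scorer override shown earlier works the same way with a gateway endpoint (the endpoint name below is a placeholder):

python
from mlflow.genai.scorers import Correctness

# "my-chat-endpoint" is a placeholder AI Gateway endpoint name
Correctness(model="gateway:/my-chat-endpoint")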

Direct Model Providers

MLflow also supports calling model providers directly:

  • OpenAI / Azure OpenAI
  • Anthropic
  • Amazon Bedrock
  • Cohere
  • Together AI
  • Any other providers supported by LiteLLM, such as Google Gemini, xAI, Mistral, and more.

warning

Judges configured with direct model providers require API keys to be set locally (e.g., OPENAI_API_KEY) and cannot be run from the UI. Use AI Gateway endpoints if you want to run the judges from the UI.

To use LiteLLM-integrated models, install LiteLLM by running pip install litellm, then specify the provider and model name in the same format as natively supported providers, e.g., gemini:/gemini-2.0-flash.
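
A minimal sketch (the API key environment variable depends on the provider; GEMINI_API_KEY is assumed here for Google Gemini via LiteLLM):

python
from mlflow.genai.scorers import Correctness

# Requires: pip install litellm, plus the provider's API key set locally
# (assumed here to be GEMINI_API_KEY for the gemini provider)
Correctness(model="gemini:/gemini-2.0-flash")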

info

In Databricks, the judge model defaults to Databricks' research-backed LLM judges.

Choosing the Right LLM for Your Judge

The choice of LLM model significantly impacts judge performance and cost. Here's guidance based on your development stage and use case:

Early Development Stage (Inner Loop)

  • Recommended: Start with powerful models like GPT-4o or Claude Opus
  • Why: When you're beginning your agent development journey, you typically lack:
    • Use-case-specific grading criteria
    • Labeled data for optimization
  • Benefits: More intelligent models can deeply explore traces, identify patterns, and help you understand common issues in your system
  • Trade-off: Higher cost, but lower evaluation volume during development makes this acceptable

Production & Scaling Stage

  • Recommended: Transition to smaller models (GPT-4o-mini, Claude Haiku) with smarter optimizers
  • Why: As you move toward production:
    • You've collected labeled data and established grading criteria
    • Cost becomes a critical factor at scale
    • You can align smaller judges using more powerful optimizers
  • Approach: Use a smaller judge model paired with a powerful optimizer model (e.g., a GPT-4o-mini judge aligned using a Claude Opus optimizer), as sketched below
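
A minimal sketch of moving between stages, using only the model override shown earlier (the judge-optimizer pairing is conceptual; alignment APIs vary by MLflow version):

python
from mlflow.genai.scorers import Correctness

# Early development: a powerful judge for exploratory, low-volume evaluation
dev_judge = Correctness(model="anthropic:/claude-4-opus")

# Production: a smaller, cheaper judge for high-volume evaluation
prod_judge = Correctness(model="openai:/gpt-4o-mini")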