LLM-based Scorers (LLM-as-a-Judge)
LLM-as-a-Judge is an evaluation approach that uses Large Language Models to assess the quality of AI-generated responses. LLM judges can evaluate subjective qualities such as helpfulness and safety, which are hard to capture with heuristic metrics, while remaining more scalable and cost-effective than human evaluation. See the Evaluation Approach Comparison for a more detailed comparison and for guides on using other approaches in MLflow.
Approaches for Using LLM Scorers
MLflow offers several approaches to using LLM-as-a-Judge, each with a different balance of simplicity and control. We recommend starting with the simplest approach and moving to more complex ones as needed.
Predefined LLM scorers
Simplicity: ⭐⭐⭐⭐⭐ Control: ⭐
- Best for: Quickly trying MLflow's LLM evaluation capabilities with a few lines of code.
- How it works: Select from built-in scorer classes such as Correctness and RetrievalGroundedness. MLflow constructs the judge's inputs using predefined prompt templates (see the example below).
Get started with predefined LLM scorers »
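As a minimal sketch, a predefined scorer can be passed to mlflow.genai.evaluate. The sample record below is made up for illustration, and the dataset field names (inputs, outputs, expectations) follow the MLflow GenAI evaluation dataset schema as described in the linked guide; check that guide for the exact requirements of each scorer.
# Minimal sketch: evaluate a static dataset with a predefined scorer.
# The example record is illustrative; replace it with your app's data.
import mlflow
from mlflow.genai.scorers import Correctness

data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing the ML lifecycle.",
        "expectations": {"expected_response": "MLflow is an open-source ML platform."},
    }
]

mlflow.genai.evaluate(data=data, scorers=[Correctness()])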
Guidelines-based scorers
Simplicity: ⭐⭐⭐⭐ Control: ⭐⭐⭐⭐
- Best for: Evaluations based on a clear set of specific, natural language criteria, framed as pass/fail conditions. Ideal for checking compliance with rules, style guides, or information inclusion/exclusion.
- How it works: You provide a set of plain-language rules that refer to specific inputs or outputs of your app, for example, "The response must be polite." An LLM then determines whether each guideline passes or fails and provides a rationale (see the sketch below).
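The snippet below is a sketch assuming the built-in Guidelines scorer from mlflow.genai.scorers; the guideline text and scorer name are illustrative, so check the guidelines-based scorer guide for the exact arguments in your MLflow version.
# Sketch of a guidelines-based scorer; the guideline text is illustrative.
from mlflow.genai.scorers import Guidelines

politeness = Guidelines(
    name="politeness",
    guidelines="The response must be polite and must not use offensive language.",
)

# The scorer can then be passed to mlflow.genai.evaluate(..., scorers=[politeness]).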
Bring Your Own Prompt
Simplicity: ⭐⭐ Control: ⭐⭐⭐⭐⭐
- Best for: Complex, nuanced evaluations where you need full control over the scorer's prompt or need the scorer to choose between multiple output values, such as "great", "ok", and "bad".
- How it works: You provide a prompt template that defines your evaluation criteria and has placeholders for specific fields in your app's trace. You define the output choices the scorer can select. An LLM then selects the appropriate output choice and provides a rationale for its selection (see the sketch below).
Get started with bringing your own prompt »
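The sketch below assumes a custom_prompt_judge helper in mlflow.genai.judges that builds a scorer from a prompt template with bracketed output choices; treat the helper name, placeholder syntax, and choice markers as assumptions and refer to the guide linked above for the authoritative API.
# Hypothetical sketch of a prompt-based judge with multiple output choices.
# The helper name, template placeholders, and choice syntax are assumptions;
# consult the "bring your own prompt" guide for the exact API.
from mlflow.genai.judges import custom_prompt_judge

quality_judge = custom_prompt_judge(
    name="answer_quality",
    prompt_template=(
        "Rate the assistant's answer to the question.\n\n"
        "Question: {{request}}\n"
        "Answer: {{response}}\n\n"
        "Choose one:\n"
        "[[great]]: fully correct and complete.\n"
        "[[ok]]: mostly correct with minor gaps.\n"
        "[[bad]]: incorrect or unhelpful."
    ),
    # Optional mapping of choices to numbers so results can be aggregated.
    numeric_values={"great": 1.0, "ok": 0.5, "bad": 0.0},
)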
Selecting Judge Models
By default, MLflow uses OpenAI's GPT-4o-mini model as the judge. You can change the judge model by passing an override to the model argument within the scorer definition. The model must be specified in the format <provider>:/<model-name>.
from mlflow.genai.scorers import Correctness
Correctness(model="openai:/gpt-4o-mini")
Correctness(model="anthropic:/claude-4-opus")
Correctness(model="google:/gemini-2.0-flash")
Supported Models
MLflow supports all major LLM providers:
- OpenAI / Azure OpenAI
- Anthropic
- Amazon Bedrock
- Cohere
- Together AI
- Any other providers supported by LiteLLM, such as Google Gemini, xAI, Mistral, and more.
To use LiteLLM-integrated models, install LiteLLM by running pip install litellm, then specify the provider and model name in the same format as natively supported providers, e.g., gemini:/gemini-2.0-flash.
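For example, once LiteLLM is installed, a Gemini-backed judge can be configured just like a natively supported provider (model name taken from the example above):
# Route the judge through a LiteLLM-supported provider.
# Requires: pip install litellm
from mlflow.genai.scorers import Correctness

Correctness(model="gemini:/gemini-2.0-flash")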
When running in Databricks, the default model is set to Databricks' research-backed LLM judges.