LLM-as-a-Judge for LLM and Agent Evaluation

LLM-as-a-judge evaluates the outputs of LLM applications and agents across quality dimensions like correctness, relevance, groundedness, safety, and helpfulness. Any model can act as a judge (GPT, Claude, Gemini, open-source models, and beyond). When a judge analyzes agent outputs and behavior, it produces a score and a written justification.

LLM-as-a-judge gives engineering teams automated quality assessment at production scale. Traditional metrics like BLEU and ROUGE measure token overlap but miss whether a response hallucinated or violated tone guidelines. Human reviewers catch these issues but can only evaluate a limited number of outputs per day. As LLM applications move from prototypes to production, judge-based evaluation becomes essential for maintaining quality and catching regressions.

MLflow's evaluation framework supports LLM-as-a-judge with built-in judges, custom judge creation, and automatic tracking of every evaluation run across model versions, prompt variants, and system changes. MLflow also supports aligning judges with human feedback, so that automated scores stay calibrated to your team's quality standards.

Why LLM-as-a-Judge Matters

Agents, LLM applications, and RAG systems introduce evaluation challenges that traditional testing can't address:

Human Evaluation Doesn't Scale

Problem: Manual review creates bottlenecks and inconsistency. Different reviewers score the same output differently, and coverage is always incomplete.

Solution: LLM judges apply the same criteria to every output, achieving over 80% agreement with human evaluators while removing reviewer-to-reviewer inconsistency.

Traditional Metrics Miss Nuance

Problem: BLEU and ROUGE measure token overlap but can't assess whether a response is actually helpful, appropriate, or safe for the user.

Solution: LLM judges understand context and intent. They evaluate the qualities users actually care about: accuracy, helpfulness, tone, and policy compliance.

Production Monitoring Is Critical

Problem: Quality regressions in production go undetected until users report them. Manual spot-checking can't keep up.

Solution: Run judges continuously against production traces, catching degradation within seconds. Set thresholds and alert when scores drop.
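The alerting side of this loop is ordinary code. A minimal sketch in plain Python, with the judge call stubbed out and the threshold and window size as illustrative values you would tune for your own traffic:

```python
from collections import deque

def make_score_monitor(threshold: float, window: int):
    """Track a rolling window of judge scores and flag degradation."""
    scores = deque(maxlen=window)

    def record(score: float) -> bool:
        """Record one judge score; return True when the rolling mean
        of a full window drops below the alert threshold."""
        scores.append(score)
        rolling_mean = sum(scores) / len(scores)
        return len(scores) == window and rolling_mean < threshold

    return record

# Example: alert when the mean of the last 5 scores falls below 3.5
record = make_score_monitor(threshold=3.5, window=5)
alerts = [record(s) for s in [4.0, 3.9, 3.8, 3.2, 3.0, 2.9]]
# The sixth score pushes the rolling mean under the threshold
```

In practice the scores would come from judges run against production traces, and the alert would feed whatever paging or dashboard system you already use.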

Iteration Requires Measurement

Problem: Without quantitative metrics, teams can't measure whether prompt changes or model upgrades improve quality. Progress is guesswork.

Solution: LLM judges provide quantitative scores, so a change shows up as a concrete delta (for example, baseline correctness of 3.2 versus 3.8 with the new prompt). MLflow tracks all runs for easy comparison.
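The comparison itself reduces to simple aggregation over per-example judge scores. A minimal sketch, with made-up score lists standing in for two evaluation runs:

```python
def mean_score(scores: list) -> float:
    """Average per-example judge scores into a run-level metric."""
    return sum(scores) / len(scores)

# Hypothetical correctness scores (1-5 scale) from two evaluation runs
baseline = [3, 4, 3, 3, 3]
new_prompt = [4, 4, 4, 3, 4]

improvement = mean_score(new_prompt) - mean_score(baseline)
# Baseline mean 3.2, new-prompt mean 3.8: an improvement of 0.6
```

MLflow computes and stores these aggregates for you when you run `mlflow.genai.evaluate`; the point is that once scores are numeric, prompt iteration becomes a measurable diff rather than guesswork.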

Common Use Cases for LLM-as-a-Judge

Every production LLM system needs evaluation. If the check is deterministic (exact match, regex, JSON schema validation), a code-based scorer is the right tool. But most of the things that matter in production (did the response hallucinate? did it follow your brand guidelines? was it actually helpful?) can only be assessed by something that understands language.
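For contrast, here is what a deterministic, code-based check looks like. This sketch uses only the standard library; the function name, the required `answer` field, and the email-leak rule are illustrative assumptions, not part of any MLflow API:

```python
import json
import re

def score_response_format(response: str) -> dict:
    """Deterministic checks that need no LLM judge:
    valid JSON, a required field, and no leaked email addresses."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return {"valid_json": False, "has_answer": False, "no_email_leak": True}

    return {
        "valid_json": True,
        "has_answer": isinstance(payload.get("answer"), str),
        # bool() makes the regex result a clean True/False flag
        "no_email_leak": not bool(re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", response)),
    }
```

Checks like these are cheap, exact, and repeatable; reserve LLM judges for the questions below, which only a language-understanding evaluator can answer.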

  • Your customer support bot keeps making up return policies. Run RetrievalGroundedness on every response to catch when the model invents facts that aren't in your knowledge base. Identify whether the problem is bad retrieval or bad generation.
  • You rewrote your prompt and need to know if it's actually better. Evaluate both versions across 500 test inputs with Correctness and RelevanceToQuery judges. MLflow tracks both runs side by side so you can compare scores and deploy with confidence.
  • Your agent is burning through API credits. ToolCallEfficiency identifies redundant tool calls and unnecessary reasoning loops. Teams commonly find agents making 3-4x more LLM calls than needed before optimizing.
  • Legal needs to sign off before you ship to production. Define compliance rules with Guidelines judges: "never provide specific medical dosage recommendations" or "always include a disclaimer for financial advice." Run these on every output automatically.
  • Users are abandoning your chatbot mid-conversation. Multi-turn judges like UserFrustration and KnowledgeRetention reveal whether the agent is losing context, going in circles, or failing to resolve issues across conversation turns.
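Under the hood, a judge is a templated prompt over the inputs and outputs being graded, and the judge model returns a verdict plus a justification. A stripped-down sketch of how a guideline-style instruction might be rendered into a grading prompt (the template text is illustrative, not MLflow's actual internal prompt):

```python
def render_judge_prompt(guideline: str, inputs: str, outputs: str) -> str:
    """Fill a guideline-style instruction into a grading prompt.

    The judge model is asked for a verdict plus a written
    justification so results stay auditable.
    """
    return (
        "You are an evaluation judge.\n"
        f"Guideline: {guideline}\n"
        f"User input: {inputs}\n"
        f"Model output: {outputs}\n"
        "Does the output comply with the guideline? "
        "Answer 'pass' or 'fail', then justify your answer."
    )

prompt = render_judge_prompt(
    guideline="Always include a disclaimer for financial advice.",
    inputs="Should I buy index funds?",
    outputs="Index funds are a common choice. (Not financial advice.)",
)
```

MLflow's `Guidelines` and `make_judge` scorers handle this templating, model invocation, and result tracking for you; the sketch only shows the shape of the underlying prompt.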

How to Implement LLM-as-a-Judge

MLflow's evaluation framework ships with built-in judges for grounding, correctness, safety, relevance, and custom guidelines. For conversational applications, multi-turn judges evaluate complete sessions for context retention and user satisfaction. For agentic systems, tool call judges assess whether agents are choosing the right tools and using them efficiently. MLflow also integrates with DeepEval, RAGAS, and Arize Phoenix for 20+ additional evaluation metrics.

When built-in judges don't cover your needs, create custom judges through the Judge Builder UI or in code. If judges don't match your team's quality standards, judge optimization with MemAlign (experimental) uses human feedback to automatically refine judge instructions, improving agreement with human evaluators by 30-50%.

Check out the evaluation documentation for detailed guides and API reference.

Evaluation with Built-in and Custom Judges

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery
from mlflow.genai.judges import make_judge
from typing import Literal

# Define a custom judge
domain_accuracy = make_judge(
    name="domain_accuracy",
    instructions=(
        "Evaluate whether the {{ outputs }} provides"
        " accurate domain-specific information for"
        " the given {{ inputs }}."
    ),
    feedback_value_type=Literal["accurate", "inaccurate"],
    model="openai:/gpt-5",
)

# Run evaluation with built-in and custom judges together
# (eval_data is your evaluation dataset, prepared separately)
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[
        Correctness(),
        RelevanceToQuery(),
        domain_accuracy,
    ],
)
```

Judge Optimization with MemAlign

```python
from mlflow.genai.judges.optimizers import MemAlignOptimizer

# Align the judge's instructions with human feedback on labeled traces
optimizer = MemAlignOptimizer(reflection_lm="openai:/gpt-5")
aligned_judge = my_judge.align(
    traces=labeled_traces,
    optimizer=optimizer,
)

# Evaluate with the aligned judge
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[aligned_judge],
)
```

Build and iterate on custom judges in the MLflow UI, no code required.

MLflow is the largest open-source AI engineering platform for agents, LLMs, and ML models, with over 30 million monthly downloads. Thousands of organizations use MLflow to debug, evaluate, monitor, and optimize production-quality AI agents and LLM applications while controlling costs and managing access to models and data. Backed by the Linux Foundation and licensed under Apache 2.0, MLflow provides built-in judges, custom judge creation, and judge optimization with no vendor lock-in. Get started →

Frequently Asked Questions

What is LLM-as-a-judge?

LLM-as-a-judge is an evaluation technique where one LLM evaluates the quality of another LLM's outputs. Instead of relying on human reviewers or simple metrics, you use a judge model (like GPT, Claude, or Gemini) to assess correctness, relevance, safety, and groundedness, producing both a score and a justification.

Related Resources