LLM-as-a-Judge for LLM and Agent Evaluation

LLM-as-a-judge evaluates the outputs of LLM applications and agents across quality dimensions like correctness, relevance, groundedness, safety, and helpfulness. Any model can act as a judge (GPT, Claude, Gemini, open-source models, and beyond). When a judge analyzes agent outputs and behavior, it produces a score and a written justification.

LLM-as-a-judge gives engineering teams automated quality assessment at production scale. Traditional metrics like BLEU and ROUGE measure token overlap but miss whether a response hallucinated or violated tone guidelines. Human reviewers catch these issues but can only evaluate a limited number of outputs per day. As LLM applications move from prototypes to production, judge-based evaluation becomes essential for maintaining quality and catching regressions.

MLflow's evaluation framework supports LLM-as-a-judge with built-in judges, custom judge creation, and automatic tracking of every evaluation run across model versions, prompt variants, and system changes. MLflow also supports aligning judges with human feedback, so that automated scores stay calibrated to your team's quality standards.

Why LLM-as-a-Judge Matters

Agents, LLM applications, and RAG systems introduce evaluation challenges that traditional testing can't address:

Human Evaluation Doesn't Scale

Problem: Manual review creates bottlenecks and inconsistency. Different reviewers score the same output differently, and coverage is always incomplete.

Solution: LLM judges apply the same criteria to every output, achieving over 80% agreement with human evaluators while removing reviewer-to-reviewer inconsistency.

Traditional Metrics Miss Nuance

Problem: BLEU and ROUGE measure token overlap but can't assess whether a response is actually helpful, appropriate, or safe for the user.

Solution: LLM judges understand context and intent. They evaluate the qualities users actually care about: accuracy, helpfulness, tone, and policy compliance.

Production Monitoring Is Critical

Problem: Quality regressions in production go undetected until users report them. Manual spot-checking can't keep up.

Solution: Run judges continuously against production traces, catching degradation within seconds. Set thresholds and alert when scores drop.
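The alerting side of this loop is ordinary code. A minimal sketch in plain Python, with the judge call stubbed out and the threshold and window size as illustrative values you would tune for your own traffic:

```python
from collections import deque

def make_score_monitor(threshold: float, window: int):
    """Track a rolling window of judge scores and flag degradation."""
    scores = deque(maxlen=window)

    def record(score: float) -> bool:
        """Record one judge score; return True when the rolling mean
        of a full window drops below the alert threshold."""
        scores.append(score)
        rolling_mean = sum(scores) / len(scores)
        return len(scores) == window and rolling_mean < threshold

    return record

# Example: alert when the mean of the last 5 scores falls below 3.5
record = make_score_monitor(threshold=3.5, window=5)
alerts = [record(s) for s in [4.0, 3.9, 3.8, 3.2, 3.0, 2.9]]
# The sixth score pushes the rolling mean under the threshold
```

In practice the scores would come from judges run against production traces, and the alert would feed whatever paging or dashboard system you already use.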

Iteration Requires Measurement

Problem: Without quantitative metrics, teams can't measure whether prompt changes or model upgrades improve quality. Progress is guesswork.

Solution: LLM judges provide quantitative scores, so a change shows up as a concrete delta (for example, baseline correctness of 3.2 versus 3.8 with the new prompt). MLflow tracks all runs for easy comparison.
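The comparison itself reduces to simple aggregation over per-example judge scores. A minimal sketch, with made-up score lists standing in for two evaluation runs:

```python
def mean_score(scores: list) -> float:
    """Average per-example judge scores into a run-level metric."""
    return sum(scores) / len(scores)

# Hypothetical correctness scores (1-5 scale) from two evaluation runs
baseline = [3, 4, 3, 3, 3]
new_prompt = [4, 4, 4, 3, 4]

improvement = mean_score(new_prompt) - mean_score(baseline)
# Baseline mean 3.2, new-prompt mean 3.8: an improvement of 0.6
```

MLflow computes and stores these aggregates for you when you run `mlflow.genai.evaluate`; the point is that once scores are numeric, prompt iteration becomes a measurable diff rather than guesswork.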

Common Use Cases for LLM-as-a-Judge

Every production LLM system needs evaluation. If the check is deterministic (exact match, regex, JSON schema validation), a code-based scorer is the right tool. But most of the things that matter in production (did the response hallucinate? did it follow your brand guidelines? was it actually helpful?) can only be assessed by something that understands language.
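For contrast, here is what a deterministic, code-based check looks like. This sketch uses only the standard library; the function name, the required `answer` field, and the email-leak rule are illustrative assumptions, not part of any MLflow API:

```python
import json
import re

def score_response_format(response: str) -> dict:
    """Deterministic checks that need no LLM judge:
    valid JSON, a required field, and no leaked email addresses."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return {"valid_json": False, "has_answer": False, "no_email_leak": True}

    return {
        "valid_json": True,
        "has_answer": isinstance(payload.get("answer"), str),
        # bool() makes the regex result a clean True/False flag
        "no_email_leak": not bool(re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", response)),
    }
```

Checks like these are cheap, exact, and repeatable; reserve LLM judges for the questions below, which only a language-understanding evaluator can answer.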

  • Your customer support bot keeps making up return policies. Run RetrievalGroundedness on every response to catch when the model invents facts that aren't in your knowledge base. Identify whether the problem is bad retrieval or bad generation.
  • You rewrote your prompt and need to know if it's actually better. Evaluate both versions across 500 test inputs with Correctness and RelevanceToQuery judges. MLflow tracks both runs side by side so you can compare scores and deploy with confidence.
  • Your agent is burning through API credits. ToolCallEfficiency identifies redundant tool calls and unnecessary reasoning loops. Teams commonly find agents making 3-4x more LLM calls than needed before optimizing.
  • Legal needs to sign off before you ship to production. Define compliance rules with Guidelines judges: "never provide specific medical dosage recommendations" or "always include a disclaimer for financial advice." Run these on every output automatically.
  • Users are abandoning your chatbot mid-conversation. Multi-turn judges like UserFrustration and KnowledgeRetention reveal whether the agent is losing context, going in circles, or failing to resolve issues across conversation turns.
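Under the hood, a judge is a templated prompt over the inputs and outputs being graded, and the judge model returns a verdict plus a justification. A stripped-down sketch of how a guideline-style instruction might be rendered into a grading prompt (the template text is illustrative, not MLflow's actual internal prompt):

```python
def render_judge_prompt(guideline: str, inputs: str, outputs: str) -> str:
    """Fill a guideline-style instruction into a grading prompt.

    The judge model is asked for a verdict plus a written
    justification so results stay auditable.
    """
    return (
        "You are an evaluation judge.\n"
        f"Guideline: {guideline}\n"
        f"User input: {inputs}\n"
        f"Model output: {outputs}\n"
        "Does the output comply with the guideline? "
        "Answer 'pass' or 'fail', then justify your answer."
    )

prompt = render_judge_prompt(
    guideline="Always include a disclaimer for financial advice.",
    inputs="Should I buy index funds?",
    outputs="Index funds are a common choice. (Not financial advice.)",
)
```

MLflow's `Guidelines` and `make_judge` scorers handle this templating, model invocation, and result tracking for you; the sketch only shows the shape of the underlying prompt.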

How to Implement LLM-as-a-Judge

MLflow's evaluation framework ships with built-in judges for grounding, correctness, safety, relevance, and custom guidelines. For conversational applications, multi-turn judges evaluate complete sessions for context retention and user satisfaction. For agentic systems, tool call judges assess whether agents are choosing the right tools and using them efficiently. MLflow also integrates with DeepEval, RAGAS, and Arize Phoenix for 20+ additional evaluation metrics.

When built-in judges don't cover your needs, create custom judges through the Judge Builder UI or in code. If judges don't match your team's quality standards, judge optimization with MemAlign (experimental) uses human feedback to automatically refine judge instructions, improving agreement with human evaluators by 30-50%.

Check out the evaluation documentation for detailed guides and API reference.

Evaluation with Built-in and Custom Judges

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery
from mlflow.genai.judges import make_judge
from typing import Literal

# Define a custom judge
domain_accuracy = make_judge(
    name="domain_accuracy",
    instructions=(
        "Evaluate whether the {{ outputs }} provides"
        " accurate domain-specific information for"
        " the given {{ inputs }}."
    ),
    feedback_value_type=Literal["accurate", "inaccurate"],
    model="openai:/gpt-5",
)

# Run evaluation with built-in and custom judges together
# (eval_data is your evaluation dataset, prepared separately)
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[
        Correctness(),
        RelevanceToQuery(),
        domain_accuracy,
    ],
)
```

Judge Optimization with MemAlign

```python
from mlflow.genai.judges.optimizers import MemAlignOptimizer

# Align the judge's instructions with human feedback on labeled traces
optimizer = MemAlignOptimizer(reflection_lm="openai:/gpt-5")
aligned_judge = my_judge.align(
    traces=labeled_traces,
    optimizer=optimizer,
)

# Evaluate with the aligned judge
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[aligned_judge],
)
```

Build and iterate on custom judges in the MLflow UI, no code required.

MLflow is the largest open-source AI engineering platform for agents, LLMs, and ML models, with over 30 million monthly downloads. Thousands of organizations use MLflow to debug, evaluate, monitor, and optimize production-quality AI agents and LLM applications while controlling costs and managing access to models and data. Backed by the Linux Foundation and licensed under Apache 2.0, MLflow provides built-in judges, custom judge creation, and judge optimization with no vendor lock-in. Get started →

Frequently Asked Questions

What is LLM-as-a-judge?

LLM-as-a-judge is an evaluation technique where one LLM evaluates the quality of another LLM's outputs. Instead of relying on human reviewers or simple metrics, you use a judge model (like GPT, Claude, or Gemini) to assess correctness, relevance, safety, and groundedness, producing both a score and a justification.

Related Resources