LLM-as-a-judge evaluates the outputs of LLM applications and agents across quality dimensions like correctness, relevance, groundedness, safety, and helpfulness. Any model can act as a judge (OpenAI, Claude, Gemini, open source models, and beyond). When a judge analyzes agent outputs and behavior, it produces a score and a written justification.
LLM-as-a-judge gives engineering teams automated quality assessment at production scale. Traditional metrics like BLEU and ROUGE measure token overlap but miss whether a response hallucinated or violated tone guidelines. Human reviewers catch these issues but can only evaluate a limited number of outputs per day. As LLM applications move from prototypes to production, judge-based evaluation becomes essential for maintaining quality and catching regressions.
MLflow's evaluation framework supports LLM-as-a-judge with built-in judges, custom judge creation, and automatic tracking of every evaluation run across model versions, prompt variants, and system changes. MLflow also supports aligning judges with human feedback, so that automated scores stay calibrated to your team's quality standards.
Agents, LLM applications, and RAG systems introduce evaluation challenges that traditional testing can't address:
Problem: Manual review creates bottlenecks and inconsistency. Different reviewers score the same output differently, and coverage is always incomplete.
Solution: LLM judges apply the same criteria to every output, achieving over 80% agreement with human evaluators and eliminating reviewer inconsistency.
Problem: BLEU and ROUGE measure token overlap but can't assess whether a response is actually helpful, appropriate, or safe for the user.
Solution: LLM judges understand context and intent. They evaluate the qualities users actually care about: accuracy, helpfulness, tone, and policy compliance.
Problem: Quality regressions in production go undetected until users report them. Manual spot-checking can't keep up.
Solution: Run judges continuously against production traces, catching degradation within seconds. Set thresholds and alert when scores drop.
Problem: Without quantitative metrics, teams can't measure whether prompt changes or model upgrades improve quality. Progress is guesswork.
Solution: LLM judges provide quantitative scores (for example, a baseline correctness of 3.2 versus 3.8 with a new prompt), and MLflow tracks every run for easy comparison.
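The baseline-versus-candidate comparison above is simple arithmetic over judge scores. A minimal sketch in plain Python (the score data is hypothetical, for illustration only):

```python
# Hypothetical per-example correctness scores (1-5 scale) from a judge,
# for a baseline prompt and a candidate prompt.
baseline_scores = [3, 4, 3, 3, 3]
new_prompt_scores = [4, 4, 4, 3, 4]

def mean_score(scores):
    """Average judge score for one evaluation run."""
    return sum(scores) / len(scores)

baseline = mean_score(baseline_scores)
candidate = mean_score(new_prompt_scores)

# A simple regression gate: fail the change if quality drops.
assert candidate >= baseline, "quality regression detected"
print(f"Baseline: {baseline:.1f} correctness. New prompt: {candidate:.1f}")
```

The same pattern works as a CI gate or a production alert: compute the mean judge score per run, compare against a threshold, and fail or page when it drops.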
Every production LLM system needs evaluation. If the check is deterministic (exact match, regex, JSON schema validation), a code-based scorer is the right tool. But most of the things that matter in production (did the response hallucinate? did it follow your brand guidelines? was it actually helpful?) can only be assessed by something that understands language.
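The deterministic checks mentioned above (exact match, regex, JSON schema validation) need no LLM at all. A minimal sketch of such code-based scorers in plain Python (the function names are illustrative, not an MLflow API):

```python
import json
import re

def exact_match(output: str, expected: str) -> bool:
    """Pass only when the output equals the expected string (ignoring edge whitespace)."""
    return output.strip() == expected.strip()

def matches_pattern(output: str, pattern: str) -> bool:
    """Pass when the output contains a match for the given regex."""
    return re.search(pattern, output) is not None

def is_valid_json_with_keys(output: str, required_keys: set) -> bool:
    """Pass when the output parses as a JSON object containing all required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys <= parsed.keys()
```

Checks like these are cheap, fast, and exact, so use them wherever the criterion is mechanical, and reserve LLM judges for the fuzzy qualities (hallucination, tone, helpfulness) that code can't decide.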
MLflow's evaluation framework ships with built-in judges for grounding, correctness, safety, relevance, and custom guidelines. For conversational applications, multi-turn judges evaluate complete sessions for context retention and user satisfaction. For agentic systems, tool call judges assess whether agents are choosing the right tools and using them efficiently. MLflow also integrates with DeepEval, RAGAS, and Arize Phoenix for 20+ additional evaluation metrics.
When built-in judges don't cover your needs, create custom judges through the Judge Builder UI or in code. If judges don't match your team's quality standards, judge optimization with MemAlign (experimental) uses human feedback to automatically refine judge instructions, improving agreement with human evaluators by 30-50%.
Check out the evaluation documentation for detailed guides and API reference.
Evaluation with Built-in and Custom Judges
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery
from mlflow.genai.judges import make_judge
from typing import Literal

# Define a custom judge
domain_accuracy = make_judge(
    name="domain_accuracy",
    instructions=(
        "Evaluate whether the {{ outputs }} provides"
        " accurate domain-specific information for"
        " the given {{ inputs }}."
    ),
    feedback_value_type=Literal["accurate", "inaccurate"],
    model="openai:/gpt-5",
)

# Run evaluation with built-in and custom judges together
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[
        Correctness(),
        RelevanceToQuery(),
        domain_accuracy,
    ],
)
Judge Optimization with MemAlign
from mlflow.genai.judges.optimizers import MemAlignOptimizer

# Create an optimizer that reflects on human feedback
optimizer = MemAlignOptimizer(reflection_lm="openai:/gpt-5")

# Align the judge using human-labeled traces
aligned_judge = my_judge.align(
    traces=labeled_traces,
    optimizer=optimizer,
)

# Evaluate with the aligned judge
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[aligned_judge],
)
MLflow is the largest open-source AI engineering platform for agents, LLMs, and ML models, with over 30 million monthly downloads. Thousands of organizations use MLflow to debug, evaluate, monitor, and optimize production-quality AI agents and LLM applications while controlling costs and managing access to models and data. Backed by the Linux Foundation and licensed under Apache 2.0, MLflow provides built-in judges, custom judge creation, and judge optimization with no vendor lock-in. Get started →