Top 5 Agent Evaluation Tools in 2026

Shipping an AI agent without evaluation is like deploying code without tests. As agents grow more complex, with multi-step reasoning, tool use, and autonomous decision-making, you need evaluation frameworks that can score not just the final output but every step along the way. In this guide, we compare the top five agent evaluation frameworks and help you choose the right one for your team.

TL;DR

MLflow, the most widely adopted open source AI engineering platform with 30M+ monthly downloads, is our top pick. It has the broadest metric coverage, supports both rule-based and LLM-judge custom metrics, and lets human reviewers label results so that automated judges improve automatically from that feedback.

Alternatives: DeepEval for pytest-style CI/CD testing, Ragas for RAG-focused metrics, Arize Phoenix for teams extending their existing ML observability to LLM evaluation. All three also integrate natively with MLflow so their metrics can be used as plugins.

What to Look For in an Agent Evaluation Tool

Every framework on this list can score LLM outputs. The real question is how deeply they can evaluate agent behavior, and how well they integrate with the rest of your development workflow. Before comparing tools, here are the five capabilities that separate production-grade evaluation from basic output checking.

1. Comprehensive built-in metrics plus flexible custom metrics

Built-in metrics are a good start, but almost every real project requires some sort of custom evaluation criteria. The framework must make it easy to define both rule-based checks (format validation, tool call verification) and LLM judges (for nuanced dimensions like helpfulness and safety). Lacking this capability significantly limits the value of evaluation.
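
To make "rule-based checks" concrete, here is a minimal sketch of two such checks as plain Python functions. The JSON-format check and the allowed-tool check are made-up examples of the kind of custom criteria most projects end up needing; most frameworks on this list let you register functions like these as custom metrics.

```python
import json

def is_valid_json(output: str) -> bool:
    """Format validation: does the agent's output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def called_allowed_tools(tool_calls: list[str], allowed: set[str]) -> bool:
    """Tool call verification: did the agent stay within its allowed tool set?"""
    return all(name in allowed for name in tool_calls)
```

LLM judges cover the complementary, fuzzier dimensions (helpfulness, safety) that rules like these cannot express.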

2. Seamless path from manual feedback to automated evaluation

Human feedback is the ground truth for agent quality. But manually reviewing every response does not scale. Look for frameworks that let you collect human labels on a sample of outputs and then automatically improve your automated judges to match those human assessments. This lets you start with human review and gradually shift to automated evaluation without sacrificing accuracy.

3. Multi-turn and conversation-level evaluation

Real agents are conversational. A single-turn eval setup that tests one request and one response misses most of how agents actually behave in production: clarifying questions, follow-ups, context carried across turns, and recovering from earlier mistakes. The framework should be able to evaluate a full conversation as a unit, score quality across turns, and ideally simulate synthetic conversations for test coverage beyond your labeled dataset.

4. A feedback loop between production and development

Test datasets are necessary but not sufficient. Production traffic behaves differently from your test set, models drift, and user behavior shifts over time. Look for frameworks that close the gap: production traces should be convertible into evaluation datasets, quality regressions should surface as actionable signals, and the cycle from "found a problem in production" to "verified the fix in staging" should be short.
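
Closing that loop often starts with something as simple as converting logged production traces into evaluation rows. The sketch below assumes a hypothetical trace record with request, response, and error fields; real tracing systems expose richer structures, but the shape of the transformation is the same.

```python
def traces_to_eval_rows(traces: list[dict]) -> list[dict]:
    """Convert logged production traces into evaluation dataset rows.

    Assumes each trace is a dict with hypothetical 'request', 'response',
    and 'error' fields; traces that errored out are excluded.
    """
    return [
        {"inputs": t["request"], "outputs": t["response"]}
        for t in traces
        if t.get("error") is None
    ]
```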

5. Intuitive visualization

Numbers alone are not enough. When scores drop or an edge case surfaces, you need to quickly understand why. Look for frameworks that display trace timelines, per-turn scores, and metric breakdowns in a clear UI rather than requiring you to dig through JSON logs or build your own dashboards.

Quick Comparison

| Capability | MLflow | DeepEval | Ragas | Arize Phoenix | LangSmith |
| --- | --- | --- | --- | --- | --- |
| Open Source | ✔️ | ✔️ | ✔️ | Partial (ELv2) | No |
| PyPI Downloads | 30M+/mo | 1.9M+/mo | 1M+/mo | 1M+/mo | 65M+/mo ¹ |
| Dataset Management | ✔️ | SDK-only | No | ✔️ | ✔️ |
| Multi-Turn Evaluation | ✔️ | ✔️ | ✔️ | Limited | ✔️ |
| Conversation Simulation | ✔️ | ✔️ | No | No | No |
| Human Feedback Collection | ✔️ | SDK-only | No | ✔️ | ✔️ |
| LLM Judge Alignment | ✔️ (automated tuning) | No | No | No | Manual tuning with UI |
| Online Monitoring | ✔️ | No | No | ✔️ | ✔️ |
| Visualization | ✔️ | Requires Confident AI (paid) | No | ✔️ | ✔️ |

¹ LangSmith's PyPI count is inflated because it is an automatic dependency of the langchain package.

1. MLflow - The Complete Evaluation Platform

MLflow is the most widely deployed open source AI engineering platform, and its evaluation system is designed specifically for the agent development loop. Unlike standalone evaluation libraries that score outputs in isolation, MLflow's scorer (LLM judge) framework evaluates full execution traces, including tool calls, reasoning chains, and planning decisions. Combined with tracing, prompt optimization, and governance, MLflow provides the complete evaluation-to-improvement pipeline in one tool.

Trace-Aware Scorers That Evaluate the Full Agent Reasoning Loop

MLflow's evaluation API, mlflow.genai.evaluate(), is designed to evaluate agents as they actually run. Scorers receive the complete execution trace, not just the final output, so they can assess tool selection, plan quality, logical consistency, execution efficiency, and plan adherence across the entire reasoning loop. MLflow includes built-in Agent GPA (Goal-Plan-Action) scorers for common agent evaluation patterns, and you can write custom scorers in Python for any domain-specific criteria. The evaluation harness runs your agent and scorers in parallel, recording all results as traces and feedback in the MLflow tracking server.
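
As a sketch of what a custom scorer can look like (assuming MLflow 3.x; the tool name search_docs and the eval_dataset variable are made-up examples), the core check is plain Python, and the MLflow wiring is shown in comments rather than verified verbatim:

```python
def called_expected_tool(tool_names: list[str], expected: str = "search_docs") -> bool:
    """Rule-based check: did the agent invoke the expected tool at least once?"""
    return expected in tool_names

# With MLflow installed, the check can be wrapped as a trace-aware scorer
# and run through the evaluation harness (sketch, not verified verbatim):
#
# import mlflow
# from mlflow.genai.scorers import scorer
#
# @scorer
# def tool_check(trace):
#     names = [span.name for span in trace.search_spans(span_type="TOOL")]
#     return called_expected_tool(names)
#
# mlflow.genai.evaluate(data=eval_dataset, scorers=[tool_check])
```

Because the scorer receives the whole trace, the same pattern extends to checks on plan steps, retrieval spans, or intermediate LLM calls.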

LLM Judge Alignment with Research-Backed Algorithms

Most evaluation frameworks let you define LLM judges but give you no way to verify they are actually calibrated to human judgment. MLflow's judge alignment is built on research-backed algorithms including GEPA and MemAlign, which optimize judge prompts against your human labels so that automated scores track what reviewers actually care about. The result is automated evaluation you can trust, not just run.

Widest Metric Coverage with Native Library Integrations

MLflow natively integrates with the leading evaluation libraries as pluggable scorers: Ragas, DeepEval, Arize Phoenix, TruLens, and Guardrails AI. This means you can use the best metrics from each library within mlflow.genai.evaluate() without writing custom glue code. Combined with MLflow's own built-in scorers and the ability to write custom scorers with the @scorer decorator, MLflow provides the widest metric coverage of any evaluation framework on this list. All scorer results are tracked, versioned, and visualized in the MLflow UI alongside traces and experiments.

MLflow also includes automatic issue detection, which scans thousands of traces using LLMs and clustering algorithms to surface actionable quality issues ranked by severity, without requiring manual trace review. Combined with a built-in AI Assistant, this streamlines issue identification even for less technical team members.

Pros:
- Fully open source (Apache 2.0) with Linux Foundation governance; no feature gating
- Widest metric coverage: native integration with Ragas, DeepEval, Phoenix, TruLens, and Guardrails as pluggable scorers
- Built-in human feedback collection and automated LLM judge alignment

Cons:
- Broader feature set than single-purpose eval libraries, which may not be needed for simple use cases

Best for: Teams that need the widest metric coverage for agent evaluation, with native integration for leading eval libraries, a unified interface for custom and LLM judge metrics, and built-in human feedback alignment. The only fully open source platform that connects evaluation to prompt optimization and governance.
How does MLflow evaluate agent traces, not just outputs?

MLflow's scorer framework receives the complete execution trace from MLflow Tracing, including every tool call, LLM invocation, and planning step. Built-in Agent GPA scorers evaluate tool selection, plan quality, logical consistency, and execution efficiency across the full agent trajectory. Custom scorers can access any part of the trace to implement domain-specific evaluation logic.

Can I use MLflow with other eval libraries?

Yes. MLflow integrates with Ragas, DeepEval, Arize Phoenix, TruLens, and Guardrails AI as evaluation libraries. You can use their metrics as scorers within mlflow.genai.evaluate(), combining the best metrics from multiple libraries while keeping results centralized in MLflow.

Does MLflow evaluation work in CI/CD?

Yes. mlflow.genai.evaluate() can be called from any Python script, including CI/CD pipelines. Evaluation results are logged to the MLflow tracking server, and you can set pass/fail thresholds programmatically to gate deployments based on evaluation scores.
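
A deployment gate on evaluation scores can be as simple as a threshold check in the CI script. The 0.8 threshold below is illustrative, and how you extract `scores` depends on your evaluation results format:

```python
import sys

def passes_gate(scores: list[float], threshold: float = 0.8) -> bool:
    """Gate: pass only if the mean evaluation score meets the threshold."""
    return bool(scores) and sum(scores) / len(scores) >= threshold

# In a CI script, exit nonzero to block the deployment on failure:
# if not passes_gate(scores):
#     sys.exit(1)
```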

What is prompt optimization and how does it connect to evaluation?

MLflow's prompt optimization uses algorithms like GEPA and MIPRO to automatically find better prompts based on your evaluation criteria. After running evaluation and identifying quality issues, you feed the results into the optimizer, which explores prompt variations and selects the ones that score highest on your metrics.

2. DeepEval - Pytest-Native Evaluation with 50+ Metrics

DeepEval is an open source LLM evaluation framework built by Confident AI that brings a pytest-native testing experience to agent evaluation. With 50+ research-backed metrics and a familiar testing interface, DeepEval makes it easy to add LLM evaluation to existing CI/CD workflows. The framework covers agents, chatbots, RAG, single-turn, multi-turn, and safety evaluation, all from a single library.

[Screenshot: DeepEval documentation showing evaluation framework features and getting started guide]

Testing LLM Agents Like You Test Regular Code

DeepEval's defining feature is its pytest-native interface. You write evaluation tests using the same patterns and tooling that Python developers already know: assert_test() calls, test discovery, fixtures, and familiar CLI output. This means evaluation integrates naturally into CI/CD pipelines. Teams can run agent evaluation as part of their standard test suite, catching regressions on every pull request without introducing new tooling or workflows.
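
The pattern looks like ordinary pytest code. The deterministic example below mirrors the shape of a DeepEval test without calling an LLM; the refund-policy case and the required-terms check are made up for illustration.

```python
def contains_required_terms(output: str, required: list[str]) -> bool:
    """A deterministic stand-in for an evaluation metric."""
    lowered = output.lower()
    return all(term in lowered for term in required)

def test_refund_answer():
    # In a real DeepEval test this would be an LLMTestCase scored via
    # assert_test() with metrics such as answer relevancy.
    output = "Our refund policy allows returns within 30 days."
    assert contains_required_terms(output, ["refund", "30 days"])
```

Because it is just a pytest test, `pytest` discovery, fixtures, and CI reporting all work unchanged.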

50+ Pre-built Metrics Covering Agents, RAG, and Safety

DeepEval ships with a broad library of 50+ metrics covering tool selection accuracy, planning quality, faithfulness, reasoning coherence, hallucination detection, answer relevancy, and safety. Metrics use LLM judges and NLP models that can run locally, keeping evaluation costs predictable. The framework covers single-turn, multi-turn, and agentic evaluation in one package.

Limited Platform Integration Without Confident AI

DeepEval itself is a Python library, not a platform. To visualize evaluation results, manage datasets, or collaborate across teams, you need the Confident AI platform, which starts at $19.99/user/month. DeepEval does not include tracing, prompt optimization, or governance capabilities. Teams using DeepEval for evaluation will need separate tools for observability and the rest of the agent development lifecycle.

Pros:
- Open source (Apache 2.0) with pytest-native testing interface
- 50+ pre-built metrics covering agents, RAG, chatbots, and safety

Cons:
- Visualization and team collaboration require Confident AI ($19.99+/user/month)
- No production monitoring, tracing, prompt optimization, or governance capabilities
- Not trace-aware; evaluation runs on provided data, not production traces

Best for: Teams that want to add agent evaluation to existing pytest CI/CD workflows with a rich library of pre-built metrics. Pairs well with a tracing platform like MLflow for production observability.
Is DeepEval really free?

The DeepEval library is free and open source under Apache 2.0. You can run all 50+ metrics locally without any paid subscription. The Confident AI platform, which adds visualization, dataset management, and team collaboration, has a free tier limited to 2 seats and 5 test runs per week. Paid plans start at $19.99/user/month.

How does DeepEval handle agent-specific evaluation?

DeepEval supports span-level evaluation that scores each step of an agent independently. You can evaluate tool selection accuracy, planning quality, step-level faithfulness, and reasoning coherence. The framework also supports multi-turn evaluation for conversational agents.

Can I use DeepEval with MLflow?

Yes. MLflow integrates with DeepEval as an evaluation library. You can use DeepEval's metrics as scorers within mlflow.genai.evaluate(), combining DeepEval's rich metric library with MLflow's trace-aware evaluation and platform capabilities.

What is the difference between DeepEval and Confident AI?

DeepEval is the open source evaluation library that provides the metrics and testing logic. Confident AI is the commercial platform built by the same team that adds visualization, dataset management, production monitoring, and team collaboration on top of DeepEval.

3. Ragas - Research-Backed RAG and Agent Evaluation

Ragas is an open source evaluation framework that started as the standard for RAG evaluation and has expanded to cover agent evaluation as well. Born from an EACL 2024 research paper, Ragas provides research-validated metrics for faithfulness, answer relevancy, context precision, agent goal accuracy, and tool call accuracy. It is a lightweight library with no platform dependency, making it easy to integrate into any evaluation workflow.

[Screenshot: Ragas documentation showing agent evaluation metrics including goal accuracy and tool call accuracy]

Built for RAG Evaluation

Ragas established many of the evaluation metrics that other frameworks have since adopted. Its core RAG metrics (faithfulness, answer relevancy, context precision, context recall) are well-documented in peer-reviewed research and widely cited in the community. For teams evaluating RAG pipelines, Ragas provides a thoroughly validated set of metrics, with clear mathematical definitions and known failure modes documented in the research literature.

Agent Evaluation with Goal and Tool Accuracy

Ragas has expanded beyond RAG to support agent evaluation with metrics like AgentGoalAccuracy (did the agent achieve the user's goal?), ToolCallAccuracy (did the agent call the right tools with the right parameters?), and TopicAdherence (did the agent stay on topic?). Multi-turn evaluation is supported through the MultiTurnSample class, which represents conversational interactions between humans, AI, and tools. The AspectCritic metric provides flexible evaluation of multi-turn conversations against custom criteria.
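
To make the goal-accuracy idea concrete, here is a deterministic toy version of what AgentGoalAccuracy measures. Ragas itself uses an LLM judge for this, and the message format below is a simplified stand-in for its multi-turn message types:

```python
def goal_reached(conversation: list[dict], goal_keyword: str) -> bool:
    """Toy goal check: did the agent's final reply address the user's goal?"""
    ai_turns = [m for m in conversation if m["role"] == "ai"]
    return bool(ai_turns) and goal_keyword in ai_turns[-1]["content"].lower()
```

For example, for a conversation ending with "Your flight to Paris is booked.", checking for the keyword "booked" would pass, while "hotel" would fail.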

A Library, Not a Platform

Ragas is a Python library with no UI, no tracing backend, and no platform layer. This is both a strength and a limitation. It is easy to integrate into any workflow and has no vendor lock-in, but teams will need to build their own infrastructure for visualizing results, managing datasets, and tracking evaluation results over time. Ragas metrics use LLM calls for scoring, which adds cost and latency to each evaluation run. The framework also does not provide trace-aware evaluation; you need to extract and format data from your tracing system before passing it to Ragas.

Pros:
- Open source (Apache 2.0) with research-validated metrics from peer-reviewed papers
- Strong RAG evaluation metrics (faithfulness, relevancy, context precision)
- Lightweight library with no platform dependency or vendor lock-in

Cons:
- No UI, tracing, or platform layer; teams must build their own infrastructure
- Not trace-aware; requires manual data extraction from tracing systems
- LLM-based scoring adds cost and latency; no local model option

Best for: Teams that need academically validated RAG evaluation metrics and want a lightweight library they can integrate into existing pipelines. Pairs well with MLflow for tracing and platform capabilities.
Is Ragas only for RAG evaluation?

No. While Ragas started as a RAG evaluation framework, it now includes agent-specific metrics like AgentGoalAccuracy, ToolCallAccuracy, and TopicAdherence. It also supports multi-turn conversation evaluation. However, its RAG metrics remain the most mature and well-validated part of the library.

Can I use Ragas with MLflow?

Yes. MLflow integrates with Ragas as an evaluation library. You can use Ragas metrics as scorers within mlflow.genai.evaluate(), combining Ragas's research-backed metrics with MLflow's trace-aware evaluation and platform capabilities.

How much does Ragas cost to run?

The Ragas library itself is free under Apache 2.0. However, most metrics use LLM calls for scoring, so each evaluation run incurs API costs from your LLM provider. The cost depends on the number of test cases, the metrics you use, and the LLM provider pricing.

Does Ragas support custom metrics?

Yes. Ragas supports custom metrics through LLM-based judges. You can define custom evaluation criteria using natural language and use the AspectCritic metric for flexible, criteria-based evaluation of any aspect of agent behavior.

4. Arize Phoenix - Observability-First Evaluation

Arize Phoenix is an observability tool built by Arize AI that includes a strong evaluation layer. Phoenix combines distributed tracing with 50+ built-in evaluation metrics, so you can score traces directly within the same platform you use for debugging. Its evaluation system supports LLM-based evaluators, code-based checks, and human labels, with integration support for third-party libraries like Ragas and DeepEval.

[Screenshot: Arize Phoenix UI showing traces, evaluation metrics, and agent analysis]

Evaluate Traces Directly Within Your Observability Tool

Phoenix's key advantage for evaluation is that it can score traces and spans directly within the observability UI. You do not need to export data to a separate evaluation tool. Built-in evaluators cover hallucination detection, faithfulness, relevance, safety, and toxicity, and you can attach evaluation scores to individual spans for fine-grained analysis. The evaluation system supports both pre-built and custom evaluators, and integrates with third-party libraries like Ragas, DeepEval, and Cleanlab.

Familiar Ground for Teams Coming from Traditional ML Monitoring

Arize is a long-standing ML observability platform, and Phoenix carries that lineage. Teams that already use Arize for monitoring classical ML models will find the evaluation workflow familiar: attach evaluators to traces, score data quality dimensions, and track drift over time. The transition from traditional ML monitoring to LLM evaluation requires less ramp-up than switching to a purpose-built LLM evaluation tool from scratch.

Source-Available License and Platform Gaps

Phoenix uses the Elastic License 2.0 (ELv2), which restricts offering the software as a managed service. High-value features like the Alyx Copilot, online evaluations, and monitoring are gated behind paid plans on the commercial Arize AX platform. Phoenix does not offer prompt optimization or governance capabilities, and its multi-turn evaluation support is more limited than that of dedicated evaluation libraries. The free tier retains data for only 7 days, and scaling beyond single-node deployments requires the commercial platform.

Pros:
- Evaluate traces directly within the observability UI; no data export needed
- 50+ built-in evaluation metrics with support for Ragas, DeepEval, and Cleanlab
- Natural fit for teams already using Arize for traditional ML monitoring

Cons:
- Elastic License 2.0 restricts use as a managed service
- High-value features (Alyx Copilot, online eval, monitoring) gated behind paid plans
- Limited multi-turn and conversation-level evaluation support

Best for: Teams already using Arize for traditional ML monitoring who want to extend their existing observability workflow to cover LLM and agent evaluation. Phoenix metrics are also available as pluggable scorers in MLflow.
Is Arize Phoenix open source?

Phoenix is source-available under the Elastic License 2.0 (ELv2), which allows free use but restricts offering the software as a managed service. This is a more restrictive license than Apache 2.0 or MIT. The commercial Arize AX platform adds features not available in the source-available version.

Can Phoenix evaluate agent traces, not just outputs?

Yes. Phoenix can attach evaluation scores to individual spans within a trace, so you can evaluate tool calls, retrieval steps, and reasoning independently. This provides more granular evaluation than output-only scoring.

How does Phoenix compare to MLflow for evaluation?

Both Phoenix and MLflow support trace-aware evaluation. Phoenix focuses on built-in metrics within the observability UI, while MLflow provides a broader evaluation platform with scorer framework, multi-turn evaluation, LLM judge alignment with human feedback, and automated prompt optimization based on evaluation results. MLflow also integrates with Phoenix as an evaluation library.

What is the free tier data retention?

The free self-hosted Phoenix retains data for 7 days. The commercial Arize AX platform offers 15 days on Pro and longer retention on Enterprise plans.

5. LangSmith - LangChain's Paid Evaluation Platform

LangSmith is the commercial evaluation platform built by LangChain. It provides multiple evaluator types (human, heuristic, LLM-as-judge, pairwise comparison), agent trajectory evaluation, and both offline and online evaluation modes. LangSmith is most polished for teams building on LangChain and LangGraph, though it supports other frameworks through OpenTelemetry ingestion.

[Screenshot: LangSmith evaluation UI showing evaluation runs, scores, and dataset management]

Agent Trajectory Evaluation and Annotation Queues

LangSmith can capture the full trajectory of an agent's steps, tool calls, and reasoning, and define evaluators that score intermediate decisions. This helps teams debug complex agent workflows and pinpoint where things went wrong. Annotation queues provide structured human review, where domain experts can label traces and build evaluation datasets from real production data.

Polly AI Assistant for Automated Trace Analysis

LangSmith includes Polly, an AI assistant that analyzes trace data in natural language. You can ask Polly to summarize failure patterns, identify common error types, and prioritize improvements by frequency and impact, without writing any evaluation code. Complementing this are topic clustering and an insights agent that automatically categorize agent behavior across production runs. These features are available on Plus and Enterprise plans.

Proprietary Platform with Per-Seat Pricing

LangSmith is a proprietary SaaS platform with no self-hosted option outside enterprise contracts. The free tier is limited to 5,000 traces per month with 14-day retention. The Plus plan costs $39/seat/month, and extended retention (400 days) is a paid add-on at $5/1k traces. Per-seat pricing can limit access for PMs, QA engineers, and other stakeholders who need to review evaluation results but may not justify a full seat cost.

Pros:
- Agent trajectory evaluation with structured human annotation queues
- Polly AI assistant for natural-language trace analysis without writing eval code
- Deep integration with the LangChain and LangGraph ecosystem

Cons:
- Proprietary SaaS with no self-hosted option outside enterprise contracts
- Per-seat pricing ($39/seat/month) limits collaboration across teams
- Feature parity lags for integrations outside the LangChain ecosystem

Best for: Teams 100% committed to the LangChain/LangGraph ecosystem who are comfortable with proprietary SaaS pricing and do not require self-hosting.
Does LangSmith evaluation work outside LangChain?

LangSmith supports other frameworks via OpenTelemetry ingestion and a traceable wrapper. However, the evaluation experience is most polished for LangChain and LangGraph applications. Teams using other frameworks may find some features less integrated.

Can I self-host LangSmith?

Self-hosting is only available on the Enterprise tier. The free, Plus, and Team plans are SaaS-only with data stored on LangChain's infrastructure.

How does LangSmith pricing work for evaluation?

The free tier includes 5,000 traces/month with 14-day retention. Plus is $39/seat/month with higher trace limits. Extended retention (400 days) costs $5/1k traces as a paid add-on. Enterprise pricing is custom. Evaluation runs count toward your trace quota.

What types of evaluators does LangSmith support?

LangSmith supports human evaluation through annotation queues, heuristic checks (output validation, code compilation), LLM-as-judge evaluators that score against custom criteria, and pairwise comparison evaluators for A/B testing agent versions. Custom evaluators can be written in Python or TypeScript.

How to Choose the Right Framework

All five frameworks on this list can evaluate agent outputs. The difference lies in how flexible they are for your specific needs, and how well they support the full evaluation lifecycle from custom metrics to human feedback alignment.

How Much Metric Coverage Do You Actually Need?

Libraries like DeepEval and Ragas ship with broad pre-built metric sets, but pre-built metrics rarely cover everything. Evaluate whether the framework makes it easy to add custom rule-based checks and LLM judges, and whether those custom metrics are treated as first-class objects with version management and reuse. The depth of custom metric support matters more than the number of pre-built ones as your evaluation suite matures.

Do You Need a Library or a Platform?

If you just need metrics you can run in a script or CI pipeline, DeepEval and Ragas are excellent standalone libraries. If you need evaluation connected to tracing, visualization, online monitoring, and human feedback, you need a platform. Consider whether you want to assemble those capabilities yourself or use something that provides them out of the box.

How Will You Calibrate Your Automated Judges?

LLM judges are only as reliable as their alignment with human expectations. Without a way to collect human labels on a sample of outputs and adjust your judges accordingly, you risk optimizing for metrics that do not reflect real quality. Check whether the framework provides built-in human feedback collection and a mechanism to improve judges automatically from those labels.
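
Whatever alignment mechanism a framework offers, the underlying measurement is simple: how often does the judge agree with human labels? A minimal agreement-rate check, assuming boolean pass/fail labels, looks like this:

```python
def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of examples where the automated judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Alignment algorithms then adjust the judge prompt to push this rate (or a correlation measure for graded scores) upward on a held-out set.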

Our Recommendation

For teams building production agents, MLflow is our top recommendation. It provides the widest metric coverage through native integration with DeepEval, Ragas, Arize Phoenix, TruLens, and Guardrails AI as pluggable scorers. Its unified @scorer decorator and make_judge() API make it easy to define, version, and manage custom metrics. And its built-in human feedback collection with the align() API for automatic LLM judge alignment is a capability you will not find in any other framework on this list. All of this is fully open source under the Apache 2.0 license, backed by the Linux Foundation. Get started with the Evaluation Quickstart.

Alternatives Worth Considering

DeepEval is a good choice for teams that want pytest integration with a large pre-built metric library. Ragas is the standard for academically-validated RAG metrics. Arize Phoenix suits teams already running Arize for traditional ML who want to extend the same workflow to LLM evaluation. All three integrate natively with MLflow as pluggable metric libraries.

Frequently Asked Questions

What is agent evaluation?

Agent evaluation is the process of systematically scoring how well an AI agent performs its tasks. Unlike traditional LLM evaluation, which focuses on output quality, agent evaluation also assesses intermediate steps: tool selection, reasoning chains, planning quality, and the overall trajectory from goal to completion. Good agent evaluation catches problems that output-only scoring misses, like an agent that arrives at the right answer through incorrect reasoning. Learn more on our LLM evaluation page.

What is the difference between output evaluation and trace-aware evaluation?

Output evaluation scores the final result of an agent against a reference or quality criteria. Trace-aware evaluation goes deeper, scoring every step the agent took, including tool calls, LLM invocations, and planning decisions. Trace-aware evaluation can identify the specific step where an agent went wrong, while output evaluation can only tell you that the final result was incorrect. See how MLflow implements trace-aware evaluation in the Evaluation and Monitoring docs.

Can I use multiple evaluation frameworks together?

Yes, but managing each library's own interface, data format, and result storage quickly becomes a mess. The practical approach is to use a platform like MLflow that integrates with multiple metric libraries through a single unified interface, so you get academically-validated RAG metrics from Ragas and pytest-native agent metrics from DeepEval without juggling separate workflows.

How do I align LLM judges with human feedback?

LLM judge alignment is the process of calibrating automated judges so their scores match what human reviewers actually care about. You collect a sample of human labels, then use an optimization algorithm to adjust the judge prompt until its outputs correlate with those human assessments. Research algorithms like MemAlign and GEPA formalize this process, making judge calibration reproducible and measurable rather than ad hoc. See the MLflow judge alignment guide for a walkthrough.

How do I evaluate multi-turn agent conversations?

Multi-turn evaluation scores agent behavior across a full conversation, not just a single request-response pair. This means tracking whether the agent carries context correctly across turns, asks good clarifying questions, recovers from earlier mistakes, and reaches the user's goal by the end of the session. Among the frameworks on this list, MLflow, DeepEval, and Ragas all support multi-turn evaluation; Arize Phoenix's support is limited. See the multi-turn evaluation guide for practical examples.

Should I use LLM judges or deterministic metrics for agent evaluation?

Both have their place. Deterministic metrics (exact match, regex, code compilation) are fast, cheap, and reproducible, making them ideal for CI/CD gates. LLM judges handle nuanced quality dimensions like helpfulness, safety, and faithfulness that are hard to capture with rules. Most teams use deterministic metrics for basic correctness checks and LLM judges for higher-level quality assessment.

How often should I run agent evaluations?

At minimum, run offline evaluation on every code change that touches agent logic: prompt changes, tool additions, and model swaps. For production agents, online evaluation on a sample of live traffic catches quality drift that offline tests miss, because production queries behave differently from curated test sets.

Related Resources