AI agents are quickly becoming the default architecture for production LLM applications. Multi-step reasoning, tool use, planning, and autonomous decision-making introduce complexity that makes traditional logging woefully inadequate. In this guide, we compare the top five agent observability tools and help you choose the right one for your team.
MLflow, the most widely adopted open source AI engineering platform with 30M+ monthly downloads, is the top pick for teams who care about trace data ownership and want a complete platform for building production-grade agents. It covers observability, evaluation, prompt optimization, and governance in one place, with no enterprise paywalls.
Alternatives: Langfuse for ClickHouse-native self-hosting, LangSmith for teams fully committed to LangChain, Arize Phoenix for research-backed evaluation metrics, and Braintrust for fast prototyping with non-technical stakeholders.
Agent observability is end-to-end visibility into every step an AI agent takes in production: LLM calls, tool invocations, retrieval steps, and planning decisions. Every tool on this list can capture traces. The real question is what happens after the trace lands. Before comparing platforms, here are the three capabilities that separate production-grade observability from expensive logging.
The agent framework landscape moves fast: LangGraph, OpenAI Agents SDK, DSPy, Pydantic AI, CrewAI, and new entrants every quarter. Your observability platform should integrate with all of them through a unified API, not lock you into a single framework's ecosystem. The same goes for LLM providers, coding agents, and deployment targets. If switching frameworks means rebuilding your observability setup, the tool is a liability, not an asset.
Traces that sit in a dashboard forever do not improve your agents. A well-integrated AI platform converts your trace data into fuel for the agent improvement loop. Once traces flow into the platform, you can evaluate the agent's performance, optimize prompts, and monitor the agent's behavior in production.
Traces are among the most valuable data an AI team generates. They capture what your agents actually do in production and can contain sensitive information that must be protected. If that data is locked inside a proprietary SaaS with no export path, you are handing a strategic asset to a vendor. Look for full open source availability so you can self-host on your own infrastructure and use the database and storage systems that best fit your environment, without being locked into a single vendor's architecture.
| Capability | MLflow | Langfuse | LangSmith | Arize Phoenix | Braintrust |
|---|---|---|---|---|---|
| Open Source | ✔️ | ✔️ | No | Partial (ELv2) | No |
| License | Apache 2.0 (Linux Foundation) | MIT (ClickHouse Inc.) | Proprietary | Elastic License 2.0 (ELv2) | Proprietary |
| PyPI Downloads | 30M+/mo | 15M+/mo | 65M+/mo ¹ | 1M+/mo | 3M+/mo |
| Integration | 60+ frameworks via OpenTelemetry | 60+ frameworks via OpenTelemetry | LangChain-native + OpenTelemetry | 40+ via OpenInference + OpenTelemetry | 50+ frameworks |
| OpenTelemetry | ✔️ | Partial (ingest) | Partial (ingest) | ✔️ | Partial (ingest) |
| Governance (AI Gateway) | ✔️ | No | No | No | ✔️ |
| Self-Hosting | Simple | Complex (5+ services + ClickHouse) | Enterprise-only | Simple | Not available |
| Production Scale | ✔️ (self-hosted, scales with your infra) | ✔️ (ClickHouse-based) | ✔️ (managed SaaS) | Single-node OSS; managed SaaS for scale | ✔️ (managed SaaS) |
| Data Retention | Unlimited | 30 days (free) to 3 years (pro) | 14 days (free); 400 days (paid add-on) | 7 days (free); 15 days (pro) | 14 days (starter); 30 days (pro) |
¹ LangSmith's PyPI count is inflated because it is an automatic dependency of the langchain package.
MLflow is the most widely deployed open source AI engineering platform. Built on top of its OpenTelemetry-native observability layer, MLflow provides a complete, production-focused AI engineering platform that covers the full lifecycle from prototyping to production, including tracing, evaluation, prompt management and optimization, and governance. While other tools on this list focus on one slice of the problem, MLflow is built for teams that need to get agents into production and keep them there.

Unlike tools tied to a single vendor's commercial interests, MLflow is governed by the Linux Foundation, the neutral home of open source projects like Linux, Kubernetes, and PyTorch. Every feature in MLflow is available in the open source release and will remain so; no paywall gates critical capabilities. That commitment to openness is also reflected in MLflow's technical choices: OpenTelemetry, the vendor-neutral observability standard, serves as the foundation layer for its observability capabilities.
Most observability tools stop at tracing. MLflow goes far beyond. Its production-grade evaluation system includes built-in LLM judges, multi-turn evaluation, integration with leading eval libraries (RAGAS, DeepEval, Phoenix, TruLens, Guardrails AI), metric versioning, and the ability to align judges with human feedback.
On top of that, MLflow offers prompt optimization with state-of-the-art algorithms like GEPA and MIPRO that automatically improve prompts based on evaluation results, so you can stop tweaking prompts by hand and let the optimizer find what works.
Finally, the AI Gateway provides a centralized layer for governing LLM access across your organization, with routing, rate limiting, fallbacks, usage tracking, and credential management across providers (OpenAI, Anthropic, Bedrock, Azure, Gemini, and more). MLflow also includes a built-in AI Assistant that helps you debug traces and diagnose issues directly within the UI.
MLflow's architecture is intentionally simple: a server, a database, and object storage. You choose the database (PostgreSQL, MySQL, SQLite, AWS RDS, GCP Cloud SQL, Neon, Supabase) and the storage backend (S3, GCS, Azure Blob, HDFS, or local filesystem). Most teams deploy MLflow in minutes using infrastructure they already know. See the Tracing Quickstart to get started.
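As an illustration of that simplicity, a self-hosted deployment can come down to a single command; the hostnames and bucket names below are placeholders:

```shell
# Hypothetical deployment: PostgreSQL for metadata, S3 for artifacts.
mlflow server \
  --host 0.0.0.0 --port 5000 \
  --backend-store-uri postgresql://mlflow:mlflow@db.internal:5432/mlflow \
  --artifacts-destination s3://my-mlflow-artifacts
```

Swap the URI for MySQL, SQLite, or a managed database, and the artifacts destination for GCS or Azure Blob, without changing anything else.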
| Pros | Cons |
|---|---|
| Fully open source (Apache 2.0) with Linux Foundation governance. No feature gating or vendor lock-in. | Might not be the best fit for teams that only need quick prototyping |
| Complete platform: tracing, evaluation, prompt optimization, and governance in one tool | Broader feature set than single-purpose tracing tools, which may not be needed for simple use cases |
| Simple self-hosting with flexible backends | |
MLflow provides native SDKs for both Python and TypeScript. On top of that, the tracing API is built on OpenTelemetry, so any language with an OTel SDK can export traces to MLflow's tracking server, giving you broad compatibility beyond the first-party SDKs.
MLflow's self-hosted architecture scales with your choice of database and storage backend. Teams run PostgreSQL or MySQL for metadata and S3/GCS/Azure Blob for artifacts. There are no vendor-imposed retention limits or per-trace pricing.
Every feature in MLflow is available under the Apache 2.0 license with no enterprise paywall. The project is governed by the Linux Foundation, which ensures long-term neutrality. Databricks offers a managed version for teams that prefer not to self-host, but the open source release is fully featured.
MLflow provides auto-instrumentation for 60+ frameworks via OpenTelemetry, including OpenAI Agents SDK, LangGraph, LlamaIndex, DSPy, Pydantic AI, CrewAI, Anthropic, AWS Bedrock, Google ADK, and more. See the full list in the integrations docs.
Yes. MLflow is available as a managed service on multiple cloud platforms, including Databricks, Amazon SageMaker, Azure ML, Red Hat OpenShift AI, and Nebius AI. All of these offer the same MLflow features without the overhead of self-hosting.
Langfuse is an open source observability platform focused primarily on tracing and monitoring LLM applications. Built around ClickHouse for its analytical query engine, Langfuse provides a clean UI for exploring traces, a prompt playground for manual iteration, basic LLM-as-judge scoring, and cost analytics. Teams already invested in the ClickHouse ecosystem will feel at home, though others may find the infrastructure requirements steep.

Langfuse is open source under the MIT license. Teams can self-host Langfuse on their own infrastructure, though doing so requires running 5+ services including ClickHouse, PostgreSQL, Redis, and the Langfuse application server. The managed (cloud) version handles this complexity for you, with free and paid tiers.
Langfuse offers a prompt playground that lets you iterate on prompts directly within the UI. You can compare outputs across model configurations and test prompt variations side-by-side, which is useful for manual prompt engineering workflows. Basic LLM-as-judge scoring is also available for lightweight evaluation.
Built on ClickHouse, Langfuse can handle high-throughput trace ingestion and provides fast analytical queries over large datasets. Teams already running ClickHouse infrastructure will appreciate the familiar operational model. However, this also means self-hosting locks you into ClickHouse as a dependency, and there is no option to swap in a different database backend.
| Pros | Cons |
|---|---|
| Open source (MIT) with self-hosting support | Self-hosting requires ClickHouse expertise and running 5+ services |
| Playground experience for manual prompt engineering | Key features like SSO, RBAC, and advanced evaluation are gated behind paid plans |
| Strong analytics backend with ClickHouse | Steep operational overhead and frequent architecture changes in the past |
The core is MIT-licensed, but enterprise features in the `ee` folders have separate licensing. Some capabilities, like SSO and advanced RBAC, require a paid plan.
Native SDKs are available for Python and TypeScript. Other languages require building custom wrappers around the REST API.
No. Langfuse's self-hosted version requires ClickHouse for trace storage. There is no option to swap in PostgreSQL, MySQL, or another analytical database.
Langfuse was acquired by ClickHouse, Inc. The long-term product roadmap and investment level remain to be seen. Check the official Langfuse blog for the latest updates.
LangSmith is the commercial observability platform built by LangChain. It provides detailed tracing, evaluation, and monitoring capabilities with strong support for LangChain and LangGraph applications, though teams using other frameworks may find the experience less polished.

LangSmith provides the richest tracing detail for applications built on LangChain and LangGraph, with native agent graph visualization and annotation queues for structured human review. If your stack is LangChain-centric, the first-party experience is polished.
LangSmith provides a rich set of AI-powered features, including Polly AI Assistant, topic clustering, and Insights Agent, which use LLMs to analyze your trace data on your behalf. Some of these features are available only in paid plans (Plus or Enterprise).
LangSmith is a proprietary SaaS platform and there is no self-hosted option outside enterprise contracts. The free tier is limited to 5,000 traces per month with 14-day retention, and per-seat pricing ($39/seat/month on Plus) can limit access for PMs and QA.
| Pros | Cons |
|---|---|
| Strong integration with LangChain/LangGraph ecosystem | Proprietary, closed-source. No self-hosted option outside enterprise contracts |
| Visual no-code agent authoring experience with LangSmith Studio | Per-seat pricing can limit collaboration and sharing across teams |
| AI-powered features like Polly AI Assistant, topic clustering, and Insights Agent for automated trace analysis | Feature parity lags for integrations outside the LangChain ecosystem |
No. LangSmith supports other frameworks via OpenTelemetry ingestion and a traceable wrapper. However, community feedback suggests the experience is most polished for LangChain and LangGraph applications.
Self-hosting is only available on the Enterprise tier. The free and Plus plans are SaaS-only with data stored on LangChain's infrastructure.
The free tier includes 5,000 traces/month with 14-day retention. Plus is $39/seat/month with higher limits. Extended retention (400 days) is available as a paid add-on. Enterprise pricing is custom.
LangSmith offers Polly AI Assistant for natural-language trace debugging, topic clustering for automatic behavior categorization, and an Insights Agent that prioritizes improvements by frequency and impact. Some of these features are gated behind Plus or Enterprise plans.
Arize Phoenix is the open source observability tool from Arize AI, a company that started in classical ML monitoring and is now expanding into the GenAI space. That monitoring heritage shows in Phoenix's strengths: built-in evaluation metrics, drift detection, and trace analytics.

Phoenix ships with 50+ research-backed metrics covering faithfulness, relevance, safety, toxicity, and hallucination detection. Multi-step agent trajectory analysis helps teams understand complex agent behavior, and advanced analytics include trace clustering, anomaly detection, and retrieval relevancy visualization for RAG pipelines.
Phoenix maintains the OpenInference standard, a set of instrumentation SDKs built on OpenTelemetry that provide framework-native tracing across 40+ integrations. The open source version is available for self-hosting on a single node.
Phoenix uses the Elastic License 2.0 (ELv2), which restricts offering the software as a managed service. High-value features like the Alyx Copilot, online evaluations, and monitoring are gated behind paid plans. Phoenix does not offer prompt optimization, an AI gateway, or governance capabilities, and scaling beyond single-node deployments requires additional planning. The project is backed by Arize AI, so its long-term roadmap may be influenced by commercial priorities.
| Pros | Cons |
|---|---|
| Source-available under ELv2 and self-hostable | High-value features like Alyx Copilot and online evaluations are gated behind paid plans |
| Strong set of research-backed evaluation metrics out of the box | Limited evaluation options beyond the built-in metrics, such as no multi-turn evaluation |
| Maintains OpenInference, instrumentation SDKs for OpenTelemetry that provide framework-native tracing | Elastic License 2.0 restricts offering the software as a managed service |
Phoenix is source-available under the Elastic License 2.0 (ELv2), which allows free use but restricts offering the software as a managed service. This is a more restrictive license than Apache 2.0 or MIT.
OpenInference is a set of custom instrumentation SDKs built on top of OpenTelemetry, maintained by Arize AI. It provides framework-native tracing for 40+ integrations including LlamaIndex, LangChain, and DSPy.
The open source version is designed for single-node deployment. Scaling beyond that requires the commercial Arize AX platform, which offers managed cloud hosting with tiered pricing.
Phoenix focuses on research-backed metrics (faithfulness, toxicity, hallucination detection) and works well for RAG evaluation. MLflow covers a broader evaluation surface including multi-turn evaluation, LLM judge alignment with human feedback, and automated prompt optimization based on eval results. The two can be used together since MLflow integrates with Phoenix as an eval library.
Braintrust is a commercial AI observability platform designed for speed and ease of use, targeting teams where not everyone is deeply technical. Its purpose-built database (Brainstore) can efficiently analyze production traces, and its AI proxy provides automatic logging of LLM calls with minimal setup, though deeper agent-level tracing still requires SDK instrumentation.

Braintrust's purpose-built Brainstore database is designed for AI workload patterns, delivering fast query performance over production traces. The UI is approachable for prompt iteration and output comparison, with 25+ built-in scorers and the ability to generate custom scorers from natural language descriptions.
The recent addition of AI Gateway (formerly the AI proxy) lets you route requests to many LLM providers through a unified API. It provides basic controls for managing LLM access, such as caching, logging, and access control.
Braintrust is a proprietary SaaS platform with no self-hosted option, so trace data stays with the vendor. The jump from the free to paid tiers is steep ($249/mo for Pro). It does not offer built-in prompt optimization or a broader governance layer, and framework integration coverage is narrower than on platforms with native OpenTelemetry support.
| Pros | Cons |
|---|---|
| Fast analytics on high-volume traces with purpose-built database | Proprietary SaaS with no self-hosted option. Trace data stays with the vendor. |
| Approachable UI for prompt iteration and non-technical stakeholders | Steep pricing jump from free to paid tiers ($249/mo for Pro) |
| AI proxy provides automatic LLM call logging with minimal setup | Narrow integration coverage for agent frameworks. |
No. Braintrust is a proprietary SaaS platform with no self-hosted option. All trace data is stored on Braintrust's infrastructure.
Brainstore is Braintrust's purpose-built database designed for AI workload patterns. It enables fast analytical queries over millions of production traces.
Braintrust offers a free tier with limited usage. The jump to Pro is $249/month, which can be steep for smaller teams. There is no public Enterprise pricing.
All trace data is stored on Braintrust's infrastructure with no option to self-host or bring your own storage. Data retention is 14 days on the Starter plan and 30 days on Pro. Teams that need full control over trace data ownership should consider an open source alternative.
All five tools on this list can capture traces. The difference lies in what happens after you collect that data, and how much control you retain over it. Before picking a tool, consider these criteria:
Traces are among the most valuable assets an AI team generates. They encode how your agent reasons, what tools it calls, and where it fails. Ask yourself: who owns this data? Can you store it on your own infrastructure? Can you query it with your own tools, or are you locked into a vendor's retention window and export format?
Observability is only the first step. To ship reliable agents you also need evaluation, governance, and an AI gateway, ideally within the same platform so insights flow naturally from traces to improvements. A tool that only does tracing will force you to stitch together separate solutions for each of these stages, increasing complexity and maintenance cost.
Agent frameworks evolve fast. The tool you choose should not tie you to a single framework, a single database, or a single vendor. Native OpenTelemetry support, a permissive open source license, and simple self-hosting options all protect you from lock-in as your stack changes.
For teams who care about trace data ownership and want to get the most value from that data to build production-grade agents, MLflow is our top recommendation. It is the only tool on this list that is fully open source under the Apache 2.0 license, backed by the Linux Foundation, and offers observability, evaluation, governance, and an AI gateway in a single platform, with no enterprise paywall on any feature. Get started with the Tracing Quickstart.
Langfuse is a reasonable self-hosted alternative if your team is already invested in ClickHouse, but it covers tracing and prompt management only; the stack tends to grow as you add LiteLLM and separate evaluation tools. LangSmith provides the deepest LangChain/LangGraph integration, though it is proprietary and pricing scales with volume. Arize Phoenix offers strong research-backed evaluation metrics, but its ELv2 license and single-node open source deployment limit flexibility. Braintrust suits fast prototyping with non-technical stakeholders, but it is a proprietary SaaS with no self-hosting option.
Agent observability is end-to-end visibility into every step an AI agent takes in production: LLM calls, tool invocations, retrieval steps, planning decisions, and the cascading effects between them. Unlike traditional APM that monitors latency and errors, agent observability tracks output quality, faithfulness, safety, and behavioral drift. Learn more on the AI Observability page.
Traces are among the most valuable data an AI team generates. Open source ensures you own that data, can self-host on your own infrastructure, avoid vendor lock-in, and maintain full transparency into how your observability stack works. Learn more on the AI Platform page.
OpenTelemetry is an open source project that provides a vendor-neutral standard for collecting, processing, and exporting telemetry data. It is widely used across observability tools, and choosing OpenTelemetry-compatible platforms helps keep your trace data portable. Learn more on the OpenTelemetry website.
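As an illustration of that portability, OpenTelemetry SDKs read standard exporter environment variables, so retargeting another backend often means changing configuration rather than code; the endpoint below is a placeholder:

```shell
# Standard OTLP exporter settings honored by OpenTelemetry SDKs.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://collector.example.com:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="my-agent"
```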
There are other important concepts in LLMOps, such as prompt management, cost control, and governance. See the full guide on the LLMOps page.
Adoption is usually quick, especially with a tool like MLflow that is designed to be easy to use and self-host. For example, to instrument an application built with the OpenAI Agents SDK, you just need to add a single `mlflow.openai.autolog()` call to your application code. This is why most teams start with observability as a first step in their LLMOps journey.
Traditional APM focuses on monitoring application performance and health, including latency, errors, and throughput. Agent observability focuses on agent behavior, including output quality, tool usage, and planning decisions. It gives you a more complete view of how the agent behaves in production.