What is AI agent observability?

AI agent observability is the practice of capturing, indexing and analyzing every interaction in an agent run — the prompt sent to the model, every tool call and its result, every reasoning step, latency, token usage and cost — so you can debug, replay and evaluate agents in production. It's the LLM-native cousin of APM (application performance monitoring), built around traces of nested model + tool calls rather than HTTP spans.

What are the best AI agent observability platforms in 2026?

The four most-used platforms are LangSmith (LangChain's first-party), Langfuse (open-source, self-hostable), Helicone (proxy-based, lightest integration) and Arize / Phoenix (heavier enterprise ML platform). LangSmith is the default if you're on LangGraph; Langfuse if you want self-hosting and framework-agnostic tooling; Helicone if you want one-line setup; Arize for regulated enterprise with drift detection requirements.

What's the difference between LLM observability and agent observability?

LLM observability traces single model calls — input prompt, output, tokens, latency. Agent observability traces full multi-step runs — the parent trace contains nested spans for each model call, tool call and sub-agent, with state passed between them. Agent observability has to handle branching, retries, parallel tool calls and multi-agent handoffs, none of which classical LLM logging captures cleanly.

How much overhead does observability add to my agent?

Proxy-based platforms (Helicone) add 5–30 ms per model call because the request hops through an extra service. SDK-based platforms (LangSmith, Langfuse) add roughly 0–5 ms because logging is async. Self-hosted Langfuse with a local Redis queue is effectively zero overhead on the hot path. For most agents the overhead is invisible next to model latency (2–15 s).

Can I self-host AI agent observability?

Yes — Langfuse and Phoenix (the open-source side of Arize) are designed for self-hosting and ship Docker / Helm / Terraform manifests. LangSmith is cloud-only except for its enterprise self-hosted tier. Helicone offers a self-hosted option but most users stay on the cloud proxy. Self-hosting matters most for regulated industries that can't ship prompts containing PII through a third-party SaaS.

AI Agent Observability 2026: LangSmith vs Langfuse vs Helicone vs Arize

LangSmith, Langfuse, Helicone and Arize are the four AI agent observability platforms most production teams reach for in 2026. They look superficially similar — capture traces, log tokens, surface failures — but pick wrong and you'll fight your tooling for a year. This guide is the head-to-head: what each one traces, integration weight, pricing, self-hosting, eval features and the buying call for each company size.

Without observability, debugging an agent in production is impossible. We mean impossible literally — agent runs are stochastic, multi-step, partially-cached, and frequently fan out to several tools and sub-agents. A "what happened?" question that takes 30 seconds in a traditional API stack takes 30 minutes (or never resolves) in an unobserved agent stack.

This article sits next to our agent stack reference architecture and our agent evaluation guide. For the broader concept, see LLM observability and agent observability in the glossary.

The four platforms at a glance

	LangSmith	Langfuse	Helicone	Arize / Phoenix
Maker	LangChain	Langfuse GmbH	Helicone	Arize AI
Open source	No (SaaS-only, self-host enterprise tier)	Yes (MIT + paid cloud)	Yes (cloud + self-host)	Phoenix is OSS; Arize is SaaS
Integration model	SDK / LangGraph callback	SDK + OpenTelemetry	HTTP proxy / SDK	SDK / OpenTelemetry
Setup time	5 min	10 min (cloud), 30 min (self-host)	1 min	15–30 min
Framework lock-in	Strong toward LangChain stack	None	None	None
Eval features	Strong (built in)	Strong (LLM-as-judge + datasets)	Light (added 2025)	Strong (ML-flavored)
Self-host story	Enterprise tier only	Free + production-ready	Optional, supported	Phoenix free, Arize SaaS
Pricing entry point	Free up to 5K traces/mo	Free up to 50K events/mo	Free up to 100K req/mo	Phoenix free; Arize is enterprise
Best for	LangGraph teams	Framework-agnostic, self-host	Drop-in lightweight	Regulated / ML-mature shops

Why agent observability is its own category

Classical APM (Datadog, New Relic) traces HTTP spans. LLM logging tools trace prompt + completion pairs. Neither captures the structure of a real agent run, which looks more like this:

Trace: trace_a91 — "Refund request from customer u_42"
 ├─ Span 1: classify_intent (model: haiku, 38 ms, 142 tokens)
 ├─ Span 2: ReAct loop (model: sonnet, 4 turns, 11.2 s, 4,820 tokens)
 │   ├─ Turn 1 thought
 │   ├─ Turn 1 tool call: billing.get_invoices(u_42) — 88 ms
 │   ├─ Turn 1 observation
 │   ├─ Turn 2 thought
 │   ├─ Turn 2 tool call: policy.lookup_refund_eligibility — 22 ms
 │   ├─ Turn 2 observation
 │   ├─ Turn 3 thought
 │   ├─ Turn 3 tool call: refunds.create_draft(amount=42.10) — 311 ms
 │   └─ Turn 3 observation
 ├─ Span 3: human_in_the_loop wait (4 min 11 s)
 └─ Span 4: refunds.confirm(draft_id=…) — 89 ms

Total: 4 min 27 s, $0.041 in tokens, 1 human review

Agent observability has to handle nested spans, asynchronous human-in-the-loop pauses, branched tool calls, retries, and the fact that the same logical "run" can span minutes of wall-clock with model calls scattered through it. That's why purpose-built tools exist.

What you must log per run

The non-negotiable fields, before you pick a vendor:

Trace ID propagated end-to-end across every model call and sub-agent.
Full input prompt including system prompt and tool definitions (not just the user message).
Each tool call — name, parameters (with secrets redacted), response, latency.
Each model response — reasoning, tool calls emitted, final answer.
Token counts — prompt / completion / cached, by model.
Cost computed per call and rolled up to the trace.
Errors and retries with stack and attempt number.
User feedback if any (thumbs up/down, edits to output, support tickets back-linking).
Eval scores when evals run on this trace (judge model, rubric version, score).
A replay-ready snapshot so you can re-run the same trace against a candidate change.

If your observability vendor can't capture all ten, walk.

LangSmith — the default if you're on LangGraph

Maker: LangChain. Integration: Drop-in callback for LangChain / LangGraph; SDK for raw OpenAI/Anthropic calls.

Strengths:

Tightest integration with LangGraph — every node in your graph becomes a span automatically.
First-class eval product (datasets, regression runs, LLM-as-judge).
Prompt registry — versioned prompts, A/B between them in production.
Annotation queue for human review at scale.

Weaknesses:

Cloud-only on the free tier. Self-host requires the enterprise plan.
Stronger fit for LangChain stacks than for OpenAI Agents SDK, CrewAI or your own framework.
Pricing scales aggressively at high trace volume.

Pricing (2026): Free up to 5,000 traces/month, paid plans from ~$39/seat/month + usage. Enterprise self-hosted: custom.

Pick LangSmith if: you're on LangGraph and have engineering bandwidth to use the eval product fully.

Langfuse — the framework-agnostic choice

Maker: Langfuse GmbH. Integration: SDK (Python/JS/Go), OpenTelemetry, decorator pattern, also drop-in for major frameworks.

Strengths:

Open source (MIT). Self-hosting is a first-class story — Docker Compose for dev, Helm for production.
Framework-agnostic. Works equally well with LangGraph, OpenAI Agents SDK, CrewAI, AutoGen or your own loop.
Strong eval suite — datasets, LLM-as-judge, programmatic evals.
Prompt management with versioning.
Generous free tier (50K events/month on cloud).

Weaknesses:

The cloud UX is slightly less polished than LangSmith.
Multi-modal traces (vision, audio) still maturing.

Pricing (2026): Free cloud up to 50K events/month, paid plans from ~$59/month. Self-host: free, you pay your own infra (typically a single Postgres + Redis + Clickhouse stack).

Pick Langfuse if: you want self-host, framework-agnostic, or are cost-sensitive at scale.

Helicone — the one-line drop-in

Maker: Helicone (YC W23). Integration: HTTP proxy — change your base URL to oai.helicone.ai (or equivalent) and you're done. Also offers SDK for richer traces.

Strengths:

Easiest setup of the four. Genuinely one line of code.
Excellent cost dashboards out of the box.
Caching layer can reduce model spend 20–60% on workloads with repeated prompts. See prompt caching.
Rate-limiting and request hedging built in.
Strong for pure-LLM workloads moving toward agents.

Weaknesses:

Proxy-based architecture adds 5–30 ms per call.
Less rich agent-specific tracing than LangSmith / Langfuse — spans are flatter.
Eval product is newer (added 2025, still catching up).
Self-host exists but is less battle-tested.

Pricing (2026): Free up to 100K requests/month, paid from $0.0005/request after. Self-host: free.

Pick Helicone if: you want to add observability today with minimum integration work, or you mainly care about cost / caching.

Arize / Phoenix — for ML-mature enterprises

Maker: Arize AI. Integration: SDK, OpenTelemetry. Phoenix is the open-source local version; Arize is the SaaS enterprise tier.

Strengths:

Deepest drift detection and ML-quality monitoring story — inherited from Arize's classical ML observability heritage.
Strong RAG-specific evaluators (faithfulness, answer relevance, context recall).
Enterprise governance: SSO, RBAC, audit logs, data residency controls.
Phoenix (the OSS side) is excellent for local debugging.

Weaknesses:

Heavier integration than the other three.
More features than most startups will use.
Pricing is enterprise-shaped; not the right tool for a 3-person team.

Pricing (2026): Phoenix is free / self-host. Arize is custom enterprise (typically starts well into 5-figures/year).

Pick Arize if: you're a regulated industry, large multi-tenant SaaS, or already on Arize for classical ML.

Head-to-head on the questions that matter

"Can I see the full prompt my agent sent?"

All four — yes. But LangSmith and Langfuse render multi-turn prompts most readably; Helicone's view is flatter; Arize's is the most technical.

"Can I replay a production trace against a new prompt?"

LangSmith and Langfuse — yes, first-class feature. Helicone — partial (replay request, not full agent). Arize — yes via Phoenix.

"Can I run evals on production traffic?"

LangSmith, Langfuse, Arize — yes, all support sampling production traces and running LLM-as-judge or programmatic evals. Helicone — basic sampling, less rich evals.

"Does it support multi-agent traces?"

LangSmith — yes, natively for LangGraph. Langfuse — yes, via OpenTelemetry or SDK. Helicone — limited (you have to thread the trace ID yourself). Arize — yes.

"Can I redact PII before it leaves my infra?"

Self-hosted Langfuse — yes, you control everything. Self-hosted Phoenix — yes. LangSmith enterprise self-host — yes. Helicone — partial (server-side redaction rules). Cloud SaaS of any of them — only what their redaction rules support.

Buying call by company size

Solo builder / pre-seed startup: Helicone or Langfuse free tier. Setup matters more than feature depth at this stage.

Seed / Series A (3–20 engineers): Langfuse cloud or LangSmith. Pick LangSmith if you're already on LangGraph; pick Langfuse otherwise. Start using the eval product within the first month.

Series B and up (20–100 engineers): Langfuse self-hosted, LangSmith enterprise, or Arize enterprise — driven by self-host requirements and existing ML stack. Run two in parallel for 30 days if you can.

Regulated enterprise (healthcare, finance, public sector): Arize, or Langfuse self-hosted on your own VPC. Cloud-only SaaS is usually a non-starter once compliance reviews start.

For broader buying context see how to pick an AI agent, how to evaluate AI agent and our methodology.

Integration checklist before you commit

Before you make any of these the default in your stack, verify:

If two of the four platforms pass this checklist, take both for a 30-day spin in parallel before committing. The cost of switching observability vendors a year in is enormous — far higher than running two in parallel for a month.

What this means for buyers of AI agents

When you evaluate a vendor on our leaderboard for serious production use, observability is now an explicit scoring axis. The questions to ask:

Does the vendor expose traces of your runs in any of the four platforms above? If they can't even export, that's a red flag.
What's the per-run audit shape? A "screenshot of the dashboard" isn't an audit trail.
Can your security team self-host the observability layer if needed?
What's the retention policy on the vendor's side? GDPR / HIPAA / SOX will dictate.

The agent landscape is full of vendors who treat observability as an afterthought. Their pitches sound great in demos and fall apart in production. The vendors who win in 2026 — and the ones we score highest in our methodology — are the ones who'd let you watch the agent think live, with the receipts to back up every claim.

AI Agent Observability 2026: LangSmith vs Langfuse vs Helicone vs Arize

The four platforms at a glance

Why agent observability is its own category

What you must log per run

LangSmith — the default if you're on LangGraph

Langfuse — the framework-agnostic choice

Helicone — the one-line drop-in

Arize / Phoenix — for ML-mature enterprises

Head-to-head on the questions that matter

"Can I see the full prompt my agent sent?"

"Can I replay a production trace against a new prompt?"

"Can I run evals on production traffic?"

"Does it support multi-agent traces?"

"Can I redact PII before it leaves my infra?"

Buying call by company size

Integration checklist before you commit

What this means for buyers of AI agents

Agents mentioned in this post

Keep exploring

Head-to-head comparisons

By industry

By role

Terms used in this post

More from the blog

The 2026 AI Agent Stack: Reference Architecture Buyers Can Actually Use

Agentic AI Design Patterns 2026: The 9 AI Agent Patterns You Need

AI Agent Memory in 2026: Vector, Episodic and Semantic — Explained

AIエージェント比較 2026 — おすすめ7選とカテゴリー別の選び方

AIエージェントとは何か？2026年の現在地と実用化ガイド

Comment créer un agent IA en 2026 — le guide complet

The four platforms at a glance

Why agent observability is its own category

What you must log per run

LangSmith — the default if you're on LangGraph

Langfuse — the framework-agnostic choice

Helicone — the one-line drop-in

Arize / Phoenix — for ML-mature enterprises

Head-to-head on the questions that matter

"Can I see the full prompt my agent sent?"

"Can I replay a production trace against a new prompt?"

"Can I run evals on production traffic?"

"Does it support multi-agent traces?"

"Can I redact PII before it leaves my infra?"

Buying call by company size

Integration checklist before you commit

What this means for buyers of AI agents

Agents mentioned in this post

Keep exploring

Head-to-head comparisons

By industry

By role

Terms used in this post

More from the blog

The 2026 AI Agent Stack: Reference Architecture Buyers Can Actually Use

Agentic AI Design Patterns 2026: The 9 AI Agent Patterns You Need

AI Agent Memory in 2026: Vector, Episodic and Semantic — Explained

AIエージェント 比較 2026 — おすすめ7選とカテゴリー別の選び方

AIエージェントとは何か？2026年の現在地と実用化ガイド

Comment créer un agent IA en 2026 — le guide complet

AIエージェント比較 2026 — おすすめ7選とカテゴリー別の選び方