LangSmith, Langfuse, Helicone and Arize are the four AI agent observability platforms most production teams reach for in 2026. They look superficially similar — capture traces, log tokens, surface failures — but pick wrong and you'll fight your tooling for a year. This guide is the head-to-head: what each one traces, integration weight, pricing, self-hosting, eval features and the buying call for each company size.
Without observability, debugging an agent in production is close to impossible. Agent runs are stochastic, multi-step, and partially cached, and they frequently fan out to several tools and sub-agents. A "what happened?" question that takes 30 seconds in a traditional API stack takes 30 minutes (or never resolves) in an unobserved agent stack.
This article sits next to our agent stack reference architecture and our agent evaluation guide. For the broader concept, see LLM observability and agent observability in the glossary.
The four platforms at a glance
| | LangSmith | Langfuse | Helicone | Arize / Phoenix |
|---|---|---|---|---|
| Maker | LangChain | Langfuse GmbH | Helicone | Arize AI |
| Open source | No (SaaS-only, self-host enterprise tier) | Yes (MIT + paid cloud) | Yes (cloud + self-host) | Phoenix is OSS; Arize is SaaS |
| Integration model | SDK / LangGraph callback | SDK + OpenTelemetry | HTTP proxy / SDK | SDK / OpenTelemetry |
| Setup time | 5 min | 10 min (cloud), 30 min (self-host) | 1 min | 15–30 min |
| Framework lock-in | Strong toward LangChain stack | None | None | None |
| Eval features | Strong (built in) | Strong (LLM-as-judge + datasets) | Light (added 2025) | Strong (ML-flavored) |
| Self-host story | Enterprise tier only | Free + production-ready | Optional, supported | Phoenix free, Arize SaaS |
| Pricing entry point | Free up to 5K traces/mo | Free up to 50K events/mo | Free up to 100K req/mo | Phoenix free; Arize is enterprise |
| Best for | LangGraph teams | Framework-agnostic, self-host | Drop-in lightweight | Regulated / ML-mature shops |
Why agent observability is its own category
Classical APM (Datadog, New Relic) traces HTTP spans. LLM logging tools trace prompt + completion pairs. Neither captures the structure of a real agent run, which looks more like this:
Trace: trace_a91 — "Refund request from customer u_42"
├─ Span 1: classify_intent (model: haiku, 38 ms, 142 tokens)
├─ Span 2: ReAct loop (model: sonnet, 4 turns, 11.2 s, 4,820 tokens)
│ ├─ Turn 1 thought
│ ├─ Turn 1 tool call: billing.get_invoices(u_42) — 88 ms
│ ├─ Turn 1 observation
│ ├─ Turn 2 thought
│ ├─ Turn 2 tool call: policy.lookup_refund_eligibility — 22 ms
│ ├─ Turn 2 observation
│ ├─ Turn 3 thought
│ ├─ Turn 3 tool call: refunds.create_draft(amount=42.10) — 311 ms
│ └─ Turn 3 observation
├─ Span 3: human_in_the_loop wait (4 min 11 s)
└─ Span 4: refunds.confirm(draft_id=…) — 89 ms
Total: 4 min 27 s, $0.041 in tokens, 1 human review
Agent observability has to handle nested spans, asynchronous human-in-the-loop pauses, branched tool calls, retries, and the fact that the same logical "run" can span minutes of wall-clock with model calls scattered through it. That's why purpose-built tools exist.
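To make the nesting concrete, here is a minimal sketch of the example trace as a tree of spans, with token counts rolled up from the leaves. The `Span` class, its field names and the `kind` labels are hypothetical and do not correspond to any of the four vendors' SDKs; the durations and token counts are copied from the example trace above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Span:
    """One node in an agent trace (hypothetical shape, not a vendor API)."""
    name: str
    kind: str                 # illustrative labels: "root", "model", "tool", "wait"
    duration_ms: float = 0.0  # wall-clock time of this span, children included
    tokens: int = 0           # tokens billed directly to this span
    children: List["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Tokens for this span plus everything nested under it."""
        return self.tokens + sum(child.total_tokens() for child in self.children)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Span"]]:
        """Yield (depth, span) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


trace = Span("trace_a91", "root", children=[
    Span("classify_intent", "model", duration_ms=38, tokens=142),
    Span("react_loop", "model", duration_ms=11_200, tokens=4_820, children=[
        Span("billing.get_invoices", "tool", duration_ms=88),
        Span("policy.lookup_refund_eligibility", "tool", duration_ms=22),
        Span("refunds.create_draft", "tool", duration_ms=311),
    ]),
    Span("human_in_the_loop", "wait", duration_ms=251_000),  # 4 min 11 s
    Span("refunds.confirm", "tool", duration_ms=89),
])

tool_calls = [span.name for _, span in trace.walk() if span.kind == "tool"]
print(trace.total_tokens(), tool_calls)
```

A flat prompt/completion log loses exactly what `walk()` recovers here: which tool call belongs to which loop, and how deep in the run it happened.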
What you must log per run
The non-negotiable fields, before you pick a vendor:
- Trace ID propagated end-to-end across every model call and sub-agent.
- Full input prompt including system prompt and tool definitions (not just the user message).
- Each tool call — name, parameters (with secrets redacted), response, latency.
- Each model response — reasoning, tool calls emitted, final answer.
- Token counts — prompt / completion / cached, by model.
- Cost computed per call and rolled up to the trace.
- Errors and retries with stack and attempt number.
- User feedback if any (thumbs up/down, edits to output, support tickets back-linking).
- Eval scores when evals run on this trace (judge model, rubric version, score).
- A replay-ready snapshot so you can re-run the same trace against a candidate change.
If your observability vendor can't capture all ten, walk.
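One way to make this list enforceable is to check an exported run record against it during a vendor trial. The sketch below assumes a run is exported as a plain dictionary; the key names and the set of secret parameter names are hypothetical, chosen to mirror the ten fields above rather than any vendor's export schema.

```python
from typing import Any, Dict, List

# Hypothetical key names, one per field in the list above.
REQUIRED_FIELDS = (
    "trace_id", "input_prompt", "tool_calls", "model_responses",
    "token_counts", "cost", "errors", "user_feedback",
    "eval_scores", "replay_snapshot",
)

# Illustrative parameter names to mask before a tool call is logged.
SECRET_KEYS = {"api_key", "authorization", "password", "token"}


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the required keys that are absent from an exported run record.

    A key that is present but empty (no errors, no feedback yet) still
    counts as captured; only a missing key is reported.
    """
    return [name for name in REQUIRED_FIELDS if name not in record]


def redact_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Mask secret-looking tool parameters, leaving the rest untouched."""
    return {
        key: "[REDACTED]" if key.lower() in SECRET_KEYS else value
        for key, value in params.items()
    }


record = {
    "trace_id": "trace_a91",
    "input_prompt": "system prompt + tool definitions + user message",
    "tool_calls": [redact_params({"customer": "u_42", "api_key": "sk-live-123"})],
    "errors": [],
}
print(missing_fields(record))
print(record["tool_calls"][0])
```

Running a check like this against a handful of real traces during evaluation turns "can it capture all ten?" into a yes/no answer per vendor.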
LangSmith — the default if you're on LangGraph
Maker: LangChain. Integration: Drop-in callback for LangChain / LangGraph; SDK for raw OpenAI/Anthropic calls.
Strengths:
- Tightest integration with LangGraph — every node in your graph becomes a span automatically.
- First-class eval product (datasets, regression runs, LLM-as-judge).
- Prompt registry — versioned prompts, A/B between them in production.
- Annotation queue for human review at scale.
Weaknesses:
- Cloud-only on the free tier. Self-host requires the enterprise plan.
- Stronger fit for LangChain stacks than for OpenAI Agents SDK, CrewAI or your own framework.
- Pricing scales aggressively at high trace volume.
Pricing (2026): Free up to 5,000 traces/month, paid plans from ~$39/seat/month + usage. Enterprise self-hosted: custom.
Pick LangSmith if: you're on LangGraph and have engineering bandwidth to use the eval product fully.
Langfuse — the framework-agnostic choice
Maker: Langfuse GmbH. Integration: SDK (Python/JS/Go), OpenTelemetry, decorator pattern, also drop-in for major frameworks.
Strengths:
- Open source (MIT). Self-hosting is a first-class story — Docker Compose for dev, Helm for production.
- Framework-agnostic. Works equally well with LangGraph, OpenAI Agents SDK, CrewAI, AutoGen or your own loop.
- Strong eval suite — datasets, LLM-as-judge, programmatic evals.
- Prompt management with versioning.
- Generous free tier (50K events/month on cloud).
Weaknesses:
- The cloud UX is slightly less polished than LangSmith.
- Multi-modal traces (vision, audio) still maturing.
Pricing (2026): Free cloud up to 50K events/month, paid plans from ~$59/month. Self-host: free, you pay your own infra (typically a single Postgres + Redis + ClickHouse stack).
Pick Langfuse if: you want to self-host, need a framework-agnostic tool, or are cost-sensitive at scale.
Helicone — the one-line drop-in
Maker: Helicone (YC W23).
Integration: HTTP proxy — change your base URL to oai.helicone.ai (or equivalent) and you're done. Also offers SDK for richer traces.
Strengths:
- Easiest setup of the four. Genuinely one line of code.
- Excellent cost dashboards out of the box.
- Caching layer can reduce model spend 20–60% on workloads with repeated prompts. See prompt caching.
- Rate-limiting and request hedging built in.
- Strong for pure-LLM workloads moving toward agents.
Weaknesses:
- Proxy-based architecture adds 5–30 ms per call.
- Less rich agent-specific tracing than LangSmith / Langfuse — spans are flatter.
- Eval product is newer (added 2025, still catching up).
- Self-host exists but is less battle-tested.
Pricing (2026): Free up to 100K requests/month, then from $0.0005/request. Self-host: free.
Pick Helicone if: you want to add observability today with minimum integration work, or you mainly care about cost / caching.
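The base-URL change described above can be sketched as a small helper that builds the client settings. The host comes from the text; the `/v1` path, the `Helicone-Auth` header name and the `HELICONE_API_KEY` environment variable are assumptions to verify against Helicone's current documentation. The helper itself is hypothetical and imports nothing from any SDK.

```python
import os
from typing import Any, Dict

# Host taken from the text; the /v1 path is an assumption.
HELICONE_BASE_URL = "https://oai.helicone.ai/v1"


def helicone_client_kwargs(helicone_api_key: str) -> Dict[str, Any]:
    """Build keyword arguments that point an OpenAI-style client at the proxy.

    The Helicone-Auth header name is an assumption; check it against the
    vendor docs before relying on it.
    """
    return {
        "base_url": HELICONE_BASE_URL,
        "default_headers": {"Helicone-Auth": f"Bearer {helicone_api_key}"},
    }


kwargs = helicone_client_kwargs(
    os.environ.get("HELICONE_API_KEY", "placeholder-key")
)
print(kwargs["base_url"])
```

With the `openai` Python package installed, these settings would typically be passed as `OpenAI(api_key=..., **kwargs)`. Nothing else in the calling code changes, which is what makes the proxy approach a one-line integration.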
Arize / Phoenix — for ML-mature enterprises
Maker: Arize AI. Integration: SDK, OpenTelemetry. Phoenix is the open-source local version; Arize is the SaaS enterprise tier.
Strengths:
- Deepest drift detection and ML-quality monitoring story — inherited from Arize's classical ML observability heritage.
- Strong RAG-specific evaluators (faithfulness, answer relevance, context recall).
- Enterprise governance: SSO, RBAC, audit logs, data residency controls.
- Phoenix (the OSS side) is excellent for local debugging.
Weaknesses:
- Heavier integration than the other three.
- More features than most startups will use.
- Pricing is enterprise-shaped; not the right tool for a 3-person team.
Pricing (2026): Phoenix is free / self-host. Arize is custom enterprise pricing (typically starts well into five figures per year).
Pick Arize if: you're in a regulated industry, run a large multi-tenant SaaS, or are already on Arize for classical ML.
Head-to-head on the questions that matter
"Can I see the full prompt my agent sent?"
All four — yes. But LangSmith and Langfuse render multi-turn prompts most readably; Helicone's view is flatter; Arize's is the most technical.
"Can I replay a production trace against a new prompt?"
LangSmith and Langfuse — yes, first-class feature. Helicone — partial (replay request, not full agent). Arize — yes via Phoenix.
"Can I run evals on production traffic?"
LangSmith, Langfuse, Arize — yes, all support sampling production traces and running LLM-as-judge or programmatic evals. Helicone — basic sampling, less rich evals.
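Whichever platform runs the evals, the sampling decision itself is simple enough to reason about in isolation. Below is a minimal sketch of deterministic sampling by trace ID, assuming you want the same trace to be consistently in or out of the eval set across services. The function name is hypothetical, and this is not how any of the four vendors implement sampling.

```python
import hashlib


def sampled_for_eval(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces for evaluation.

    Hashing the trace ID (instead of calling random()) means every service
    that sees the same trace reaches the same decision.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


selected = [
    trace_id
    for trace_id in (f"trace_{i}" for i in range(10_000))
    if sampled_for_eval(trace_id, rate=0.10)
]
print(len(selected))  # close to 1,000, and the same set on every run
```

Deterministic sampling also makes regressions easier to compare: the judge model scores the same slice of traffic before and after a prompt change.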
"Does it support multi-agent traces?"
LangSmith — yes, natively for LangGraph. Langfuse — yes, via OpenTelemetry or SDK. Helicone — limited (you have to thread the trace ID yourself). Arize — yes.
"Can I redact PII before it leaves my infra?"
Self-hosted Langfuse — yes, you control everything. Self-hosted Phoenix — yes. LangSmith enterprise self-host — yes. Helicone — partial (server-side redaction rules). Cloud SaaS of any of them — only what their redaction rules support.
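If you need redaction to happen before anything leaves your infrastructure, it has to run in your own process, ahead of the exporter. The following is a minimal sketch assuming regex-based masking of emails and card-like numbers in a nested payload. The patterns are deliberately simple and illustrative; they are not a complete PII policy and are not taken from any vendor's redaction rules.

```python
import re
from typing import Any

# Deliberately simple patterns for illustration only.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("CARD", re.compile(r"\b(?:\d[ -]?){13,16}\b")),
]


def redact(value: Any) -> Any:
    """Recursively mask PII-looking substrings in strings, dicts and lists."""
    if isinstance(value, str):
        for label, pattern in PATTERNS:
            value = pattern.sub(f"[{label}]", value)
        return value
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged


payload = {
    "tool": "refunds.create_draft",
    "params": {"amount": 42.10, "note": "Send receipt to jane.doe@example.com"},
    "messages": ["Card on file: 4111 1111 1111 1111."],
}
print(redact(payload))
```

Test any policy like this against a labeled sample of your own traffic; regexes miss names, addresses and free-text identifiers.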
Buying call by company size
Solo builder / pre-seed startup: Helicone or Langfuse free tier. Setup matters more than feature depth at this stage.
Seed / Series A (3–20 engineers): Langfuse cloud or LangSmith. Pick LangSmith if you're already on LangGraph; pick Langfuse otherwise. Start using the eval product within the first month.
Series B and up (20–100 engineers): Langfuse self-hosted, LangSmith enterprise, or Arize enterprise — driven by self-host requirements and existing ML stack. Run two in parallel for 30 days if you can.
Regulated enterprise (healthcare, finance, public sector): Arize, or Langfuse self-hosted on your own VPC. Cloud-only SaaS is usually a non-starter once compliance reviews start.
For broader buying context see how to pick an AI agent, how to evaluate an AI agent and our methodology.
Integration checklist before you commit
Before you make any of these the default in your stack, verify:
- Traces capture all 10 fields listed earlier in this guide.
- Multi-agent traces show the parent-child relationship correctly.
- Tool calls are spans (not buried inside text logs).
- Token cost is computed per call AND rolled up per trace.
- PII redaction policy works on your test set.
- Latency overhead on a hot path stays under your SLA budget.
- Retention policy matches your data governance rules.
- Export to S3 / BigQuery / Snowflake exists for long-term analytics.
- Webhooks fire on traces that match alert rules.
- Replay against a candidate change works on your real trace shape.
If two of the four platforms pass this checklist, take both for a 30-day spin in parallel before committing. The cost of switching observability vendors a year in is enormous — far higher than running two in parallel for a month.
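Two of the checklist items, parent-child integrity and cost rollup, can be verified mechanically on an exported trace during that trial. The sketch below assumes the platform can export spans as a flat list of dictionaries; the key names (`span_id`, `parent_id`, `cost_usd`) and the function itself are hypothetical, so map them to whatever the vendor's export actually contains.

```python
import math
from typing import Any, Dict, List


def check_trace_export(
    spans: List[Dict[str, Any]],
    reported_total_usd: float,
    tolerance: float = 1e-6,
) -> List[str]:
    """Return a list of problems found in one exported trace (empty if none)."""
    problems = []
    span_ids = {s["span_id"] for s in spans}

    # Exactly one root span should exist per trace.
    roots = [s for s in spans if s["parent_id"] is None]
    if len(roots) != 1:
        problems.append(f"expected 1 root span, found {len(roots)}")

    # Every non-root span must point at a parent that was exported too.
    for s in spans:
        if s["parent_id"] is not None and s["parent_id"] not in span_ids:
            problems.append(f"span {s['span_id']} has unknown parent {s['parent_id']}")

    # Per-call costs should add up to the total the platform reports.
    summed = sum(s.get("cost_usd", 0.0) for s in spans)
    if not math.isclose(summed, reported_total_usd, abs_tol=tolerance):
        problems.append(f"cost rollup mismatch: spans sum to {summed:.6f}")

    return problems


spans = [
    {"span_id": "root", "parent_id": None, "cost_usd": 0.0},
    {"span_id": "classify", "parent_id": "root", "cost_usd": 0.001},
    {"span_id": "react", "parent_id": "root", "cost_usd": 0.040},
    {"span_id": "tool_1", "parent_id": "react"},
]
print(check_trace_export(spans, reported_total_usd=0.041))
```

Running the same check against both candidate platforms on identical traffic gives a like-for-like comparison instead of a dashboard impression.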
What this means for buyers of AI agents
When you evaluate a vendor on our leaderboard for serious production use, observability is now an explicit scoring axis. The questions to ask:
- Does the vendor expose traces of your runs in any of the four platforms above? If they can't even export, that's a red flag.
- What's the per-run audit shape? A "screenshot of the dashboard" isn't an audit trail.
- Can your security team self-host the observability layer if needed?
- What's the retention policy on the vendor's side? GDPR / HIPAA / SOX will dictate.
The agent landscape is full of vendors who treat observability as an afterthought. Their pitches sound great in demos and fall apart in production. The vendors who win in 2026 — and the ones we score highest in our methodology — are the ones who'd let you watch the agent think live, with the receipts to back up every claim.