
LLM Observability Deep Dive: How to Monitor, Trace, and Debug AI Agents in Production

Your AI agent just cost a customer $2,400. It entered an infinite tool-calling loop at 3 AM, burning through tokens while generating nonsensical responses. Your traditional APM dashboard shows everything green: latency is normal, no errors, no crashes. But the agent has been confidently wrong for six hours straight, and you had zero visibility into it.

This is the observability gap that kills AI products. Traditional monitoring tools were built for deterministic software: request in, response out, measure the time. AI agents are fundamentally different. They reason, branch, call tools, retrieve documents, and make decisions that vary across identical inputs. When something goes wrong, you can't just check the HTTP status code. You need to trace the reasoning chain: every decision point, every tool invocation, every token consumed.

This guide covers everything you need to build production-grade observability for LLM-powered systems: from distributed tracing and automated evaluations to cost tracking and the tooling landscape. No theory. Battle-tested patterns from teams running agents that handle millions of requests per day.

Why Traditional Monitoring Fails for LLM Applications

If you're running AI agents with Datadog, Grafana, or New Relic alone, you're flying blind. Here's why:

The Determinism Problem

Traditional software is deterministic. Given the same input, you get the same output. Monitoring is straightforward: track latency, error rates, and throughput. If the P99 latency spikes, you investigate.

LLMs are non-deterministic. The same prompt can produce different outputs every time. A "successful" HTTP 200 response might contain a completely hallucinated answer. Your error rate is 0%, but your accuracy is 40%. Traditional APM tools literally cannot detect this failure mode.

The Multi-Step Problem

A simple API call is a single span: request → response. An AI agent is a complex execution graph:

User Query: "Find the cheapest flight from NYC to Tokyo next month"
│
├─ Step 1: Intent Classification (LLM call, 200ms, 150 tokens)
├─ Step 2: Parameter Extraction (LLM call, 180ms, 120 tokens)
├─ Step 3: Tool Call - Flight API (external API, 2.1s)
├─ Step 4: Result Parsing (LLM call, 250ms, 800 tokens)
├─ Step 5: Price Comparison (LLM call, 300ms, 1200 tokens)
├─ Step 6: Response Generation (LLM call, 400ms, 500 tokens)
│
Total: 5 LLM calls, 3.4s, 2770 tokens, $0.008

When this agent returns wrong results, which step failed? Was it the intent classification? Did the tool return bad data? Did the LLM hallucinate during price comparison? Without step-level tracing, debugging is impossible.

The Cost Problem

LLM calls are expensive. Unlike traditional compute where CPU cycles are essentially free, every token has a direct dollar cost. A single runaway agent loop can burn through hundreds of dollars in minutes. You need real-time cost tracking at the agent, user, and organization level, something no traditional APM tool provides.

The LLM Observability Stack

Production-grade LLM observability requires four layers:

┌──────────────────────────────────────────────────────┐
│                  Layer 4: DASHBOARDS                 │
│     Cost analytics, quality trends, SLA tracking     │
├──────────────────────────────────────────────────────┤
│                  Layer 3: EVALUATION                 │
│   Automated evals, regression detection, A/B tests   │
├──────────────────────────────────────────────────────┤
│                  Layer 2: TRACING                    │
│  Distributed traces, span hierarchy, token tracking  │
├──────────────────────────────────────────────────────┤
│                  Layer 1: INSTRUMENTATION            │
│   SDK integration, auto-capture, manual annotations  │
└──────────────────────────────────────────────────────┘

Let's build each layer from the ground up.

Layer 1: Instrumentation

Instrumentation is the foundation. You need to capture data at every decision point without destroying your application's performance.

OpenTelemetry for LLMs

The industry is converging on OpenTelemetry (OTel) as the standard instrumentation layer. The OpenLLMetry project extends OTel with LLM-specific semantic conventions:

import * as traceloop from '@traceloop/node-server-sdk';

// Initialize before importing any LLM modules
traceloop.initialize({
  baseUrl: 'https://your-collector.example.com',
  appName: 'my-ai-agent',
});

// Or use the modular approach with OpenTelemetry directly:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OpenAIInstrumentation } from '@traceloop/instrumentation-openai';
import { AnthropicInstrumentation } from '@traceloop/instrumentation-anthropic';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'https://your-collector.example.com/v1/traces',
  }),
  instrumentations: [
    new OpenAIInstrumentation({
      captureInputs: true,   // Log prompts (careful in prod!)
      captureOutputs: true,  // Log completions
    }),
    new AnthropicInstrumentation(),
  ],
});

sdk.start();

This auto-instruments every OpenAI and Anthropic API call, capturing:

  • Model name and parameters (temperature, max_tokens)
  • Input prompts and output completions
  • Token usage (prompt tokens, completion tokens)
  • Latency per call
  • Tool/function call details

Manual Span Annotations

Auto-instrumentation captures the LLM calls, but you need manual spans for business logic:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('ai-agent');

async function processUserQuery(query: string, userId: string) {
  return tracer.startActiveSpan('agent.process_query', async (span) => {
    // Add business context
    span.setAttributes({
      'user.id': userId,
      'agent.query': query,
      'agent.type': 'flight-search',
    });

    try {
      // Step 1: Classify intent
      const intent = await tracer.startActiveSpan(
        'agent.classify_intent',
        async (intentSpan) => {
          const result = await classifyIntent(query);
          intentSpan.setAttributes({
            'agent.intent': result.intent,
            'agent.confidence': result.confidence,
          });
          intentSpan.end(); // startActiveSpan does not end spans for you
          return result;
        }
      );

      // Step 2: Execute tool calls
      const toolResults = await tracer.startActiveSpan(
        'agent.execute_tools',
        async (toolSpan) => {
          toolSpan.setAttribute('agent.tools_count', intent.tools.length);
          const results = await Promise.all(
            intent.tools.map((tool) => executeTool(tool))
          );
          toolSpan.end();
          return results;
        }
      );

      // Step 3: Generate response
      const response = await tracer.startActiveSpan(
        'agent.generate_response',
        async (respSpan) => {
          const result = await generateResponse(toolResults);
          respSpan.setAttributes({
            'agent.response_length': result.length,
            'agent.tokens_total': result.tokenUsage.total,
          });
          respSpan.end();
          return result;
        }
      );

      span.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message,
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

What to Capture (and What Not To)

A critical decision: what data do you log?

Data                     | Capture in Dev | Capture in Prod | Why
Full prompts             | ✅ Yes         | ⚠️ Sampled      | PII risk, storage cost
Full completions         | ✅ Yes         | ⚠️ Sampled      | Same as above
Token counts             | ✅ Yes         | ✅ Yes          | Cost tracking is critical
Model parameters         | ✅ Yes         | ✅ Yes          | Debugging regressions
Tool call inputs/outputs | ✅ Yes         | ✅ Yes          | Essential for debugging
User IDs                 | ✅ Yes         | ✅ Yes          | Per-user cost tracking
Latency per step         | ✅ Yes         | ✅ Yes          | Performance monitoring
Embedding vectors        | ❌ No          | ❌ No           | Too large, rarely useful
Raw API responses        | ✅ Yes         | ❌ No           | Storage explosion

In production, use sampling for full prompt/completion logging. Capture 100% of metadata (tokens, latency, model) but only 10-20% of full text content. When debugging a specific issue, temporarily increase sampling for targeted users or queries.
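Sampling by trace ID keeps the decision deterministic, so every span in a trace is either fully captured or not captured at all. A minimal sketch, assuming nothing beyond this article: the hash scheme, the 10% default rate, and the debug-user override are all illustrative, not tied to any particular SDK.

```typescript
// Decide whether to capture full prompt/completion text for a trace.
// Hashing the trace ID makes the decision deterministic: every span in
// the same trace gets the same answer, so traces are never half-logged.
function hashTraceId(traceId: string): number {
  let hash = 0;
  for (let i = 0; i < traceId.length; i++) {
    hash = (hash * 31 + traceId.charCodeAt(i)) >>> 0; // unsigned 32-bit
  }
  return hash;
}

function shouldCaptureFullText(traceId: string, rate = 0.1): boolean {
  return hashTraceId(traceId) / 0xffffffff < rate;
}

// Debug override: force full capture for specific users under investigation,
// regardless of the sampling rate.
function captureDecision(
  traceId: string,
  userId: string,
  debugUsers: Set<string>,
  rate = 0.1
): boolean {
  if (debugUsers.has(userId)) return true;
  return shouldCaptureFullText(traceId, rate);
}
```

Because the decision is a pure function of the trace ID, you can recompute it anywhere in your pipeline without coordinating state between services.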

Layer 2: Distributed Tracing

With instrumentation in place, you need a tracing backend that understands LLM-specific data. This is where purpose-built tools shine.

Trace Structure for AI Agents

A well-structured trace for an AI agent looks like this:

Trace: agent_run_abc123
│
├─ Span: agent.process_query (root)
│  ├─ Attributes: user_id, query, session_id
│  │
│  ├─ Span: agent.classify_intent
│  │  ├─ Span: llm.openai.chat (model: gpt-4.1-mini)
│  │  │  └─ Attributes: tokens_in=150, tokens_out=30, cost=$0.0001
│  │  └─ Result: intent=flight_search, confidence=0.95
│  │
│  ├─ Span: agent.retrieve_context (RAG)
│  │  ├─ Span: vectordb.query (provider: pinecone)
│  │  │  └─ Attributes: top_k=5, similarity_threshold=0.8
│  │  └─ Span: agent.rerank
│  │     └─ Span: llm.anthropic.chat (model: claude-haiku-4.5)
│  │        └─ Attributes: tokens_in=2000, tokens_out=500
│  │
│  ├─ Span: agent.execute_tool
│  │  ├─ Span: tool.flight_api.search
│  │  │  └─ Attributes: duration=2100ms, results_count=15
│  │  └─ Span: tool.flight_api.get_prices
│  │     └─ Attributes: duration=800ms, results_count=15
│  │
│  └─ Span: agent.generate_response
│     └─ Span: llm.openai.chat (model: gpt-4.1)
│        └─ Attributes: tokens_in=3000, tokens_out=800, cost=$0.02
│
└─ Total: 3 LLM calls, 6480 tokens, $0.021, 4.2s

This structure lets you answer questions like:

  • "Why did this agent take 10 seconds?" → The flight API call took 8 seconds.
  • "Why did this cost $2 instead of $0.02?" → The agent looped 100 times on tool calls.
  • "Why did the agent hallucinate?" → The RAG retrieval returned irrelevant documents with low similarity scores.
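The first two questions can often be answered mechanically by walking the spans of a trace. A minimal sketch, assuming a simplified span shape modeled on the attributes shown above (the `Span` interface here is illustrative, not any tracing SDK's type):

```typescript
interface Span {
  name: string;
  durationMs: number;
  attributes: { tokens_in?: number; tokens_out?: number; cost?: number };
}

// "Why did this agent take 10 seconds?" → find the span that dominated latency.
function slowestSpan(spans: Span[]): Span {
  return spans.reduce((worst, s) =>
    s.durationMs > worst.durationMs ? s : worst
  );
}

// "Why did this cost $2?" → find the span that dominated cost.
function mostExpensiveSpan(spans: Span[]): Span {
  return spans.reduce((worst, s) =>
    (s.attributes.cost ?? 0) > (worst.attributes.cost ?? 0) ? s : worst
  );
}
```

Observability platforms do this ranking for you in their trace views, but having it as a function lets you attach "dominant span" as a tag on every trace for later filtering.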

Implementing Trace Propagation

For multi-service architectures, trace context must propagate across service boundaries:

// Service A: Agent Orchestrator
import { context, propagation } from '@opentelemetry/api';

async function callToolService(toolName: string, params: any) {
  const headers: Record<string, string> = {};
  // Inject trace context into outgoing request headers
  propagation.inject(context.active(), headers);

  const response = await fetch(`https://tools.internal/${toolName}`, {
    method: 'POST',
    headers: {
      ...headers,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(params),
  });
  return response.json();
}

// Service B: Tool Execution Service
import { context, propagation } from '@opentelemetry/api';

app.post('/flight-search', (req, res) => {
  // Extract trace context from incoming request headers
  const ctx = propagation.extract(context.active(), req.headers);
  context.with(ctx, async () => {
    const span = tracer.startSpan('tool.flight_search');
    // ... tool execution with full trace lineage
    span.end();
  });
});

Layer 3: Automated Evaluation

Tracing tells you what happened. Evaluation tells you how well it happened. This is the layer most teams skip, and the layer that makes or breaks production AI.

The Eval Pipeline

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Trace   │ →  │  Sample  │ →  │  Score   │ →  │  Alert   │
│  Store   │    │  Select  │    │  Eval    │    │  Report  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
                 10-20% of      LLM-as-Judge    Slack/PD
                 traces         + deterministic  if quality
                                  rules           drops

LLM-as-Judge Evaluation

The most powerful eval pattern is using a separate LLM to judge your agent's outputs:

interface EvalResult {
  score: number;      // 0-1
  reasoning: string;  // Why this score
  dimension: string;  // What was evaluated
}

async function evaluateResponse(
  query: string,
  response: string,
  groundTruth?: string
): Promise<EvalResult[]> {
  const evaluations: EvalResult[] = [];

  // Eval 1: Factual Accuracy
  const accuracyEval = await openai.chat.completions.create({
    model: 'gpt-4.1-mini',
    messages: [
      {
        role: 'system',
        content: `You are an expert evaluator. Score the factual accuracy
of the AI response on a scale of 0 to 1.

Scoring rubric:
- 1.0: All facts are correct and verifiable
- 0.7: Mostly correct with minor inaccuracies
- 0.4: Contains significant factual errors
- 0.0: Completely fabricated or wrong

Respond in JSON: { "score": number, "reasoning": string }`,
      },
      {
        role: 'user',
        content: `Query: ${query}

AI Response: ${response}
${groundTruth ? `Ground Truth: ${groundTruth}` : ''}`,
      },
    ],
    response_format: { type: 'json_object' },
  });

  evaluations.push({
    ...JSON.parse(accuracyEval.choices[0].message.content),
    dimension: 'factual_accuracy',
  });

  // Eval 2: Relevance
  const relevanceEval = await openai.chat.completions.create({
    model: 'gpt-4.1-mini',
    messages: [
      {
        role: 'system',
        content: `Score how relevant the response is to the user's query.
1.0 = Directly answers the question
0.5 = Partially relevant
0.0 = Completely off-topic

Respond in JSON: { "score": number, "reasoning": string }`,
      },
      {
        role: 'user',
        content: `Query: ${query}\nResponse: ${response}`,
      },
    ],
    response_format: { type: 'json_object' },
  });

  evaluations.push({
    ...JSON.parse(relevanceEval.choices[0].message.content),
    dimension: 'relevance',
  });

  return evaluations;
}

Deterministic Guards

Not everything needs an LLM judge. Use deterministic checks for known failure patterns:

interface GuardResult {
  passed: boolean;
  violation?: string;
}

function runDeterministicGuards(trace: AgentTrace): GuardResult[] {
  const results: GuardResult[] = [];

  // Guard 1: Token budget exceeded
  const totalTokens = trace.spans
    .filter((s) => s.name.startsWith('llm.'))
    .reduce((sum, s) => sum + (s.attributes.tokens_total || 0), 0);
  results.push({
    passed: totalTokens < 50000,
    violation:
      totalTokens >= 50000
        ? `Token budget exceeded: ${totalTokens} tokens`
        : undefined,
  });

  // Guard 2: Tool call loop detection
  const toolCalls = trace.spans.filter((s) => s.name.startsWith('tool.'));
  const uniqueTools = new Set(toolCalls.map((s) => s.name));
  // If the same tool was called >10 times, it's likely a loop
  for (const tool of uniqueTools) {
    const count = toolCalls.filter((s) => s.name === tool).length;
    results.push({
      passed: count <= 10,
      violation:
        count > 10
          ? `Potential loop: ${tool} called ${count} times`
          : undefined,
    });
  }

  // Guard 3: Latency budget
  const totalLatency = trace.duration;
  results.push({
    passed: totalLatency < 30000, // 30 second budget
    violation:
      totalLatency >= 30000
        ? `Latency budget exceeded: ${totalLatency}ms`
        : undefined,
  });

  // Guard 4: Empty or suspiciously short response
  const finalResponse = trace.output;
  results.push({
    passed: !!finalResponse && finalResponse.length > 20,
    violation:
      !finalResponse || finalResponse.length <= 20
        ? 'Response is empty or suspiciously short'
        : undefined,
  });

  return results;
}

Automated Alerts

Wire evaluations to your alerting system:

async function runEvalPipeline(trace: AgentTrace) {
  // Deterministic guards (fast, run on all traces)
  const guardResults = runDeterministicGuards(trace);
  const guardViolations = guardResults.filter((r) => !r.passed);

  if (guardViolations.length > 0) {
    await sendAlert({
      severity: 'high',
      title: 'Agent Guard Violation',
      details: guardViolations.map((v) => v.violation).join('\n'),
      traceId: trace.traceId,
    });
  }

  // LLM-as-Judge (expensive, run on sampled traces)
  if (shouldSample(trace, 0.1)) { // 10% sampling
    const evalResults = await evaluateResponse(trace.input, trace.output);
    const lowScores = evalResults.filter((e) => e.score < 0.5);

    if (lowScores.length > 0) {
      await sendAlert({
        severity: 'medium',
        title: 'Agent Quality Degradation',
        details: lowScores
          .map((e) => `${e.dimension}: ${e.score} - ${e.reasoning}`)
          .join('\n'),
        traceId: trace.traceId,
      });
    }

    // Store eval results for trend analysis
    await storeEvalResults(trace.traceId, evalResults);
  }
}

Layer 4: Cost Tracking and Analytics

Token costs are the cloud bill of AI applications. Without granular cost tracking, you're guessing.

Real-Time Cost Calculation

const MODEL_PRICING: Record<
  string,
  {
    input: number;  // $ per 1M tokens
    output: number; // $ per 1M tokens
  }
> = {
  'gpt-4.1':           { input: 2.00, output: 8.00 },
  'gpt-4.1-mini':      { input: 0.40, output: 1.60 },
  'gpt-4.1-nano':      { input: 0.10, output: 0.40 },
  'claude-sonnet-4.6': { input: 3.00, output: 15.00 },
  'claude-haiku-4.5':  { input: 1.00, output: 5.00 },
  'gemini-2.5-pro':    { input: 1.25, output: 10.00 },
  'gemini-2.5-flash':  { input: 0.30, output: 2.50 },
};

function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const pricing = MODEL_PRICING[model];
  if (!pricing) return 0;
  return (
    (inputTokens / 1_000_000) * pricing.input +
    (outputTokens / 1_000_000) * pricing.output
  );
}

// Track cost per trace
function aggregateTraceCost(trace: AgentTrace): CostBreakdown {
  const llmSpans = trace.spans.filter((s) => s.name.startsWith('llm.'));

  let totalCost = 0;
  const breakdown: Record<string, number> = {};

  for (const span of llmSpans) {
    const model = span.attributes.model;
    const cost = calculateCost(
      model,
      span.attributes.tokens_in,
      span.attributes.tokens_out
    );
    totalCost += cost;
    breakdown[model] = (breakdown[model] || 0) + cost;
  }

  return {
    totalCost,
    breakdown,
    tokenCount: llmSpans.reduce(
      (sum, s) => sum + s.attributes.tokens_in + s.attributes.tokens_out,
      0
    ),
  };
}

Cost Alerting Thresholds

// Per-request cost guard
const MAX_COST_PER_REQUEST = 0.50;    // $0.50

// Per-user hourly budget
const MAX_COST_PER_USER_HOUR = 5.00;  // $5.00

// Per-organization daily budget
const MAX_COST_PER_ORG_DAY = 500.00;  // $500.00

async function checkCostBudgets(
  cost: number,
  userId: string,
  orgId: string
) {
  // Check per-request
  if (cost > MAX_COST_PER_REQUEST) {
    await sendAlert({
      severity: 'high',
      title: `Request cost exceeded: $${cost.toFixed(4)}`,
    });
  }

  // Check per-user hourly
  const userHourlyCost = await redis.incrbyfloat(
    `cost:user:${userId}:${getCurrentHour()}`,
    cost
  );
  await redis.expire(`cost:user:${userId}:${getCurrentHour()}`, 7200);

  if (userHourlyCost > MAX_COST_PER_USER_HOUR) {
    await sendAlert({
      severity: 'critical',
      title: `User ${userId} hourly budget exceeded`,
    });
    // Optionally: rate-limit or block the user
  }

  // Check per-organization daily
  // (getCurrentDay() is analogous to getCurrentHour() above)
  const orgDailyCost = await redis.incrbyfloat(
    `cost:org:${orgId}:${getCurrentDay()}`,
    cost
  );
  await redis.expire(`cost:org:${orgId}:${getCurrentDay()}`, 172800);

  if (orgDailyCost > MAX_COST_PER_ORG_DAY) {
    await sendAlert({
      severity: 'critical',
      title: `Org ${orgId} daily budget exceeded`,
    });
  }
}

The Tooling Landscape: LangSmith vs Langfuse vs Arize

Choosing the right observability platform is a critical decision. Here's the honest comparison:

LangSmith

Best for: Teams already using LangChain/LangGraph

// LangSmith integration
import { Client } from 'langsmith';
import { traceable } from 'langsmith/traceable';

const client = new Client({
  apiKey: process.env.LANGSMITH_API_KEY,
});

// Decorator-based tracing
const processQuery = traceable(
  async (query: string) => {
    const intent = await classifyIntent(query);
    const results = await searchFlights(intent);
    return generateResponse(results);
  },
  { name: 'process_query', tags: ['production'] }
);

Strengths:

  • Deep LangChain/LangGraph integration (first-party)
  • Built-in prompt playground and versioning
  • Hub for sharing and discovering prompts
  • Strong eval framework with human-in-the-loop
  • Excellent visualization of agent execution graphs

Weaknesses:

  • Vendor lock-in to LangChain ecosystem
  • Closed-source, hosted only (no self-hosting)
  • Pricing scales with trace volume (can get expensive)
  • Limited support for non-LangChain frameworks

Langfuse

Best for: Teams who want open-source, framework-agnostic tracing

// Langfuse v5+ integration (recommended: @langfuse/tracing)
import { observe } from '@langfuse/tracing';

// Decorator-based tracing (simplest approach)
const processQuery = observe(
  { name: 'flight-search-agent' },
  async (query: string) => {
    const intent = await classifyIntent(query);
    const results = await searchFlights(intent);
    return generateResponse(results);
  }
);

// Or use the classic Langfuse client for granular control:
import Langfuse from 'langfuse';

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

const trace = langfuse.trace({
  name: 'flight-search-agent',
  userId: 'user-123',
  metadata: { environment: 'production' },
});

const generation = trace.generation({
  name: 'classify-intent',
  model: 'gpt-4.1-mini',
  input: [{ role: 'user', content: query }],
  output: response,
  usage: {
    promptTokens: 150,
    completionTokens: 30,
  },
});

Strengths:

  • Open-source (MIT license), self-hostable
  • Framework-agnostic (works with any LLM provider)
  • Built-in cost tracking and token analytics
  • Prompt management and versioning
  • Growing ecosystem of integrations
  • Generous free tier

Weaknesses:

  • Smaller community than LangSmith
  • Self-hosting requires infrastructure management
  • Evaluation features less mature than LangSmith
  • UI less polished (improving rapidly)

Arize Phoenix

Best for: Teams with ML/data science backgrounds who need deep analysis

// Arize Phoenix integration
import { trace as otelTrace } from '@opentelemetry/api';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { OpenAIInstrumentation } from '@arizeai/openinference-instrumentation-openai';

// Phoenix uses OpenTelemetry natively
registerInstrumentations({
  instrumentations: [new OpenAIInstrumentation()],
});

Strengths:

  • Built on OpenTelemetry (no proprietary lock-in)
  • Excellent embedding visualization and drift detection
  • Strong retrieval (RAG) analysis tools
  • Local-first development experience (Phoenix runs locally)
  • Best-in-class for debugging retrieval quality

Weaknesses:

  • Steeper learning curve
  • Less focus on agent orchestration tracing
  • Smaller ecosystem of direct integrations
  • Enterprise features require Arize cloud

Comparison Matrix

Feature               | LangSmith | Langfuse | Arize Phoenix
Open Source           | ❌        | ✅ MIT   | ✅ (Phoenix)
Self-Hosting          | ❌        | ✅       | ✅ (Phoenix)
LangChain Integration | ⭐⭐⭐    | ⭐⭐     | ⭐
Framework-Agnostic    | ⭐        | ⭐⭐⭐   | ⭐⭐⭐
Cost Tracking         | ⭐⭐      | ⭐⭐⭐   | ⭐⭐
Eval Framework        | ⭐⭐⭐    | ⭐⭐     | ⭐⭐
RAG Analysis          | ⭐⭐      | ⭐       | ⭐⭐⭐
Prompt Management     | ⭐⭐⭐    | ⭐⭐     | ⭐
Embedding Analysis    | ⭐        | ⭐       | ⭐⭐⭐
Pricing (startup)     | $$$       | Free/$   | Free/$

Production Patterns and Anti-Patterns

Pattern 1: The Circuit Breaker

Prevent runaway agents from burning through your budget:

class AgentCircuitBreaker {
  private tokenCount = 0;
  private llmCalls = 0;
  private toolCalls = 0;
  private startTime: number;

  constructor(
    private limits: {
      maxTokens: number;
      maxLLMCalls: number;
      maxToolCalls: number;
      maxDurationMs: number;
    }
  ) {
    this.startTime = Date.now();
  }

  check(event: { type: 'llm' | 'tool'; tokens?: number }) {
    if (event.type === 'llm') {
      this.llmCalls++;
      this.tokenCount += event.tokens || 0;
    } else {
      this.toolCalls++;
    }

    const elapsed = Date.now() - this.startTime;

    if (this.tokenCount > this.limits.maxTokens) {
      throw new CircuitBreakerError(
        `Token limit exceeded: ${this.tokenCount}`
      );
    }
    if (this.llmCalls > this.limits.maxLLMCalls) {
      throw new CircuitBreakerError(
        `LLM call limit exceeded: ${this.llmCalls}`
      );
    }
    if (this.toolCalls > this.limits.maxToolCalls) {
      throw new CircuitBreakerError(
        `Tool call limit exceeded: ${this.toolCalls}`
      );
    }
    if (elapsed > this.limits.maxDurationMs) {
      throw new CircuitBreakerError(
        `Duration limit exceeded: ${elapsed}ms`
      );
    }
  }
}

// Usage
const breaker = new AgentCircuitBreaker({
  maxTokens: 50000,
  maxLLMCalls: 20,
  maxToolCalls: 30,
  maxDurationMs: 60000, // 1 minute
});

// In your agent loop
for (const step of agentSteps) {
  breaker.check({
    type: step.type,
    tokens: step.tokenUsage,
  });
  await executeStep(step);
}

Pattern 2: Trace-Based Debugging Workflow

When something breaks, follow this systematic approach:

1. DETECT: Automated eval flags quality drop
   ↓
2. IDENTIFY: Filter traces by low eval scores
   ↓
3. COMPARE: Side-by-side with successful traces
   ↓
4. ISOLATE: Find the divergence point
   ↓
5. ROOT CAUSE: Examine inputs/outputs at that span
   ↓
6. FIX: Update prompt, context, or tool config
   ↓
7. VALIDATE: Run evals on the fix against test dataset
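Step 2 (IDENTIFY) is just a query over stored eval results. A sketch, assuming evals are stored as simple records; the `StoredEval` shape is illustrative, not any platform's schema:

```typescript
interface StoredEval {
  traceId: string;
  dimension: string;
  score: number; // 0-1
}

// Pull the trace IDs whose eval scores fell below a threshold,
// sorted worst-first and deduplicated, ready for side-by-side
// comparison against healthy traces.
function lowScoringTraces(
  evals: StoredEval[],
  threshold = 0.5
): string[] {
  return evals
    .filter((e) => e.score < threshold)
    .sort((a, b) => a.score - b.score)
    .map((e) => e.traceId)
    .filter((id, i, arr) => arr.indexOf(id) === i); // dedupe, keep order
}
```

Most observability platforms expose the same filter in their UI; having it as code lets you feed the resulting trace IDs straight into automated regression datasets.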

Anti-Pattern 1: Logging Everything

Don't log every token of every request.

// โŒ Don't do this logger.info('LLM Response', { fullPrompt: systemPrompt + userMessage + context, // 50KB fullResponse: completion, // 10KB metadata: entireTraceObject, // 5KB }); // Result: 65KB per request ร— 1M requests/day = 65GB/day // โœ… Do this instead logger.info('LLM Response', { traceId: trace.id, // Link to detailed trace model: 'gpt-4.1-mini', tokensIn: 150, tokensOut: 30, cost: 0.0001, latencyMs: 200, evalScore: 0.95, }); // Result: 200 bytes per request ร— 1M requests/day = 200MB/day

Anti-Pattern 2: Treating LLM Errors Like HTTP Errors

// โŒ Misleading: HTTP 200 but agent response is terrible if (response.status === 200) { metrics.increment('agent.success'); } // โœ… Correct: Measure actual quality const evalScore = await quickEval(response.body); if (evalScore > 0.7) { metrics.increment('agent.quality.good'); } else { metrics.increment('agent.quality.poor'); // This is the real "error" โ€” trigger investigation }

Anti-Pattern 3: No Baseline

// โŒ Alert: "Eval score is 0.72" โ€” Is that good? Bad? // โœ… Establish baseline first // Week 1-2: Collect eval scores without alerting // Week 3: Calculate P50, P90, P99 baselines // Week 4+: Alert on deviations from baseline const baseline = { accuracy: { p50: 0.85, p90: 0.92, p99: 0.97 }, relevance: { p50: 0.90, p90: 0.95, p99: 0.99 }, latency: { p50: 2000, p90: 5000, p99: 10000 }, }; function shouldAlert( dimension: string, value: number ): boolean { const b = baseline[dimension]; return value < b.p50 * 0.8; // Alert if 20% below median }

The Minimum Viable Observability Stack

If you're starting from scratch, here's the fastest path to production-grade observability:

Day 1: Basic Instrumentation

// 1. Install Langfuse (fastest to get started)
// npm install langfuse
import Langfuse from 'langfuse';

const langfuse = new Langfuse();

// 2. Wrap your agent's main function
async function runAgent(query: string, userId: string) {
  const trace = langfuse.trace({
    name: 'agent-run',
    userId,
    input: query,
  });

  // Your existing agent code here...

  trace.update({ output: response });
  await langfuse.flushAsync();
}

Week 1: Add Cost Tracking

// Track costs on every LLM call
const generation = trace.generation({
  name: 'main-llm-call',
  model: 'gpt-4.1-mini',
  input: messages,
  output: completion,
  usage: {
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
  },
  // Langfuse auto-calculates cost from token counts
});

Week 2: Add Deterministic Guards

// Add the circuit breaker from Pattern 1
// Add empty-response detection
// Add loop detection
// Set up Slack/PagerDuty alerts for guard violations
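The `sendAlert` helper used throughout this guide is left abstract. One minimal sketch posts to a Slack incoming webhook, which accepts a JSON body with a `text` field; the `SLACK_WEBHOOK_URL` environment variable and the `Alert` shape are assumptions, not a prescribed API.

```typescript
interface Alert {
  severity: 'low' | 'medium' | 'high' | 'critical';
  title: string;
  details?: string;
  traceId?: string;
}

// Format the alert as a single Slack message, dropping empty fields.
function formatAlert(alert: Alert): string {
  return [
    `[${alert.severity.toUpperCase()}] ${alert.title}`,
    alert.details ?? '',
    alert.traceId ? `Trace: ${alert.traceId}` : '',
  ]
    .filter(Boolean)
    .join('\n');
}

// Post to a Slack incoming webhook. SLACK_WEBHOOK_URL is an assumption:
// set it to the webhook you create in your Slack workspace.
async function sendAlert(alert: Alert): Promise<void> {
  await fetch(process.env.SLACK_WEBHOOK_URL ?? '', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: formatAlert(alert) }),
  });
}
```

For PagerDuty, swap the fetch target for their Events API and map `severity` to their severity levels; the formatting function stays the same.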

Week 4: Add LLM-as-Judge Evals

// Run on 10% of production traces
// Start with two dimensions: accuracy + relevance
// Establish baselines before activating alerts

Month 2: Graduate to Full Stack

Langfuse (Tracing + Cost)
  + Custom Eval Pipeline (Quality)
  + Grafana/Datadog (Infrastructure)
  + PagerDuty (Alerting)

The LLM Observability Checklist

Before every production deployment of an AI feature:

  • Every LLM call is instrumented with trace context
  • Token counts and model names are captured on every call
  • Tool calls have input/output logging
  • Cost tracking is active at per-request and per-user levels
  • Circuit breaker limits are set (tokens, calls, duration)
  • Deterministic guards are running on 100% of traces
  • LLM-as-Judge evals are running on sampled traces
  • Baselines are established for quality metrics
  • Alerts are configured for guard violations and quality drops
  • Full prompt/completion logging uses sampling, not 100% capture
  • PII scrubbing is applied before logging prompts
  • Dashboard shows real-time cost, quality, and latency trends
  • Trace retention policy is defined (30-90 days typical)

AI agents are not deterministic software. Monitoring them like traditional APIs will give you a false sense of security until the day they quietly go haywire and you have no way to figure out why. The observability patterns in this guide have been battle-tested across production systems handling millions of agent interactions daily. The core insight is simple: if you can't trace the reasoning, you can't debug the failure. Instrument everything, evaluate continuously, and never trust a green dashboard when your agent's output quality hasn't been measured.

