Agent Observability
Full-stack monitoring, tracing, and debugging for production AI agents
Standard APM tools miss most of what matters in AI systems: prompt content, token budgets, retrieval quality, and reasoning chains. Our observability stack is designed from the ground up for agents: it captures these signals as trace data and surfaces them through dashboards, alerts, and debugging tools.
Distributed Tracing
What Gets Traced Automatically
Every instrumented agent records:
| Span Type | Captured Data |
|---|---|
| LLM request | Model, prompt tokens, completion tokens, latency, cost, finish reason |
| Tool call | Tool name, input arguments, output, duration, error |
| Memory read/write | Query, results, namespace, latency |
| Agent handoff | From/to agent, context passed, reason |
| Retrieval | Query, top-k results, scores, source documents |
| Workflow step | Step name, status, retry count, checkpoint state |
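One way to picture the span data in the table is as a discriminated union keyed by span type. The sketch below is illustrative only: the type names, field names, and the `summarizeLlmUsage` helper are assumptions made for this example, not the SDK's actual schema, and only two of the six span types are modeled.

```typescript
// Hypothetical shapes for two of the span types listed above.
// Field names are assumptions, not the SDK's real schema.
interface LlmRequestSpan {
  kind: "llm_request";
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  costUsd: number;
  finishReason: string;
}

interface ToolCallSpan {
  kind: "tool_call";
  toolName: string;
  input: unknown;
  output: unknown;
  durationMs: number;
  error?: string;
}

type Span = LlmRequestSpan | ToolCallSpan;

// Sum tokens and cost over the LLM request spans of one trace.
function summarizeLlmUsage(spans: Span[]): { tokens: number; costUsd: number } {
  let tokens = 0;
  let costUsd = 0;
  for (const span of spans) {
    if (span.kind === "llm_request") {
      tokens += span.promptTokens + span.completionTokens;
      costUsd += span.costUsd;
    }
  }
  return { tokens, costUsd };
}
```

Narrowing on `kind` lets the compiler check that token and cost fields are only read from LLM request spans.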
Trace Visualization
- Waterfall timeline showing parallelism and bottlenecks
- Cost and token breakdown per span
- Input/output diff viewer for LLM calls
- Side-by-side comparison of two trace runs
Sampling
- Always-sample for errors, slow traces, and high-cost runs
- Probabilistic for baseline traffic (configurable rate)
- Tail-based sampling to keep interesting traces regardless of outcome
- Zero code changes required to change sampling strategy
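The rules above can be read as a single decision made once a trace has finished, which is what makes it tail-based. The following is a minimal sketch of that logic, not the SDK's implementation: `TraceSummary`, `SamplingPolicy`, `shouldKeepTrace`, and all thresholds are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical summary of a finished trace; field names are assumptions.
interface TraceSummary {
  hasError: boolean;
  durationMs: number;
  costUsd: number;
}

// Hypothetical policy; thresholds and the baseline rate are illustrative.
interface SamplingPolicy {
  slowThresholdMs: number;
  highCostThresholdUsd: number;
  baselineRate: number; // probability in [0, 1] for ordinary traffic
}

// Decide after the trace completes, so errors, slow traces, and expensive
// runs are always kept; everything else is sampled probabilistically.
function shouldKeepTrace(
  trace: TraceSummary,
  policy: SamplingPolicy,
  random: () => number = Math.random,
): boolean {
  if (trace.hasError) return true;
  if (trace.durationMs >= policy.slowThresholdMs) return true;
  if (trace.costUsd >= policy.highCostThresholdUsd) return true;
  return random() < policy.baselineRate;
}
```

Passing the random source as a parameter keeps the decision deterministic under test.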
Cost Analytics
Track LLM spend with the granularity of software observability tools.
Dashboards include:
- Daily / weekly spend by model, agent, and workflow
- Cost per successful task completion
- Token efficiency ratio (output tokens per dollar)
- Spend forecasting based on current trajectory
- Per-user and per-team cost attribution
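Two of the dashboard metrics above are simple ratios. The sketch below shows one plausible way to compute them, assuming that cost per successful task divides total spend (including failed runs) by the number of successes. The `RunRecord` shape and the `summarizeCost` helper are hypothetical, not part of the product's API.

```typescript
// Hypothetical per-run record; field names are assumptions.
interface RunRecord {
  costUsd: number;
  outputTokens: number;
  succeeded: boolean;
}

interface CostMetrics {
  costPerSuccessUsd: number | null; // null when no run succeeded
  outputTokensPerDollar: number | null; // null when nothing was spent
}

// Assumption: failed runs still count toward spend, so they raise the
// cost per successful task completion.
function summarizeCost(runs: RunRecord[]): CostMetrics {
  let totalCostUsd = 0;
  let totalOutputTokens = 0;
  let successes = 0;
  for (const run of runs) {
    totalCostUsd += run.costUsd;
    totalOutputTokens += run.outputTokens;
    if (run.succeeded) successes += 1;
  }
  return {
    costPerSuccessUsd: successes > 0 ? totalCostUsd / successes : null,
    outputTokensPerDollar: totalCostUsd > 0 ? totalOutputTokens / totalCostUsd : null,
  };
}
```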
Alerts:
- Budget threshold warnings (50%, 80%, 100%)
- Cost spike detection (>2× day-over-day)
- High-cost trace flagging for manual review
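The budget and spike rules above can be sketched as two small checks. This is an illustration under stated assumptions rather than the product's alerting logic: `crossedBudgetThresholds` and `isCostSpike` are hypothetical helpers, and the warning is assumed to fire once, at the moment spend first crosses each threshold.

```typescript
// Thresholds from the list above, as fractions of the budget.
const BUDGET_THRESHOLDS = [0.5, 0.8, 1.0];

// Return the thresholds crossed between two spend readings, so each
// warning fires once rather than on every subsequent reading.
function crossedBudgetThresholds(
  previousSpendUsd: number,
  currentSpendUsd: number,
  budgetUsd: number,
): number[] {
  if (budgetUsd <= 0) return [];
  return BUDGET_THRESHOLDS.filter(
    (t) => previousSpendUsd < t * budgetUsd && currentSpendUsd >= t * budgetUsd,
  );
}

// Day-over-day spike: today's spend is more than `factor` times yesterday's.
// With no spend yesterday there is no baseline, so no spike is reported.
function isCostSpike(yesterdayUsd: number, todayUsd: number, factor: number = 2): boolean {
  return yesterdayUsd > 0 && todayUsd > factor * yesterdayUsd;
}
```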
Quality Signals
Observability for AI goes beyond latency and error rates.
Hallucination Detection
- Confidence scoring on factual claims using a lightweight verifier model
- Source attribution checks (did the agent cite something it wasn't given?)
- Flagging answers that contradict retrieved context
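In its simplest form, the source attribution check above is a set difference between what the agent cited and what it was given. The sketch below assumes citations and retrieved documents can both be reduced to string IDs. `findUnsupportedCitations` is a hypothetical helper, and a production check would likely need fuzzier matching.

```typescript
// Return cited source IDs that were never supplied to the agent.
// Duplicates in the citations are reported once, in first-seen order.
function findUnsupportedCitations(citedIds: string[], providedIds: string[]): string[] {
  return citedIds.filter(
    (id, index) => citedIds.indexOf(id) === index && providedIds.indexOf(id) === -1,
  );
}
```

An empty result means every citation points at a document that was actually in the context.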
Retrieval Quality
- Precision@k tracking for RAG pipelines
- Context relevance scoring per retrieved chunk
- Unused context detection (retrieval cost with no impact on output)
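Precision@k is the fraction of the top k retrieved items that are relevant. The sketch below computes it from lists of document IDs. The `precisionAtK` helper is hypothetical, and it assumes relevance labels are available from an evaluation set or a judge, which this page does not specify.

```typescript
// Precision@k: relevant items among the top k results, divided by k.
// If fewer than k items were retrieved, the missing slots count as misses.
function precisionAtK(retrievedIds: string[], relevantIds: string[], k: number): number {
  if (k <= 0) return 0;
  const topK = retrievedIds.slice(0, k);
  const hits = topK.filter((id) => relevantIds.indexOf(id) !== -1).length;
  return hits / k;
}
```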
Response Quality
- Structured output validation failures tracked per schema
- Refusal and safety filter activation rates
- Response length distribution and truncation events
Debugging Tools
Trace Replay
Re-execute any historical trace with modified inputs, prompts, or model parameters, then compare the outputs side by side. There is no need to reconstruct the full context manually.
Session Inspection
Full conversation view with:
- User messages and agent responses
- Internal reasoning steps (chain-of-thought)
- All tool calls and their results
- Memory reads at each turn
- Token count per message
Prompt Versioning
Track prompt changes across deployments. A/B compare prompt versions by cost, latency, and quality metrics on production traffic.
Alerting
| Alert Type | Default Threshold | Configurable |
|---|---|---|
| Error rate spike | >5% over 5 min | Yes |
| p95 latency degradation | >2× baseline | Yes |
| LLM cost overrun | >150% of budget | Yes |
| Tool failure rate | >10% over 1 hour | Yes |
| Hallucination score | >0.3 average | Yes |
| Agent stuck / timeout | >configured timeout | Yes |
Delivers to: Slack, PagerDuty, OpsGenie, email, and custom webhooks.
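As an illustration of how a windowed threshold from the table could be evaluated, the sketch below checks the default error-rate rule (more than 5% over 5 minutes). The `RequestEvent` shape and both helper functions are hypothetical; the product's actual evaluation logic is not described on this page.

```typescript
// Hypothetical request outcome; field names are assumptions.
interface RequestEvent {
  timestampMs: number;
  isError: boolean;
}

// Fraction of requests in the window (nowMs - windowMs, nowMs] that failed.
function errorRateInWindow(events: RequestEvent[], nowMs: number, windowMs: number): number {
  const recent = events.filter(
    (e) => e.timestampMs > nowMs - windowMs && e.timestampMs <= nowMs,
  );
  if (recent.length === 0) return 0;
  return recent.filter((e) => e.isError).length / recent.length;
}

// Defaults mirror the table: more than 5% errors over the last 5 minutes.
function shouldFireErrorRateAlert(
  events: RequestEvent[],
  nowMs: number,
  threshold: number = 0.05,
  windowMs: number = 5 * 60 * 1000,
): boolean {
  return errorRateInWindow(events, nowMs, windowMs) > threshold;
}
```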
Integration
Zero-Code Instrumentation
Drop in our SDK and all LLM calls are automatically traced:
```typescript
import { instrument } from '@assistance/observe'

instrument({
  serviceName: 'my-agent',
  endpoint: 'https://ingest.observe.assistance.bg',
  apiKey: process.env.OBSERVE_API_KEY,
})
// All OpenAI, Anthropic, and LangChain calls now traced
```
Manual Spans
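```typescript
import { tracer } from '@assistance/observe'

// `myVectorDb` and `query` come from your own application code.
const span = tracer.startSpan('custom-retrieval')
try {
  const results = await myVectorDb.search(query)
  span.setAttributes({ resultCount: results.length })
} finally {
  // End the span even if the search throws, so it is never left open.
  span.end()
}
```
Framework Support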
Works out of the box with LangChain, LangGraph, CrewAI, AutoGen, custom agent loops, and any framework that uses standard LLM client libraries.
Data Retention
| Tier | Retention | Resolution |
|---|---|---|
| Full traces | 30 days | Raw |
| Aggregated metrics | 13 months | 1-minute |
| Cost data | 24 months | Per-request |
| Anomaly events | 24 months | Raw |
Getting Started
Install the SDK in under 10 minutes. We'll walk you through instrumenting your first agent and setting up your dashboards.