Agent Observability

Full-stack monitoring, tracing, and debugging for production AI agents


Standard APM tools miss most of what matters in AI systems: prompt content, token budgets, retrieval quality, and reasoning chains. Our observability stack is built for agents from the ground up: it captures these signals on every run and surfaces them in traces, dashboards, and alerts.

Distributed Tracing

What Gets Traced Automatically

Every instrumented agent records:

| Span Type | Captured Data |
| --- | --- |
| LLM request | Model, prompt tokens, completion tokens, latency, cost, finish reason |
| Tool call | Tool name, input arguments, output, duration, error |
| Memory read/write | Query, results, namespace, latency |
| Agent handoff | From/to agent, context passed, reason |
| Retrieval | Query, top-k results, scores, source documents |
| Workflow step | Step name, status, retry count, checkpoint state |
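
One way to picture the table above is as a discriminated union with one variant per span type. The sketch below is illustrative only: the type name `Span`, every field name, and the helper `totalLlmCostUsd` are assumptions derived from the table, not the SDK's actual schema.

```typescript
// Hypothetical span shapes derived from the table above; all names are assumptions.
type Span =
  | {
      type: 'llm_request'
      model: string
      promptTokens: number
      completionTokens: number
      latencyMs: number
      costUsd: number
      finishReason: string
    }
  | { type: 'tool_call'; toolName: string; input: unknown; output: unknown; durationMs: number; error?: string }
  | { type: 'memory'; operation: 'read' | 'write'; query: string; results: unknown[]; namespace: string; latencyMs: number }
  | { type: 'agent_handoff'; fromAgent: string; toAgent: string; context: unknown; reason: string }
  | { type: 'retrieval'; query: string; topK: { sourceDocument: string; score: number }[] }
  | { type: 'workflow_step'; stepName: string; status: string; retryCount: number; checkpointState: unknown }

// Sum the LLM cost across a trace; spans of other types contribute nothing.
function totalLlmCostUsd(spans: Span[]): number {
  let total = 0
  for (const span of spans) {
    // Narrowing on `type` gives typed access to `costUsd`.
    if (span.type === 'llm_request') total += span.costUsd
  }
  return total
}
```

Keying each variant on a `type` field lets the compiler check that code only reads fields that exist for that span type.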

Trace Visualization

  • Waterfall timeline showing parallelism and bottlenecks
  • Cost and token breakdown per span
  • Input/output diff viewer for LLM calls
  • Side-by-side comparison of two trace runs

Sampling

  • Always-sample for errors, slow traces, and high-cost runs
  • Probabilistic for baseline traffic (configurable rate)
  • Tail-based sampling, which decides after a trace completes so that interesting traces are kept regardless of outcome
  • Sampling strategy can be changed without code changes
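
The sampling rules above can be sketched as a single decision function evaluated once a trace has finished. This is a minimal illustration, not the product's implementation: `TraceSummary`, `SamplingConfig`, `shouldKeepTrace`, and the threshold fields are all hypothetical names.

```typescript
// Hypothetical trace summary and config; names and thresholds are assumptions.
interface TraceSummary {
  hasError: boolean
  durationMs: number
  costUsd: number
}

interface SamplingConfig {
  slowThresholdMs: number // traces at or above this duration are always kept
  highCostUsd: number // traces at or above this cost are always kept
  baselineRate: number // probability in [0, 1] applied to all other traces
}

// Called after the trace completes, so errors, duration, and cost are known.
function shouldKeepTrace(
  trace: TraceSummary,
  config: SamplingConfig,
  random: () => number = Math.random, // injectable for deterministic tests
): boolean {
  if (trace.hasError) return true
  if (trace.durationMs >= config.slowThresholdMs) return true
  if (trace.costUsd >= config.highCostUsd) return true
  // Baseline traffic: keep a configurable fraction.
  return random() < config.baselineRate
}
```

Because the thresholds and rate live in a config object rather than in agent code, the strategy can change without touching the instrumented application.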

Cost Analytics

Track LLM spend at the same granularity as other observability data: per request, per agent, and per workflow.

Dashboards include:

  • Daily / weekly spend by model, agent, and workflow
  • Cost per successful task completion
  • Token efficiency ratio (output tokens per dollar)
  • Spend forecasting based on current trajectory
  • Per-user and per-team cost attribution
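
Two of the dashboard metrics above reduce to simple ratios. The sketch below assumes that cost per successful task divides total spend (including failed runs) by the number of successes; the document does not define the formula, and `TaskRun` and both function names are hypothetical.

```typescript
// Hypothetical per-run record; field names are assumptions.
interface TaskRun {
  costUsd: number
  outputTokens: number
  succeeded: boolean
}

// Assumed definition: total spend, including failed runs, divided by successes.
function costPerSuccessfulTask(runs: TaskRun[]): number | null {
  const successes = runs.filter((run) => run.succeeded).length
  if (successes === 0) return null // undefined when nothing succeeded
  const totalCost = runs.reduce((sum, run) => sum + run.costUsd, 0)
  return totalCost / successes
}

// Token efficiency ratio: output tokens produced per dollar spent.
function outputTokensPerDollar(runs: TaskRun[]): number | null {
  const totalCost = runs.reduce((sum, run) => sum + run.costUsd, 0)
  if (totalCost === 0) return null // avoid division by zero
  const totalTokens = runs.reduce((sum, run) => sum + run.outputTokens, 0)
  return totalTokens / totalCost
}
```

Counting failed runs in the numerator makes retries and dead ends visible in the metric rather than hiding them.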

Alerts:

  • Budget threshold warnings (50%, 80%, 100%)
  • Cost spike detection (>2× day-over-day)
  • High-cost trace flagging for manual review
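
The first two alert rules above can be expressed as small pure functions. This is a sketch under stated assumptions: the function names are hypothetical, and treating a zero baseline as "no spike" is a design choice made here, not documented behavior.

```typescript
// Budget fractions at which warnings fire (50%, 80%, 100%).
const BUDGET_THRESHOLDS = [0.5, 0.8, 1.0]

// Return every threshold that current spend has reached or passed.
function crossedBudgetThresholds(spendUsd: number, budgetUsd: number): number[] {
  if (budgetUsd <= 0) return []
  const used = spendUsd / budgetUsd
  return BUDGET_THRESHOLDS.filter((threshold) => used >= threshold)
}

// Spike rule: today's spend is more than 2x yesterday's.
function isCostSpike(todayUsd: number, yesterdayUsd: number): boolean {
  // Assumption: with no baseline spend there is nothing to compare against.
  if (yesterdayUsd <= 0) return false
  return todayUsd > 2 * yesterdayUsd
}
```

For example, `crossedBudgetThresholds(85, 100)` reports the 50% and 80% thresholds but not 100%.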

Quality Signals

Observability for AI goes beyond latency and error rates.

Hallucination Detection

  • Confidence scoring on factual claims using a lightweight verifier model
  • Source attribution checks (did the agent cite something it wasn't given?)
  • Flagging answers that contradict retrieved context
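
The source attribution check above amounts to a set difference between what the agent cited and what it was given. The sketch below assumes citations and provided sources can both be reduced to string IDs; `unsupportedCitations` is a hypothetical helper, not an SDK function.

```typescript
// Return the cited source IDs that were never supplied to the agent.
// Assumes citations and provided sources share one ID scheme.
function unsupportedCitations(citedIds: string[], providedIds: string[]): string[] {
  return citedIds.filter(
    (id, index) =>
      providedIds.indexOf(id) === -1 && // not among the provided sources
      citedIds.indexOf(id) === index, // report each offending ID once
  )
}
```

A non-empty result means the answer references at least one source that was not in its context, which is a strong candidate for manual review.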

Retrieval Quality

  • Precision@k tracking for RAG pipelines
  • Context relevance scoring per retrieved chunk
  • Unused context detection (retrieval cost with no impact on output)
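
Precision@k is the fraction of the top k retrieved items that are relevant. The sketch below uses the standard definition and assumes relevance labels are available as a list of IDs, for example from a labeled evaluation set; `precisionAtK` is a hypothetical helper name.

```typescript
// Precision@k = (relevant items among the top k) / k.
// Assumes retrieved IDs are unique and ordered best-first.
function precisionAtK(retrievedIds: string[], relevantIds: string[], k: number): number {
  if (k <= 0) return 0
  const topK = retrievedIds.slice(0, k)
  const hits = topK.filter((id) => relevantIds.indexOf(id) !== -1).length
  // Divide by k even if fewer than k items came back, per the usual definition.
  return hits / k
}
```

With four retrieved chunks of which two are labeled relevant, precision@4 is 0.5.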

Response Quality

  • Structured output validation failures tracked per schema
  • Refusal and safety filter activation rates
  • Response length distribution and truncation events

Debugging Tools

Trace Replay

Re-execute any historical trace with modified inputs, prompts, or model parameters. Compare outputs side-by-side. No need to reconstruct the full context manually.

Session Inspection

Full conversation view with:

  • User messages and agent responses
  • Internal reasoning steps (chain-of-thought)
  • All tool calls and their results
  • Memory reads at each turn
  • Token count per message

Prompt Versioning

Track prompt changes across deployments. A/B compare prompt versions by cost, latency, and quality metrics on production traffic.


Alerting

| Alert Type | Default Threshold | Configurable |
| --- | --- | --- |
| Error rate spike | >5% over 5 min | Yes |
| p95 latency degradation | >2× baseline | Yes |
| LLM cost overrun | >150% of budget | Yes |
| Tool failure rate | >10% over 1 hour | Yes |
| Hallucination score | >0.3 average | Yes |
| Agent stuck / timeout | >configured timeout | Yes |

Delivers to: Slack, PagerDuty, OpsGenie, email, and custom webhooks.
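
As an illustration of how a windowed threshold such as the error-rate default (>5% over 5 min) might be evaluated, here is a minimal sketch. `RequestEvent`, `errorRateInWindow`, and `errorRateAlertFires` are hypothetical names, and the real evaluation logic may differ.

```typescript
// Hypothetical request record; field names are assumptions.
interface RequestEvent {
  timestampMs: number
  isError: boolean
}

// Fraction of requests in the window (nowMs - windowMs, nowMs] that failed.
function errorRateInWindow(events: RequestEvent[], nowMs: number, windowMs: number): number {
  const recent = events.filter(
    (event) => event.timestampMs > nowMs - windowMs && event.timestampMs <= nowMs,
  )
  if (recent.length === 0) return 0 // no traffic, no error rate
  const errors = recent.filter((event) => event.isError).length
  return errors / recent.length
}

// Defaults mirror the table: more than 5% errors over the last 5 minutes.
function errorRateAlertFires(
  events: RequestEvent[],
  nowMs: number,
  threshold: number = 0.05,
  windowMs: number = 5 * 60 * 1000,
): boolean {
  return errorRateInWindow(events, nowMs, windowMs) > threshold
}
```

The comparison is strict, so a rate of exactly 5% does not fire under this sketch.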


Integration

Zero-Code Instrumentation

Drop in our SDK and all LLM calls are automatically traced:

```typescript
import { instrument } from '@assistance/observe'

instrument({
  serviceName: 'my-agent',
  endpoint: 'https://ingest.observe.assistance.bg',
  apiKey: process.env.OBSERVE_API_KEY,
})
// All OpenAI, Anthropic, and LangChain calls now traced
```

Manual Spans

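```typescript
import { tracer } from '@assistance/observe'

// Wrap a custom step in a manual span; myVectorDb and query are defined elsewhere.
const span = tracer.startSpan('custom-retrieval')
try {
  const results = await myVectorDb.search(query)
  span.setAttributes({ resultCount: results.length })
} finally {
  // End the span even if the search throws, so it is never left open.
  span.end()
}
```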

Framework Support

Works out of the box with LangChain, LangGraph, CrewAI, AutoGen, custom agent loops, and any framework that uses standard LLM client libraries.


Data Retention

| Tier | Retention | Resolution |
| --- | --- | --- |
| Full traces | 30 days | Raw |
| Aggregated metrics | 13 months | 1-minute |
| Cost data | 24 months | Per-request |
| Anomaly events | 24 months | Raw |
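
To make the retention windows concrete, the sketch below computes when a record in each tier would expire. It assumes retention is counted in calendar days and calendar months from creation time in UTC, which the table does not specify; `Tier`, `RETENTION`, and `expiresAt` are hypothetical names.

```typescript
// Tier names are assumptions; the periods come from the table above.
type Tier = 'fullTraces' | 'aggregatedMetrics' | 'costData' | 'anomalyEvents'

const RETENTION: Record<Tier, { amount: number; unit: 'days' | 'months' }> = {
  fullTraces: { amount: 30, unit: 'days' },
  aggregatedMetrics: { amount: 13, unit: 'months' },
  costData: { amount: 24, unit: 'months' },
  anomalyEvents: { amount: 24, unit: 'months' },
}

// Compute the expiry instant for a record created at `createdAt` (UTC arithmetic).
function expiresAt(tier: Tier, createdAt: Date): Date {
  const expiry = new Date(createdAt.getTime()) // copy; do not mutate the input
  const rule = RETENTION[tier]
  if (rule.unit === 'days') {
    expiry.setUTCDate(expiry.getUTCDate() + rule.amount)
  } else {
    // Note: a day-of-month missing from the target month rolls into the next month.
    expiry.setUTCMonth(expiry.getUTCMonth() + rule.amount)
  }
  return expiry
}
```

Under these assumptions, a full trace created on 1 January 2024 expires on 31 January 2024, while aggregated metrics from 15 January 2024 last until 15 February 2025.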

Getting Started