I've been running a small AI automation shop — just me, a handful of agents, and a self-hosted stack that needs to stay observable without blowing the budget. When I started instrumenting my LLM pipelines, I found that most observability guides assumed you'd use a managed platform. But if you're like me and prefer to own your data and infrastructure, OpenTelemetry gives you a solid, vendor-neutral foundation.
Here's what I've learned getting OpenTelemetry working for LLM agent traces on a self-hosted setup in 2026.
## Why OpenTelemetry for LLM Workloads?
OpenTelemetry (OTel) has become the de facto standard for distributed tracing, metrics, and logs. The ecosystem matured significantly through 2025, and the semantic conventions for generative AI — covering LLM calls, token usage, model parameters — landed as stable in early 2026.
For LLM workloads specifically, OTel gives you a few things that matter:
- **Trace continuity across agent steps.** When your agent calls an LLM, retrieves from a vector store, then calls another LLM, each step is a span in a single trace. You see the full chain, not just isolated API calls.
- **Token and cost attribution.** The `gen_ai` semantic conventions include attributes like `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`, which let you track per-request costs without bolting on a separate billing layer.
- **Vendor neutrality.** Whether you're calling OpenAI, Anthropic, or a local model via vLLM, the instrumentation shape is the same. Swap providers without rewriting your observability code.
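To make that neutrality concrete: as long as every provider path emits the same attribute names, downstream queries don't care where the call went. Here's a minimal sketch — `genai_attributes` is a hypothetical helper, not part of any SDK:

```python
# Hypothetical helper: normalize usage data from different providers into the
# same gen_ai.* attribute names, so dashboards don't care which backend ran.
def genai_attributes(system, model, input_tokens, output_tokens):
    return {
        "gen_ai.system": system,                # e.g. "anthropic", "openai", "vllm"
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

# Same shape whether the call went to a hosted API or a local vLLM server:
attrs = genai_attributes("openai", "gpt-4o", 812, 143)
```

You'd pass a dict like this to `span.set_attributes()` instead of setting each attribute by hand.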
## The Self-Hosted Stack
My setup is modest — a single VPS running the collection and storage layer, with agents deployed separately. Here's the architecture:
```
[Your LLM Agents]
        |
        v
[OTel Collector]   ← receives traces via OTLP/gRPC
        |
        v
[Tempo / Jaeger]   ← trace storage
[Prometheus]       ← metrics storage
[Grafana]          ← visualization
```
If you've looked at the self-hosted vs managed cost comparison, you know the economics are favorable when you're running fewer than five agents. The managed platforms charge per span or per seat, which adds up quickly even at small scale.
## Setting Up the OTel Collector
The Collector is the central hub. It receives telemetry from your agents, processes it, and exports to your storage backends. Here's a minimal config for LLM traces:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
Nothing exotic here. The batch processor keeps things efficient, and we're exporting traces to Tempo and metrics to Prometheus. If you want a deeper walkthrough on getting this into production, the production deployment guide covers Docker Compose configs and health checks.
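For a rough idea of how the pieces fit together, here's a Compose sketch — image tags, config file names, and volume paths are my assumptions, not a tested deployment:

```yaml
# docker-compose.yml — minimal sketch; pin image versions in real use
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```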
## Instrumenting LLM Calls
The actual instrumentation depends on your language and SDK. I'll show Python since that's what most agent code runs on.
First, install the packages:
```shell
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-grpc \
    opentelemetry-instrumentation-requests
```
Then set up a tracer and wrap your LLM calls:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize the tracer provider and point it at the Collector
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://your-collector:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-agent")

def call_llm(prompt, model="claude-sonnet-4-20250514"):
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", 1024)

        response = your_llm_client.complete(prompt=prompt, model=model)

        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        span.set_attribute("gen_ai.response.model", response.model)
        return response.content
```
The key is using the `gen_ai.*` semantic conventions consistently. This means your Grafana dashboards, alerts, and queries work the same regardless of which model or provider you're hitting.
## Tracing Multi-Step Agent Workflows
Where this gets really useful is tracing a full agent workflow. Each tool call, retrieval step, and LLM invocation becomes a child span:
```python
def run_agent(task):
    with tracer.start_as_current_span("agent.run") as parent:
        parent.set_attribute("agent.task", task)

        # Step 1: retrieve context
        with tracer.start_as_current_span("retrieval.vector_search"):
            context = search_vector_store(task)

        # Step 2: call LLM with context
        result = call_llm(f"Context: {context}\nTask: {task}")

        # Step 3: maybe call a tool
        if needs_tool_call(result):
            with tracer.start_as_current_span("tool.execute") as tool_span:
                tool_span.set_attribute("tool.name", "web_search")
                tool_result = execute_tool(result)
            result = call_llm(f"Tool result: {tool_result}\nOriginal task: {task}")

        return result
```
When you view this in Grafana via Tempo, you get a waterfall trace showing exactly where time was spent — was it the vector search? The first LLM call? The tool execution? This is the kind of visibility that makes debugging agent behavior tractable instead of guesswork.
## What You Actually See in the Dashboard
Once everything is wired up, your self-hosted observability dashboard shows you:
- Latency breakdown per agent step — which spans are slow, and whether it's network or model inference
- Token usage over time — catch runaway prompts before they drain your API budget
- Error rates by model/provider — spot degraded model endpoints early
- Trace search — find the exact trace where an agent went off the rails
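Because token counts live on spans, cost attribution becomes a query rather than a separate system. A rough sketch of the arithmetic — the price table here is a placeholder, not real provider rates:

```python
# Rough per-request cost estimator from span token attributes.
# PRICES_PER_MTOK is a placeholder table — plug in your providers' actual
# per-million-token rates.
PRICES_PER_MTOK = {
    # model: (input $/M tokens, output $/M tokens) — assumed numbers
    "example-model": (3.00, 15.00),
}

def span_cost(model, input_tokens, output_tokens):
    inp, out = PRICES_PER_MTOK[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

cost = span_cost("example-model", 500_000, 100_000)
# 500k input at $3/M = $1.50, plus 100k output at $15/M = $1.50 → $3.00
```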
For a solo operator running a few agents, this level of visibility is the difference between confidently shipping agent workflows and crossing your fingers every deploy.
## Rough Edges and Honest Takes
A few things that are still annoying in 2026:
**Auto-instrumentation for LLM SDKs is patchy.** The OpenAI Python SDK has decent OTel support now, but Anthropic's is still experimental. You'll likely write some manual spans.
**Trace volume can surprise you.** Agents that loop — retries, multi-turn conversations — generate a lot of spans. Set up sampling early. A simple tail-based sampler that keeps error traces and samples 10% of success traces works well.
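That keep-errors-plus-10% policy can be expressed with the `tail_sampling` processor in the Collector's contrib distribution. A sketch — double-check the fields against the processor docs for your Collector version:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the trace is complete
    policies:
      - name: keep-errors         # always keep traces containing an error span
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-successes    # keep ~10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Note that tail sampling requires the `otel/opentelemetry-collector-contrib` build; the core distribution doesn't ship this processor.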
**Grafana dashboards take time to build.** The gen_ai semantic conventions are new enough that there aren't many pre-built dashboards. Budget an afternoon to set up your panels.
## Wrapping Up
OpenTelemetry for LLM observability isn't a silver bullet, but it's the most practical foundation I've found for self-hosted setups. The semantic conventions are mature enough to use in production, the Collector is rock-solid, and the cost of running your own Tempo + Grafana stack is a fraction of what you'd pay for a managed platform.
If you're running a handful of agents and want to actually understand what they're doing, this stack is worth the setup time.


