I've been running a small AI automation shop — just me, a handful of agents, and a self-hosted stack that needs to stay observable without blowing the budget. When I started instrumenting my LLM pipelines, I found that most observability guides assumed you'd use a managed platform. But if you're like me and prefer to own your data and infrastructure, OpenTelemetry gives you a solid, vendor-neutral foundation.
Here's what I've learned getting OpenTelemetry working for LLM agent traces on a self-hosted setup in 2026.
## Why OpenTelemetry for LLM Workloads?
OpenTelemetry (OTel) has become the de facto standard for distributed tracing, metrics, and logs. The ecosystem matured significantly through 2025, and the semantic conventions for generative AI — covering LLM calls, token usage, model parameters — landed as stable in early 2026.
For LLM workloads specifically, OTel gives you a few things that matter:
- **Trace continuity across agent steps.** When your agent calls an LLM, retrieves from a vector store, then calls another LLM, each step is a span in a single trace. You see the full chain, not just isolated API calls.
- **Token and cost attribution.** The `gen_ai` semantic conventions include attributes like `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`, which let you track per-request costs without bolting on a separate billing layer.
- **Vendor neutrality.** Whether you're calling OpenAI, Anthropic, or a local model via vLLM, the instrumentation shape is the same. Swap providers without rewriting your observability code.
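To make that neutrality concrete: as long as every provider path emits the same attribute names, downstream queries don't care where the call went. Here's a minimal sketch — `genai_attributes` is a hypothetical helper, not part of any SDK:

```python
# Hypothetical helper: normalize usage data from different providers into the
# same gen_ai.* attribute names, so dashboards don't care which backend ran.
def genai_attributes(system, model, input_tokens, output_tokens):
    return {
        "gen_ai.system": system,                # e.g. "anthropic", "openai", "vllm"
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

# Same shape whether the call went to a hosted API or a local vLLM server:
attrs = genai_attributes("openai", "gpt-4o", 812, 143)
```

You'd pass a dict like this to `span.set_attributes()` instead of setting each attribute by hand.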
## The Self-Hosted Stack
My setup is modest — a single VPS running the collection and storage layer, with agents deployed separately. Here's the architecture:
```
[Your LLM Agents]
        |
        v
[OTel Collector]   ← receives traces via OTLP/gRPC
        |
        v
[Tempo / Jaeger]   ← trace storage
[Prometheus]       ← metrics storage
[Grafana]          ← visualization
```
If you've looked at the self-hosted vs managed cost comparison, you know the economics are favorable when you're running fewer than five agents. The managed platforms charge per span or per seat, which adds up quickly even at small scale.
## Setting Up the OTel Collector
The Collector is the central hub. It receives telemetry from your agents, processes it, and exports to your storage backends. Here's a minimal config for LLM traces:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
Nothing exotic here. The batch processor keeps things efficient, and we're exporting traces to Tempo and metrics to Prometheus. If you want a deeper walkthrough on getting this into production, the production deployment guide covers Docker Compose configs and health checks.
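For a rough idea of how the pieces fit together, here's a Compose sketch — image tags, config file names, and volume paths are my assumptions, not a tested deployment:

```yaml
# docker-compose.yml — minimal sketch; pin image versions in real use
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```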
## Instrumenting LLM Calls
The actual instrumentation depends on your language and SDK. I'll show Python since that's what most agent code runs on.
First, install the packages:
```shell
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-grpc \
    opentelemetry-instrumentation-requests
```
Then set up a tracer and wrap your LLM calls:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize the tracer provider and point it at the Collector
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://your-collector:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-agent")

def call_llm(prompt, model="claude-sonnet-4-20250514"):
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", 1024)

        response = your_llm_client.complete(prompt=prompt, model=model)

        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        span.set_attribute("gen_ai.response.model", response.model)
        return response.content
```
The key is using the `gen_ai.*` semantic conventions consistently. This means your Grafana dashboards, alerts, and queries work the same regardless of which model or provider you're hitting.
## Tracing Multi-Step Agent Workflows
Where this gets really useful is tracing a full agent workflow. Each tool call, retrieval step, and LLM invocation becomes a child span:
```python
def run_agent(task):
    with tracer.start_as_current_span("agent.run") as parent:
        parent.set_attribute("agent.task", task)

        # Step 1: retrieve context
        with tracer.start_as_current_span("retrieval.vector_search"):
            context = search_vector_store(task)

        # Step 2: call LLM with context
        result = call_llm(f"Context: {context}\nTask: {task}")

        # Step 3: maybe call a tool
        if needs_tool_call(result):
            with tracer.start_as_current_span("tool.execute") as tool_span:
                tool_span.set_attribute("tool.name", "web_search")
                tool_result = execute_tool(result)
            result = call_llm(f"Tool result: {tool_result}\nOriginal task: {task}")

        return result
```
When you view this in Grafana via Tempo, you get a waterfall trace showing exactly where time was spent — was it the vector search? The first LLM call? The tool execution? This is the kind of visibility that makes debugging agent behavior tractable instead of guesswork.
## What You Actually See in the Dashboard
Once everything is wired up, your self-hosted observability dashboard shows you:
- Latency breakdown per agent step — which spans are slow, and whether it's network or model inference
- Token usage over time — catch runaway prompts before they drain your API budget
- Error rates by model/provider — spot degraded model endpoints early
- Trace search — find the exact trace where an agent went off the rails
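Because token counts live on spans, cost attribution becomes a query rather than a separate system. A rough sketch of the arithmetic — the price table here is a placeholder, not real provider rates:

```python
# Rough per-request cost estimator from span token attributes.
# PRICES_PER_MTOK is a placeholder table — plug in your providers' actual
# per-million-token rates.
PRICES_PER_MTOK = {
    # model: (input $/M tokens, output $/M tokens) — assumed numbers
    "example-model": (3.00, 15.00),
}

def span_cost(model, input_tokens, output_tokens):
    inp, out = PRICES_PER_MTOK[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

cost = span_cost("example-model", 500_000, 100_000)
# 500k input at $3/M = $1.50, plus 100k output at $15/M = $1.50 → $3.00
```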
For a solo operator running a few agents, this level of visibility is the difference between confidently shipping agent workflows and crossing your fingers every deploy.
## Rough Edges and Honest Takes
A few things that are still annoying in 2026:
**Auto-instrumentation for LLM SDKs is patchy.** The OpenAI Python SDK has decent OTel support now, but Anthropic's is still experimental. You'll likely write some manual spans.
**Trace volume can surprise you.** Agents that loop — retries, multi-turn conversations — generate a lot of spans. Set up sampling early. A simple tail-based sampler that keeps error traces and samples 10% of success traces works well.
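That keep-errors-plus-10% policy can be expressed with the `tail_sampling` processor in the Collector's contrib distribution. A sketch — double-check the fields against the processor docs for your Collector version:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the trace is complete
    policies:
      - name: keep-errors         # always keep traces containing an error span
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-successes    # keep ~10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Note that tail sampling requires the `otel/opentelemetry-collector-contrib` build; the core distribution doesn't ship this processor.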
**Grafana dashboards take time to build.** The gen_ai semantic conventions are new enough that there aren't many pre-built dashboards. Budget an afternoon to set up your panels.
## Wrapping Up
OpenTelemetry for LLM observability isn't a silver bullet, but it's the most practical foundation I've found for self-hosted setups. The semantic conventions are mature enough to use in production, the Collector is rock-solid, and the cost of running your own Tempo + Grafana stack is a fraction of what you'd pay for a managed platform.
If you're running a handful of agents and want to actually understand what they're doing, this stack is worth the setup time.


