Tracing LLM Agents: Multi-Span Traces, Tool Calls, and Where Latency Hides

A single LLM call is easy to monitor — log the prompt, the response, the latency. An agent that calls tools, retrieves documents, and delegates to sub-agents is a pipeline, and pipelines need structured traces. The unit of observability for agents is the multi-span trace: a tree of spans capturing every step, its duration, its cost, and its parent-child relationship. Here is how to instrument agent traces, where latency actually hides, and what the trace reveals that flat logs cannot.

By the LLM Academy team · Reviewed July 2026 · Patterns apply to LangSmith, Langfuse, OpenTelemetry GenAI, and Arize Phoenix

Why agents break flat logging

Traditional LLM monitoring logs each model call as an independent event: prompt in, response out, latency, tokens, cost. This works for a single-shot completion. It collapses completely for agents. When a user asks an agent a question, the agent might: (1) call the LLM to plan, (2) call a search tool, (3) call the LLM to interpret results, (4) call a code-execution tool, (5) call the LLM to synthesize the answer. That is three LLM calls and two tool calls, all in service of one user-visible response, and real agents often loop through more.

Flat logging shows five disconnected events with no way to know they belong to the same user turn. You cannot answer "how long did this agent turn take end-to-end?" or "which step was the bottleneck?" or "did the search tool return garbage that caused the bad final answer?" without the parent-child structure that a trace provides.

The trace is the unit

For agent observability, the fundamental unit is the trace (one user turn), not the LLM call. Every LLM call, tool call, and retrieval is a span within that trace. Optimize your dashboards around trace-level metrics (end-to-end latency, total cost per turn, tool success rate) before span-level ones.
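
To make the trace-level view concrete, here is a minimal sketch of rolling one trace's spans up into the three metrics named above. The `Span` fields and the `kind` labels ("llm", "tool") are hypothetical names chosen for illustration, not any platform's schema.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    kind: str            # "llm" or "tool" (hypothetical labels)
    start_s: float       # seconds since an arbitrary reference point
    end_s: float
    cost_usd: float = 0.0
    ok: bool = True


def trace_metrics(spans: list[Span]) -> dict:
    """Roll one trace's spans up into trace-level numbers."""
    tools = [s for s in spans if s.kind == "tool"]
    return {
        # End-to-end latency is wall-clock time, not the sum of span
        # durations, so concurrent spans are not double-counted.
        "latency_s": max(s.end_s for s in spans) - min(s.start_s for s in spans),
        "cost_usd": round(sum(s.cost_usd for s in spans), 6),
        "llm_calls": sum(1 for s in spans if s.kind == "llm"),
        "tool_success_rate": (sum(s.ok for s in tools) / len(tools)) if tools else None,
    }


spans = [
    Span("t1", "planner", "llm", 0.0, 1.2, cost_usd=0.012),
    Span("t1", "search_documents", "tool", 1.2, 4.0),
    Span("t1", "get_financials", "tool", 4.0, 5.1, ok=False),
    Span("t1", "synthesizer", "llm", 5.1, 6.5, cost_usd=0.015),
]
metrics = trace_metrics(spans)
print(metrics)
```

Computing latency from the earliest start and latest end, not from summed durations, keeps the number honest once tools start running in parallel.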

The span hierarchy

A well-instrumented agent trace is a tree. The root span is the user request. Children are the major phases — planning, tool execution, synthesis. Each tool execution spawns its own children for the tool's internal steps (a retrieval tool might have sub-spans for embedding, vector search, and reranking). Sub-agents appear as nested traces within the parent trace.

# Trace structure for a research-agent turn
TRACE: user asks "summarize Q3 revenue drivers" (12.4s total)
├─ span: planner LLM call              (1.2s, 850 tok, $0.012)
├─ span: tool: search_documents       (2.8s)
│   ├─ span: embed query              (0.1s)
│   ├─ span: vector search            (2.5s)  ← bottleneck
│   └─ span: rerank results           (0.2s)
├─ span: tool: get_financials         (1.1s)
├─ span: interpreter LLM call         (2.1s, 2200 tok, $0.022)
├─ span: tool: code_exec (chart)      (3.8s)
└─ span: synthesizer LLM call         (1.4s, 1500 tok, $0.015)
# Total cost: $0.049. Total latency: 12.4s.
# Largest span: code_exec at 3.8s (31% of total). Vector search, at 2.5s (20%),
# is the bottleneck inside search_documents.

This structure makes optimization actionable. Without it, you know the turn took 12.4 seconds. With it, you know chart code execution took 3.8 of those seconds and vector search another 2.5, so those two spans are the first places to look. LangSmith, Langfuse, Phoenix, and OpenTelemetry-compatible backends all render this tree and let you drill into any span.
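
The same drill-down can be done programmatically. The sketch below rebuilds the example trace as a tree and ranks leaf spans by duration; the `Node` class and `leaves` helper are hypothetical names for illustration, and the timings are copied from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    seconds: float
    children: list["Node"] = field(default_factory=list)


def leaves(node: Node):
    """Yield spans with no children, which is where time is actually spent."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


trace = Node("research turn", 12.4, [
    Node("planner LLM call", 1.2),
    Node("tool: search_documents", 2.8, [
        Node("embed query", 0.1),
        Node("vector search", 2.5),
        Node("rerank results", 0.2),
    ]),
    Node("tool: get_financials", 1.1),
    Node("interpreter LLM call", 2.1),
    Node("tool: code_exec (chart)", 3.8),
    Node("synthesizer LLM call", 1.4),
])

# Rank leaf spans by duration and print each one's share of the turn.
ranked = sorted(leaves(trace), key=lambda n: n.seconds, reverse=True)
for n in ranked[:3]:
    print(f"{n.name:28s} {n.seconds:4.1f}s  {n.seconds / trace.seconds:5.1%}")
```

Ranking leaves instead of top-level spans matters here: it surfaces vector search (2.5s) as its own line item instead of hiding it inside the 2.8s search tool, and puts it next to the 3.8s chart execution for comparison.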

Where latency actually hides

When teams first instrument agent traces, the consistent finding is that the LLM call is only 30-50% of total latency. The rest is tool I/O and retrieval. Here are the most common latency sinks, in order of how often trace analysis surfaces them.

| Latency sink | Typical cost | Fix |
| --- | --- | --- |
| Sequential tool calls that could be parallel | 2-5s wasted | Run independent tools concurrently |
| Slow vector search (brute-force, no ANN index) | 1-3s per retrieval | Use HNSW or IVF; cache embeddings |
| Unnecessary re-planning LLM calls | 1-2s per redundant call | Cache plans for similar intents |
| Verbose system prompts inflating prefill | 0.5-1.5s | Trim prompt; enable prefix caching |
| Synchronous external API calls (no timeout) | Variable, unbounded | Add timeouts + circuit breakers |
| Re-reading full context per turn instead of caching | Compounds over turns | Use KV cache reuse / RadixAttention |

The parallelism opportunity

The single highest-ROI optimization trace analysis reveals is parallelizing independent tool calls. If your agent calls a search tool and a database tool that don't depend on each other, running them concurrently reduces that phase's latency from the sum of the two calls to the slower of the two, which is roughly half when they take similar time. Many agent frameworks default to sequential execution; traces make the waste obvious.
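
A minimal sketch of the difference, using Python's asyncio. The two tool functions are hypothetical stand-ins that sleep in place of real network I/O, and the 2-second timeout is an arbitrary illustrative value.

```python
import asyncio
import time


async def search_tool(query: str) -> str:
    await asyncio.sleep(0.2)          # stand-in for network I/O
    return f"docs for {query}"


async def database_tool(query: str) -> str:
    await asyncio.sleep(0.1)          # stand-in for a database round trip
    return f"rows for {query}"


async def sequential(query: str):
    # Second call does not start until the first has finished.
    return [await search_tool(query), await database_tool(query)]


async def concurrent(query: str):
    # gather() schedules both calls before waiting on either result;
    # wait_for() bounds each call so one slow tool cannot stall the turn.
    return await asyncio.gather(
        asyncio.wait_for(search_tool(query), timeout=2.0),
        asyncio.wait_for(database_tool(query), timeout=2.0),
    )


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


seq_result, seq_s = timed(sequential("q3 revenue"))
par_result, par_s = timed(concurrent("q3 revenue"))
print(f"sequential {seq_s:.2f}s, concurrent {par_s:.2f}s")
```

The concurrent version finishes in about the time of the slower call instead of the sum of both, and the per-call timeout also addresses the unbounded-external-API row in the table.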

Instrumenting traces with Langfuse

Langfuse (open-source, self-hostable) is the most framework-agnostic option. The instrumentation is a decorator-based API that automatically creates spans and nests them correctly.

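# Langfuse agent tracing (Python SDK)
# llm, vector_db, and PLAN_PROMPT are placeholders for your own model client,
# vector store, and prompt template; they are not part of Langfuse.
from langfuse import Langfuse
from langfuse.decorators import observe   # v2 SDK path; newer SDK versions export
                                          # observe from `langfuse` directly, so
                                          # check your installed version

langfuse = Langfuse()               # reads LANGFUSE_* credentials from the environment

@observe()                          # creates the root trace span
def research_agent(query: str) -> str:
    plan = plan_step(query)         # each @observe() child auto-nests
    docs = search_tool(plan)
    answer = synthesize(query, docs)
    return answer

@observe(as_type="generation")      # marks this as an LLM call
def plan_step(query: str) -> str:
    return llm.complete(PLAN_PROMPT.format(query=query))

@observe()                          # plain span wrapping the tool call
def search_tool(plan: str) -> list:
    return vector_db.search(plan)

@observe(as_type="generation")      # second LLM call: final answer
def synthesize(query: str, docs: list) -> str:
    return llm.complete(f"Answer {query!r} using: {docs}")
# Each span records timing plus the function's inputs and outputs.
# Token usage and cost on generation spans come from a Langfuse model
# integration or from usage you report explicitly.
# View the tree in the Langfuse UI.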

The key principle is that each function that should appear in the trace gets an @observe() decorator, and Langfuse handles the parent-child nesting via a context variable. You do not manually pass trace IDs around. See our Self-host Langfuse tutorial for the full deployment setup.
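
To show why no trace IDs need to be passed around, here is a toy decorator built on Python's standard `contextvars` module. It is a sketch of the general mechanism only, not Langfuse's implementation; the `observe`, `_current`, and `finished` names here are hypothetical and local to this example.

```python
import contextvars
import functools
import time

# Holds the span that is currently executing in this thread or async task.
_current = contextvars.ContextVar("current_span", default=None)
finished = []   # collected spans, in completion order


def observe(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        parent = _current.get()             # whoever called us, if anyone
        span = {"name": fn.__name__,
                "parent": parent["name"] if parent else None}
        token = _current.set(span)          # this span is now "current"
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            span["seconds"] = time.perf_counter() - start
            _current.reset(token)           # restore the parent as current
            finished.append(span)
    return wrapper


@observe
def search_tool(plan):
    return ["doc-1", "doc-2"]


@observe
def plan_step(query):
    return f"plan for {query}"


@observe
def research_agent(query):
    return search_tool(plan_step(query))


research_agent("q3 revenue drivers")
for span in finished:
    print(span["name"], "<-", span["parent"])
```

Each decorated function reads the current span to find its parent, installs itself as current for the duration of the call, and restores the parent on exit. Because context variables are scoped per thread and per asyncio task, the same approach keeps concurrent turns from tangling their spans.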

OpenTelemetry and GenAI semantic conventions

If your team already runs OpenTelemetry for backend observability (Datadog, Honeycomb, Grafana Tempo), you may not need a separate LLM tracing tool. The OpenTelemetry GenAI semantic conventions (published from 2024 onward; check their current stability status before depending on them) define standard attributes for LLM spans: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens (earlier drafts used gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens), and so on. Instrument your agent with OTel spans using these attributes and any backend that understands the conventions can display them in your existing observability stack.
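
As a rough illustration, the helper below builds the attribute dictionary for one LLM span using only the standard library. Attribute names have been renamed between versions of the conventions (token counts appear as input_tokens/output_tokens in newer versions and prompt_tokens/completion_tokens in older ones), so treat the keys as an assumption to verify against the spec version your backend expects. The `genai_attributes` function and the "example-model" value are hypothetical.

```python
def genai_attributes(provider: str, model: str,
                     input_tokens: int, output_tokens: int) -> dict:
    """Build span attributes for one chat-completion call."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.system": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


attrs = genai_attributes("openai", "example-model", 850, 120)
print(attrs)

# With the opentelemetry-api package installed, the dict can be attached
# to a span roughly like this:
#
#   from opentelemetry import trace
#   tracer = trace.get_tracer("agent")
#   with tracer.start_as_current_span("chat example-model") as span:
#       span.set_attributes(attrs)
```

Keeping the attribute mapping in one function means a spec rename is a one-line change instead of a search across every instrumented call site.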

The trade-off: OTel requires more manual instrumentation than Langfuse or LangSmith's decorator magic, and the GenAI conventions are newer with less community tooling. But for teams committed to a single observability vendor, it avoids running a parallel LLM-specific platform. Our platform comparison breaks down when each approach wins.

Metrics that matter for agents

Beyond per-trace debugging, aggregate trace metrics drive operational decisions. The dashboards that matter for agent workloads differ from simple LLM-call dashboards.

| Metric | What it tells you | Healthy range |
| --- | --- | --- |
| P50/P99 trace latency | End-to-end user experience | P50 <5s, P99 <15s (varies by use case) |
| Cost per trace | Spend efficiency per user turn | Track trend, alert on spikes |
| Tool success rate | Reliability of non-LLM steps | >95% per tool |
| LLM calls per trace | Agent efficiency (fewer = cheaper, faster) | Lower is better; alert on creep |
| Trace error rate | Agent failed to produce a valid response | <2% |
| Token cost per trace | Breakdown of where tokens go | Track by span type |

Watch the LLM-calls-per-trace trend

If your average LLM calls per trace creeps from 3.2 to 4.5 over a month, your agent is getting chattier — usually because of fallback logic or a degrading tool that triggers retries. This metric is an early warning signal for both cost and latency regressions.
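
A small sketch of how two of these dashboard numbers might be computed from exported trace data. The sample values are made up to mirror the 3.2 to 4.5 example, and the 20% tolerance and the function names are arbitrary illustrative choices.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile over a list of trace latencies."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def calls_creep(baseline: list[int], current: list[int],
                tolerance: float = 0.2):
    """Return (alert, baseline_mean, current_mean) for LLM calls per trace."""
    base = statistics.mean(baseline)
    cur = statistics.mean(current)
    return cur > base * (1 + tolerance), base, cur


latencies_s = [2.1, 3.4, 3.9, 4.2, 4.8, 5.5, 6.1, 7.0, 9.3, 14.2]
last_month = [3, 3, 4, 3, 3]        # LLM calls per trace, mean 3.2
this_month = [4, 5, 4, 5, 4, 5]     # mean 4.5

p50 = percentile(latencies_s, 50)
p99 = percentile(latencies_s, 99)
alert, base, cur = calls_creep(last_month, this_month)
print(f"P50 {p50}s, P99 {p99}s, calls/trace {base:.1f} -> {cur:.1f}, alert={alert}")
```

Comparing against a fixed baseline window, not the previous day, is what catches slow creep: a day-over-day check would see each small increase as noise.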

Connecting traces to evals

Traces tell you what happened; evals tell you whether it was good. The two are most powerful when connected: tag each trace with its eval scores (or run LLM-as-a-Judge asynchronously on sampled traces), and you can slice latency and cost by quality. "The traces that scored below 4/5 on helpfulness also cost 40% more — they were retrying failed tool calls" is the kind of insight that only trace-plus-eval surfaces.

All three major platforms (LangSmith, Langfuse, Phoenix) support attaching eval scores to traces. The pattern is: run the trace in production, asynchronously score the final output with an LLM judge or rubric, and write the score back as a trace attribute. Then your dashboards can filter and aggregate by quality, not just by operational metrics.
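
The write-back-then-slice pattern can be sketched in a few lines. Everything here is illustrative: the trace records are made-up sample data, `judge` is a hypothetical stand-in for an asynchronous LLM judge, and a real platform would store the score through its own scoring API, not in a dict.

```python
from statistics import mean

# Made-up sample of production traces.
traces = [
    {"id": "t1", "cost_usd": 0.049, "latency_s": 12.4},
    {"id": "t2", "cost_usd": 0.031, "latency_s": 6.0},
    {"id": "t3", "cost_usd": 0.070, "latency_s": 15.1},
    {"id": "t4", "cost_usd": 0.035, "latency_s": 7.2},
]


def judge(trace_id: str) -> int:
    """Hypothetical stand-in for an LLM judge; returns a 1-5 score."""
    return {"t1": 3, "t2": 5, "t3": 2, "t4": 4}[trace_id]


# Step 1: write each score back onto its trace.
for t in traces:
    t["helpfulness"] = judge(t["id"])


def mean_cost(rows) -> float:
    return mean(r["cost_usd"] for r in rows)


# Step 2: slice an operational metric by quality.
low = [t for t in traces if t["helpfulness"] < 4]
high = [t for t in traces if t["helpfulness"] >= 4]
extra = mean_cost(low) / mean_cost(high) - 1
print(f"low-scoring traces cost {extra:.0%} more on average")
```

Once the score lives on the trace record, any operational metric (latency, LLM calls, tool failures) can be split the same way.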

FAQ

What is an LLM agent trace?

A tree of spans capturing every step in an agent's execution — the initial LLM call, tool invocations, retrieval lookups, sub-agent delegations, and final response. Each span records duration, input, output, token cost, and parent-child relationship, letting you localize latency and cost to the responsible step.

Why do agent traces need multi-span structure?

Because one user turn triggers multiple LLM and tool calls in sequence or parallel. A flat log shows disconnected events; a multi-span trace captures the parent-child hierarchy (request → planning → tool calls → synthesis), making it possible to see which step dominated latency or caused a bad outcome.

Where does latency hide in agent pipelines?

Mostly in non-LLM steps: sequential tool calls that could be parallelized (biggest opportunity), slow vector searches, unnecessary re-planning calls, verbose system prompts inflating prefill, and synchronous external APIs without timeouts. Trace analysis typically shows the LLM call is only 30-50% of total latency.

LangSmith vs Langfuse vs OpenTelemetry for agent tracing?

LangSmith is the most polished for LangChain-native agents but is a proprietary SaaS product. Langfuse is open-source, framework-agnostic, and self-hostable, which makes it the best fit for vendor independence. OpenTelemetry with GenAI semantic conventions integrates with existing observability stacks (Datadog, Honeycomb) and is the best fit if your team already runs OTel. See our platform comparison.

Sources

Latency breakdowns are workload-dependent. The "LLM is 30-50% of latency" figure reflects typical tool-using agent pipelines; your mix will vary based on which tools you call and how.