Tracing LLM Agents: Multi-Span Traces, Tool Calls, and Where Latency Hides

Q: What is an LLM agent trace?

An LLM agent trace is a tree of spans representing every step in an agent's execution: the initial LLM call, tool invocations, retrieval lookups, sub-agent delegations, and the final response. Each span records its duration, input, output, token cost, and parent-child relationship. The trace lets you see which step dominated latency, which tool failed, and how the agent spent its token budget.

Q: Why do agent traces need multi-span structure?

A single flat LLM-call log cannot represent agent workflows, where one user turn triggers multiple LLM calls, tool calls, and sub-agent runs in sequence or parallel. Multi-span traces capture the parent-child hierarchy (the user request spawns a planning span, which spawns tool-call spans, which each spawn verification spans), making it possible to localize latency and cost to the specific step responsible.

Q: Where does latency hide in agent pipelines?

The most common latency sinks are: (1) sequential tool calls that could be parallelized; (2) retrieval steps with slow vector searches over large indexes; (3) unnecessary re-planning LLM calls when the plan could be cached; (4) verbose system prompts inflating prefill time. Trace analysis typically reveals that the LLM call is only 30-50% of total latency — the rest is tool I/O and retrieval.

Q: LangSmith vs Langfuse vs OpenTelemetry for agent tracing?

LangSmith is the most polished for LangChain-native agents but is proprietary, with self-hosting limited to its enterprise plan. Langfuse is open-source, framework-agnostic, and self-hostable, which makes it the best choice for vendor independence. OpenTelemetry (with GenAI semantic conventions) integrates with existing observability stacks (Datadog, Honeycomb) and is the right choice if your team already runs OTel. See our platform comparison for details.

A single LLM call is easy to monitor — log the prompt, the response, the latency. An agent that calls tools, retrieves documents, and delegates to sub-agents is a pipeline, and pipelines need structured traces. The unit of observability for agents is the multi-span trace: a tree of spans capturing every step, its duration, its cost, and its parent-child relationship. Here is how to instrument agent traces, where latency actually hides, and what the trace reveals that flat logs cannot.

By the LLM Academy team · Reviewed July 2026 · Patterns apply to LangSmith, Langfuse, OpenTelemetry GenAI, and Arize Phoenix

Why agents break flat logging

Traditional LLM monitoring logs each model call as an independent event: prompt in, response out, latency, tokens, cost. This works for a single-shot completion. It collapses completely for agents. When a user asks an agent a question, the agent might: (1) call the LLM to plan, (2) call a search tool, (3) call the LLM to interpret results, (4) call a code-execution tool, (5) call the LLM to synthesize the answer. That is three LLM calls and two tool calls, all in service of one user-visible response.

Flat logging shows five disconnected events with no way to know they belong to the same user turn. You cannot answer "how long did this agent turn take end-to-end?" or "which step was the bottleneck?" or "did the search tool return garbage that caused the bad final answer?" without the parent-child structure that a trace provides.

The trace is the unit

For agent observability, the fundamental unit is the trace (one user turn), not the LLM call. Every LLM call, tool call, and retrieval is a span within that trace. Optimize your dashboards around trace-level metrics (end-to-end latency, total cost per turn, tool success rate) before span-level ones.

The span hierarchy

A well-instrumented agent trace is a tree. The root span is the user request. Children are the major phases — planning, tool execution, synthesis. Each tool execution spawns its own children for the tool's internal steps (a retrieval tool might have sub-spans for embedding, vector search, and reranking). Sub-agents appear as nested traces within the parent trace.

# Trace structure for a research-agent turn
TRACE: user asks "summarize Q3 revenue drivers" (12.4s total)
├─ span: planner LLM call              (1.2s, 850 tok, $0.012)
├─ span: tool: search_documents       (2.8s)
│   ├─ span: embed query              (0.1s)
│   ├─ span: vector search            (2.5s)  ← slowest retrieval step
│   └─ span: rerank results           (0.2s)
├─ span: tool: get_financials         (1.1s)
├─ span: interpreter LLM call         (2.1s, 2200 tok, $0.022)
├─ span: tool: code_exec (chart)      (3.8s)  ← largest single span
└─ span: synthesizer LLM call         (1.4s, 1500 tok, $0.015)
# Total cost: $0.049. Total latency: 12.4s.
# LLM calls: 4.7s (38% of total). code_exec: 3.8s (31%). Vector search: 2.5s (20%).

This structure makes optimization actionable. Without it, you know the turn took 12.4 seconds; with it, you know that code execution took 3.8 of those seconds, that vector search accounts for 2.5 of the 2.8 seconds spent in search_documents, and that the three LLM calls together took 4.7 seconds. The largest individual spans are the first candidates to optimize. LangSmith, Langfuse, Phoenix, and OpenTelemetry-compatible backends all render this tree and let you drill into any span.
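The same analysis can be done programmatically once spans are available as a tree. The sketch below is a minimal illustration using the numbers from the example trace; the `Span` class and helper names are assumptions made for this example, not any platform's export format.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    seconds: float
    cost_usd: float = 0.0
    children: list["Span"] = field(default_factory=list)


def leaves(span):
    """Yield spans without children, i.e. where time is actually spent."""
    if not span.children:
        yield span
    for child in span.children:
        yield from leaves(child)


def total_cost(span):
    """Sum cost over a span and all of its descendants."""
    return span.cost_usd + sum(total_cost(c) for c in span.children)


# Durations and costs copied from the example trace above.
root = Span("research_agent turn", 12.4, children=[
    Span("planner LLM call", 1.2, 0.012),
    Span("tool: search_documents", 2.8, children=[
        Span("embed query", 0.1),
        Span("vector search", 2.5),
        Span("rerank results", 0.2),
    ]),
    Span("tool: get_financials", 1.1),
    Span("interpreter LLM call", 2.1, 0.022),
    Span("tool: code_exec (chart)", 3.8),
    Span("synthesizer LLM call", 1.4, 0.015),
])

# Rank leaf spans by duration and show each one's share of the turn.
ranked = sorted(leaves(root), key=lambda s: s.seconds, reverse=True)
for s in ranked[:3]:
    print(f"{s.name:28s} {s.seconds:4.1f}s  {s.seconds / root.seconds:5.1%}")
print(f"total cost: ${total_cost(root):.3f}")
```

Ranking leaf spans rather than top-level spans avoids counting a parent and its children twice, and it surfaces slow sub-steps that are hidden inside a tool span.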

Where latency actually hides

When teams first instrument agent traces, a common finding is that LLM calls account for only 30-50% of total latency. The rest is tool I/O and retrieval. Here are the most common latency sinks, roughly in order of how often trace analysis surfaces them.

Latency sink	Typical cost	Fix
Sequential tool calls that could be parallel	2-5s wasted	Run independent tools concurrently
Slow vector search (brute-force, no ANN index)	1-3s per retrieval	Use HNSW or IVF; cache embeddings
Unnecessary re-planning LLM calls	1-2s per redundant call	Cache plans for similar intents
Verbose system prompts inflating prefill	0.5-1.5s	Trim prompt; enable prefix caching
Synchronous external API calls (no timeout)	Variable, unbounded	Add timeouts + circuit breakers
Re-reading full context per turn instead of caching	Compounds over turns	Use KV cache reuse / RadixAttention
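As one illustration of the fixes in the table, the sketch below bounds an external call with a timeout so that a slow dependency cannot stall the whole trace. It uses only the standard library; `slow_api` and `fast_api` are hypothetical stand-ins simulated with sleeps, and a production setup would likely add retries or a circuit breaker on top.

```python
import asyncio


async def call_with_timeout(coro, seconds, fallback=None):
    """Await `coro`, but give up after `seconds` and return `fallback`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return fallback


async def slow_api():
    # Hypothetical external API that takes too long.
    await asyncio.sleep(0.5)
    return {"status": "ok"}


async def fast_api():
    # Hypothetical external API that answers quickly.
    await asyncio.sleep(0.01)
    return {"status": "ok"}


async def demo():
    slow = await call_with_timeout(slow_api(), seconds=0.05,
                                   fallback={"status": "timeout"})
    fast = await call_with_timeout(fast_api(), seconds=1.0)
    return slow, fast


slow_result, fast_result = asyncio.run(demo())
print(slow_result, fast_result)
```

With a timeout in place, the worst-case duration of the span is known in advance, which makes P99 trace latency much easier to reason about.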

The parallelism opportunity

The single highest-ROI optimization that trace analysis reveals is often parallelizing independent tool calls. If your agent calls a search tool and a database tool that don't depend on each other, running them concurrently reduces that phase's latency from the sum of the two durations to the longer of the two, which is a halving only when both take equally long. Many agent frameworks default to sequential execution; traces make the waste obvious.
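A minimal sketch of the difference, assuming two independent async tools. `search_tool` and `database_tool` here are hypothetical and simulated with sleeps; real tools would perform network I/O instead.

```python
import asyncio
import time


async def search_tool(query):
    # Hypothetical tool, simulated with a 0.2s wait.
    await asyncio.sleep(0.2)
    return f"docs for {query}"


async def database_tool(query):
    # Hypothetical tool, simulated with a 0.1s wait.
    await asyncio.sleep(0.1)
    return f"rows for {query}"


async def sequential(query):
    docs = await search_tool(query)
    rows = await database_tool(query)
    return docs, rows


async def concurrent(query):
    # gather() starts both coroutines and waits for both to finish.
    return await asyncio.gather(search_tool(query), database_tool(query))


def timed(coro_fn, query):
    start = time.perf_counter()
    result = asyncio.run(coro_fn(query))
    return result, time.perf_counter() - start


seq_result, seq_time = timed(sequential, "q3 revenue")
con_result, con_time = timed(concurrent, "q3 revenue")
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential version takes roughly the sum of the two waits, while the concurrent version takes roughly the longer one. In a trace, the concurrent version shows up as two sibling spans with overlapping time ranges.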

Instrumenting traces with Langfuse

Langfuse (open-source, self-hostable) is the most framework-agnostic option. The instrumentation is a decorator-based API that automatically creates spans and nests them correctly.

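# Langfuse agent tracing (Python SDK)
# The import path below is the v2 SDK; in v3 the decorator is exposed as
# `from langfuse import observe`. Credentials are read from the
# LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST env vars.
from langfuse.decorators import observe

# `llm`, `vector_db`, PLAN_PROMPT, and ANSWER_PROMPT are placeholders for
# your own model client, vector store, and prompt templates.

@observe()                          # outermost call becomes the root of the trace
def research_agent(query: str) -> str:
    plan = plan_step(query)         # each @observe() child auto-nests
    docs = search_tool(plan)
    answer = synthesize(query, docs)
    return answer

@observe(as_type="generation")      # marks this as an LLM call
def plan_step(query: str) -> str:
    return llm.complete(PLAN_PROMPT.format(query=query))

@observe()                          # plain span wrapping the tool call
def search_tool(plan: str) -> list:
    return vector_db.search(plan)

@observe(as_type="generation")      # second LLM call, also nested under the root
def synthesize(query: str, docs: list) -> str:
    return llm.complete(ANSWER_PROMPT.format(query=query, docs=docs))
# Every span records duration, inputs, and outputs automatically.
# Token and cost figures appear on generations only when the model name
# and usage are reported (via an SDK integration or by updating the
# observation); a bare decorator cannot infer them.
# View the tree in the Langfuse UI.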

The key principle is that each function that should appear in the trace gets an @observe() decorator, and Langfuse handles the parent-child nesting via a context variable. You do not manually pass trace IDs around. See our Self-host Langfuse tutorial for the full deployment setup.

OpenTelemetry and GenAI semantic conventions

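If your team already runs OpenTelemetry for backend observability (Datadog, Honeycomb, Grafana Tempo), you do not need a separate LLM tracing tool. The OpenTelemetry GenAI semantic conventions (introduced in 2024 and still evolving, so check the current specification for their stability status) define standard attributes for LLM spans: gen_ai.operation.name, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and so on. Earlier drafts named the token attributes gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens, and gen_ai.system has been superseded by gen_ai.provider.name in recent releases. Instrument your agent with OTel spans using these attributes and they appear in your existing observability stack alongside your other spans.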

The trade-off: OTel requires more manual instrumentation than Langfuse or LangSmith's decorator magic, and the GenAI conventions are newer with less community tooling. But for teams committed to a single observability vendor, it avoids running a parallel LLM-specific platform. Our platform comparison breaks down when each approach wins.

Metrics that matter for agents

Beyond per-trace debugging, aggregate trace metrics drive operational decisions. The dashboards that matter for agent workloads differ from simple LLM-call dashboards.

Metric	What it tells you	Healthy range
P50/P99 trace latency	End-to-end user experience	P50 <5s, P99 <15s (varies by use case)
Cost per trace	Spend efficiency per user turn	Track trend, alert on spikes
Tool success rate	Reliability of non-LLM steps	>95% per tool
LLM calls per trace	Agent efficiency (fewer = cheaper, faster)	Lower is better; alert on creep
Trace error rate	Agent failed to produce a valid response	<2%
Token cost per trace	Breakdown of where tokens go	Track by span type
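The metrics in the table can be derived from per-trace records. The sketch below assumes each trace has already been flattened to a small dict; the field names and sample values are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever interpolation your dashboard tool applies.

```python
import math
from statistics import mean

# Hypothetical trace records: one dict per user turn.
traces = [
    {"latency_s": 3.1, "cost_usd": 0.031, "llm_calls": 3, "tool_calls": 2, "tool_failures": 0, "error": False},
    {"latency_s": 4.2, "cost_usd": 0.040, "llm_calls": 3, "tool_calls": 2, "tool_failures": 0, "error": False},
    {"latency_s": 12.4, "cost_usd": 0.049, "llm_calls": 3, "tool_calls": 3, "tool_failures": 0, "error": False},
    {"latency_s": 5.0, "cost_usd": 0.052, "llm_calls": 4, "tool_calls": 3, "tool_failures": 1, "error": False},
    {"latency_s": 18.7, "cost_usd": 0.090, "llm_calls": 6, "tool_calls": 4, "tool_failures": 2, "error": True},
]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    latencies = [t["latency_s"] for t in records]
    tool_calls = sum(t["tool_calls"] for t in records)
    tool_failures = sum(t["tool_failures"] for t in records)
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p99_latency_s": percentile(latencies, 99),
        "mean_cost_usd": mean(t["cost_usd"] for t in records),
        "llm_calls_per_trace": mean(t["llm_calls"] for t in records),
        "tool_success_rate": 1 - tool_failures / tool_calls,
        "trace_error_rate": sum(t["error"] for t in records) / len(records),
    }


summary = summarize(traces)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

In this sample the slowest trace is also the one with the most LLM calls and tool failures, which is the kind of correlation the trace-level view is meant to expose.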

Watch the LLM-calls-per-trace trend

If your average LLM calls per trace creeps from 3.2 to 4.5 over a month, your agent is getting chattier — usually because of fallback logic or a degrading tool that triggers retries. This metric is an early warning signal for both cost and latency regressions.

Connecting traces to evals

Traces tell you what happened; evals tell you whether it was good. The two are most powerful when connected: tag each trace with its eval scores (or run LLM-as-a-Judge asynchronously on sampled traces), and you can slice latency and cost by quality. "The traces that scored below 4/5 on helpfulness also cost 40% more — they were retrying failed tool calls" is the kind of insight that only trace-plus-eval surfaces.

All three major platforms (LangSmith, Langfuse, Phoenix) support attaching eval scores to traces. The pattern is: run the trace in production, asynchronously score the final output with an LLM judge or rubric, and write the score back as a trace attribute. Then your dashboards can filter and aggregate by quality, not just by operational metrics.
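Once scores are attached, slicing operational metrics by quality is a simple group-by. The sketch below uses hypothetical trace records with a 1-5 helpfulness score; the field names, threshold, and values are illustrative and not taken from any platform's export format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical traces with an eval score written back after async judging.
scored_traces = [
    {"trace_id": "t1", "cost_usd": 0.040, "latency_s": 4.0, "helpfulness": 5},
    {"trace_id": "t2", "cost_usd": 0.050, "latency_s": 5.0, "helpfulness": 4},
    {"trace_id": "t3", "cost_usd": 0.060, "latency_s": 9.0, "helpfulness": 3},
    {"trace_id": "t4", "cost_usd": 0.066, "latency_s": 11.0, "helpfulness": 2},
]


def slice_by_quality(records, threshold=4):
    """Group traces into low/high quality and average cost and latency per group."""
    buckets = defaultdict(list)
    for t in records:
        key = "low" if t["helpfulness"] < threshold else "high"
        buckets[key].append(t)
    return {
        key: {
            "count": len(ts),
            "mean_cost_usd": mean(t["cost_usd"] for t in ts),
            "mean_latency_s": mean(t["latency_s"] for t in ts),
        }
        for key, ts in buckets.items()
    }


report = slice_by_quality(scored_traces)
extra_cost = report["low"]["mean_cost_usd"] / report["high"]["mean_cost_usd"] - 1
print(f"low-scoring traces cost {extra_cost:.0%} more on average")
```

The same grouping works for any span-level metric, so a follow-up question such as which tool is retrying in the low-scoring traces can be answered by drilling into the spans of that bucket.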

FAQ

What is an LLM agent trace?

A tree of spans capturing every step in an agent's execution — the initial LLM call, tool invocations, retrieval lookups, sub-agent delegations, and final response. Each span records duration, input, output, token cost, and parent-child relationship, letting you localize latency and cost to the responsible step.

Why do agent traces need multi-span structure?

Because one user turn triggers multiple LLM and tool calls in sequence or parallel. A flat log shows disconnected events; a multi-span trace captures the parent-child hierarchy (request → planning → tool calls → synthesis), making it possible to see which step dominated latency or caused a bad outcome.

Where does latency hide in agent pipelines?

Mostly in non-LLM steps: sequential tool calls that could be parallelized (biggest opportunity), slow vector searches, unnecessary re-planning calls, verbose system prompts inflating prefill, and synchronous external APIs without timeouts. Trace analysis typically shows the LLM call is only 30-50% of total latency.

LangSmith vs Langfuse vs OpenTelemetry for agent tracing?

LangSmith is most polished for LangChain-native agents but proprietary, with self-hosting limited to its enterprise plan. Langfuse is open-source, framework-agnostic, and self-hostable, which makes it best for vendor independence. OpenTelemetry with GenAI semantic conventions integrates with existing observability stacks (Datadog, Honeycomb) and is best if your team already runs OTel. See our platform comparison.

Related deep dives

LangSmith vs Langfuse vs Phoenix — which tracing platform fits your stack
Self-host Langfuse — deployment guide for the open-source tracing platform
LLM-as-a-Judge — how to score traces for quality and connect evals to observability
Hallucination Detection in Production — using traces to catch when agents fabricate
Cost Monitoring Dashboards — per-span cost data rolls up into team-level spend visibility
Prefix Caching — reduces the prefill latency that traces often surface

Latency breakdowns are workload-dependent. The "LLM is 30-50% of latency" figure reflects typical tool-using agent pipelines; your mix will vary based on which tools you call and how.