Tracing LLM Agents: Multi-Span Traces, Tool Calls, and Where Latency Hides
A single LLM call is easy to monitor — log the prompt, the response, the latency. An agent that calls tools, retrieves documents, and delegates to sub-agents is a pipeline, and pipelines need structured traces. The unit of observability for agents is the multi-span trace: a tree of spans capturing every step, its duration, its cost, and its parent-child relationship. Here is how to instrument agent traces, where latency actually hides, and what the trace reveals that flat logs cannot.
Why agents break flat logging
Traditional LLM monitoring logs each model call as an independent event: prompt in, response out, latency, tokens, cost. This works for a single-shot completion. It collapses completely for agents. When a user asks an agent a question, the agent might: (1) call the LLM to plan, (2) call a search tool, (3) call the LLM to interpret results, (4) call a code-execution tool, (5) call the LLM to synthesize the answer. That is three LLM calls and two tool calls, all in service of one user-visible response, and real agents often loop through more.
Flat logging shows five disconnected events with no way to know they belong to the same user turn. You cannot answer "how long did this agent turn take end-to-end?" or "which step was the bottleneck?" or "did the search tool return garbage that caused the bad final answer?" without the parent-child structure that a trace provides.
For agent observability, the fundamental unit is the trace (one user turn), not the LLM call. Every LLM call, tool call, and retrieval is a span within that trace. Optimize your dashboards around trace-level metrics (end-to-end latency, total cost per turn, tool success rate) before span-level ones.
The span hierarchy
A well-instrumented agent trace is a tree. The root span is the user request. Children are the major phases: planning, tool execution, synthesis. Each tool execution spawns its own children for the tool's internal steps (a retrieval tool might have sub-spans for embedding, vector search, and reranking). Sub-agents appear as nested subtrees of spans within the parent trace.
# Trace structure for a research-agent turn
TRACE: user asks "summarize Q3 revenue drivers" (12.4s total)
├─ span: planner LLM call (1.2s, 850 tok, $0.012)
├─ span: tool: search_documents (2.8s)
│ ├─ span: embed query (0.1s)
│ ├─ span: vector search (2.5s) ← bottleneck
│ └─ span: rerank results (0.2s)
├─ span: tool: get_financials (1.1s)
├─ span: interpreter LLM call (2.1s, 2200 tok, $0.022)
├─ span: tool: code_exec (chart) (3.8s)
└─ span: synthesizer LLM call (1.4s, 1500 tok, $0.015)
# Total cost: $0.049. Total latency: 12.4s.
# Vector search is the slowest retrieval step at 2.5s (20% of total);
# code_exec is the longest single span at 3.8s (31% of total).
This structure makes optimization actionable. Without it, you know the turn took 12.4 seconds; with it, you know the vector search took 2.5 of those seconds and is a clear optimization target. Tracing platforms such as LangSmith, Langfuse, and Phoenix, as well as OpenTelemetry-compatible backends, render this tree and let you drill into any span.
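The tree above can be modeled with a few lines of code. The following is a minimal sketch, not any platform's data model: the `Span` class and the `leaves` and `total_cost` helpers are hypothetical names introduced for illustration, and the numbers are copied from the example trace.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in a trace; children are the steps it spawned."""
    name: str
    duration_s: float
    cost_usd: float = 0.0
    children: list["Span"] = field(default_factory=list)


def leaves(span: Span):
    """Yield the spans with no children, where time is actually spent."""
    if not span.children:
        yield span
    for child in span.children:
        yield from leaves(child)


def total_cost(span: Span) -> float:
    """Sum cost over the span and all of its descendants."""
    return span.cost_usd + sum(total_cost(c) for c in span.children)


# The research-agent turn from the example trace.
trace = Span("user turn", 12.4, children=[
    Span("planner LLM call", 1.2, 0.012),
    Span("tool: search_documents", 2.8, children=[
        Span("embed query", 0.1),
        Span("vector search", 2.5),
        Span("rerank results", 0.2),
    ]),
    Span("tool: get_financials", 1.1),
    Span("interpreter LLM call", 2.1, 0.022),
    Span("tool: code_exec (chart)", 3.8),
    Span("synthesizer LLM call", 1.4, 0.015),
])

# Rank leaf spans by their share of end-to-end latency.
ranked = sorted(leaves(trace), key=lambda s: s.duration_s, reverse=True)
for s in ranked[:3]:
    share = s.duration_s / trace.duration_s
    print(f"{s.name}: {s.duration_s:.1f}s ({share:.0%})")
print(f"total cost: ${total_cost(trace):.3f}")
```

On the example numbers, this ranking puts the chart code-execution span first (3.8s) and vector search second (2.5s). A flat log cannot produce this ordering because it has no notion of which events belong to the turn.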
Where latency actually hides
When teams first instrument agent traces, a common finding is that LLM calls account for only 30-50% of total latency; in the example trace above, the three LLM calls total 4.7s of 12.4s, about 38%. The rest is tool I/O and retrieval. Here are the most common latency sinks, in order of how often trace analysis surfaces them.
| Latency sink | Typical cost | Fix |
|---|---|---|
| Sequential tool calls that could be parallel | 2-5s wasted | Run independent tools concurrently |
| Slow vector search (brute-force, no ANN index) | 1-3s per retrieval | Use HNSW or IVF; cache embeddings |
| Unnecessary re-planning LLM calls | 1-2s per redundant call | Cache plans for similar intents |
| Verbose system prompts inflating prefill | 0.5-1.5s | Trim prompt; enable prefix caching |
| Synchronous external API calls (no timeout) | Variable, unbounded | Add timeouts + circuit breakers |
| Re-reading full context per turn instead of caching | Compounds over turns | Use KV cache reuse / RadixAttention |
The single highest-ROI optimization trace analysis reveals is often parallelizing independent tool calls. If your agent calls a search tool and a database tool that don't depend on each other, running them concurrently reduces that phase's latency from the sum of the two durations to the longer of the two, which is roughly half when the calls take similar time. Many agent frameworks default to sequential execution; traces make the waste obvious.
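As a rough illustration of the difference, the sketch below runs two independent tools sequentially and then concurrently with `asyncio.gather`, wrapping each call in `asyncio.wait_for` so a slow dependency cannot stall the turn indefinitely. The tool functions, their sleep durations, and the `timeout_s` default are stand-ins chosen for the example, not measurements.

```python
import asyncio
import time


async def search_tool(query: str) -> str:
    await asyncio.sleep(0.2)  # stand-in for network I/O
    return f"docs for {query!r}"


async def database_tool(query: str) -> str:
    await asyncio.sleep(0.3)  # stand-in for a database round trip
    return f"rows for {query!r}"


async def run_sequential(query: str) -> list[str]:
    # Latency is roughly the sum of both calls.
    return [await search_tool(query), await database_tool(query)]


async def run_concurrent(query: str, timeout_s: float = 5.0) -> list[str]:
    # Latency is roughly that of the slower call; each call has its own timeout.
    return await asyncio.gather(
        asyncio.wait_for(search_tool(query), timeout_s),
        asyncio.wait_for(database_tool(query), timeout_s),
    )


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


seq_result, seq_s = timed(run_sequential("q3 revenue"))
con_result, con_s = timed(run_concurrent("q3 revenue"))
print(f"sequential: {seq_s:.2f}s, concurrent: {con_s:.2f}s")
```

With these stand-in durations, the sequential phase takes about the sum of the two sleeps and the concurrent phase about the longer one. This only helps when the tools are independent: if the second call needs the first call's output, the two must stay sequential.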
Instrumenting traces with Langfuse
Langfuse (open-source, self-hostable) is a framework-agnostic option. The instrumentation is a decorator-based API that automatically creates spans and nests them correctly.
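# Langfuse agent tracing (Python SDK)
# `llm`, `vector_db`, `PLAN_PROMPT`, and `synthesize` are placeholders for your
# own LLM client, vector store, prompt template, and synthesis step.
from langfuse import Langfuse
from langfuse.decorators import observe  # SDK v2 path; v3 uses `from langfuse import observe`

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

@observe()  # creates the root trace span
def research_agent(query: str) -> str:
    plan = plan_step(query)  # each @observe() child auto-nests
    docs = search_tool(plan)
    answer = synthesize(query, docs)
    return answer

@observe(as_type="generation")  # marks this as an LLM call
def plan_step(query: str) -> str:
    return llm.complete(PLAN_PROMPT.format(query=query))

@observe()  # plain span, used here for a tool call
def search_tool(plan: str) -> list:
    return vector_db.search(plan)

# Every span records its duration automatically. Token usage and cost on
# generations depend on usage data being reported (for example by a Langfuse
# LLM integration). View the tree in the Langfuse UI.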
The key principle is that each function that should appear in the trace gets an @observe() decorator, and Langfuse handles the parent-child nesting via a context variable. You do not manually pass trace IDs around. See our Self-host Langfuse tutorial for the full deployment setup.
OpenTelemetry and GenAI semantic conventions
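If your team already runs OpenTelemetry for backend observability (Datadog, Honeycomb, Grafana Tempo), you may not need a separate LLM tracing tool. The OpenTelemetry GenAI semantic conventions (introduced in 2024 and still evolving, so check the stability status of the version you target) define standard attributes for LLM spans: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and so on. Earlier drafts named the token attributes gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens, and newer releases rename some attributes again, so pin the convention version your backend expects. Instrument your agent with OTel spans using these attributes and they show up as ordinary spans in your existing observability stack.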
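One low-friction way to keep attribute names consistent is to build them in a single helper. The sketch below is a plain-Python illustration with no OpenTelemetry dependency; `genai_span_attributes` is a hypothetical helper, the provider and model values are made up, and the attribute keys should be checked against the convention version your backend expects, since names have changed between releases.

```python
def genai_span_attributes(
    system: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
) -> dict[str, object]:
    """Build span attributes using GenAI semantic-convention style keys."""
    return {
        "gen_ai.system": system,                    # provider identifier
        "gen_ai.request.model": model,              # model that was requested
        "gen_ai.usage.input_tokens": input_tokens,  # prompt-side tokens
        "gen_ai.usage.output_tokens": output_tokens,  # completion-side tokens
    }


# Hypothetical values for a planner call.
attrs = genai_span_attributes("example-provider", "example-model", 700, 150)
for key, value in attrs.items():
    print(f"{key} = {value}")
```

With the OpenTelemetry Python API, a dict like this can be passed as the `attributes` argument when starting a span, or applied afterwards with the span's `set_attributes` method.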
The trade-off: OTel requires more manual instrumentation than Langfuse or LangSmith's decorator magic, and the GenAI conventions are newer with less community tooling. But for teams committed to a single observability vendor, it avoids running a parallel LLM-specific platform. Our platform comparison breaks down when each approach wins.
Metrics that matter for agents
Beyond per-trace debugging, aggregate trace metrics drive operational decisions. The dashboards that matter for agent workloads differ from simple LLM-call dashboards.
| Metric | What it tells you | Healthy range |
|---|---|---|
| P50/P99 trace latency | End-to-end user experience | P50 <5s, P99 <15s (varies by use case) |
| Cost per trace | Spend efficiency per user turn | Track trend, alert on spikes |
| Tool success rate | Reliability of non-LLM steps | >95% per tool |
| LLM calls per trace | Agent efficiency (fewer = cheaper, faster) | Lower is better; alert on creep |
| Trace error rate | Agent failed to produce a valid response | <2% |
| Token cost per trace | Breakdown of where tokens go | Track by span type |
If your average LLM calls per trace creeps from 3.2 to 4.5 over a month, your agent is getting chattier — usually because of fallback logic or a degrading tool that triggers retries. This metric is an early warning signal for both cost and latency regressions.
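The trace-level metrics in the table can be computed from exported trace records with very little code. This is a minimal sketch under assumed inputs: the `TraceRecord` fields, the nearest-rank `percentile` helper, and the four sample traces are all hypothetical, and a real platform would compute these over far more data.

```python
import math
from dataclasses import dataclass


@dataclass
class TraceRecord:
    latency_s: float
    llm_calls: int
    tool_calls: int
    tool_failures: int
    failed: bool = False


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces: list[TraceRecord]) -> dict[str, float]:
    """Aggregate trace-level metrics for a dashboard or alert."""
    latencies = [t.latency_s for t in traces]
    tool_calls = sum(t.tool_calls for t in traces)
    tool_failures = sum(t.tool_failures for t in traces)
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p99_latency_s": percentile(latencies, 99),
        "llm_calls_per_trace": sum(t.llm_calls for t in traces) / len(traces),
        # Treat "no tool calls at all" as fully successful.
        "tool_success_rate": 1 - tool_failures / tool_calls if tool_calls else 1.0,
        "trace_error_rate": sum(t.failed for t in traces) / len(traces),
    }


# Hypothetical sample of four traces.
sample = [
    TraceRecord(3.1, 3, 2, 0),
    TraceRecord(4.0, 3, 2, 0),
    TraceRecord(6.5, 4, 3, 1),
    TraceRecord(14.2, 5, 4, 1, failed=True),
]
metrics = summarize(sample)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Tracking `llm_calls_per_trace` over time, rather than as a single snapshot, is what surfaces the creep described above.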
Connecting traces to evals
Traces tell you what happened; evals tell you whether it was good. The two are most powerful when connected: tag each trace with its eval scores (or run LLM-as-a-Judge asynchronously on sampled traces), and you can slice latency and cost by quality. "The traces that scored below 4/5 on helpfulness also cost 40% more — they were retrying failed tool calls" is the kind of insight that only trace-plus-eval surfaces.
All three major platforms (LangSmith, Langfuse, Phoenix) support attaching eval scores to traces. The pattern is: run the trace in production, asynchronously score the final output with an LLM judge or rubric, and write the score back as a trace attribute. Then your dashboards can filter and aggregate by quality, not just by operational metrics.
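Once scores are written back, slicing cost by quality is a simple group-by. The sketch below assumes trace records have already been exported as dicts with a `helpfulness` score attached; the field names, the threshold of 4, and the sample values are hypothetical and chosen only to illustrate the analysis.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace records after an asynchronous judge wrote scores back.
traces = [
    {"trace_id": "t1", "cost_usd": 0.040, "helpfulness": 5},
    {"trace_id": "t2", "cost_usd": 0.050, "helpfulness": 4},
    {"trace_id": "t3", "cost_usd": 0.070, "helpfulness": 2},
    {"trace_id": "t4", "cost_usd": 0.056, "helpfulness": 3},
]


def mean_cost_by_quality(records, threshold=4):
    """Average cost per trace for low-scoring vs. acceptable traces."""
    buckets = defaultdict(list)
    for record in records:
        bucket = "low" if record["helpfulness"] < threshold else "ok"
        buckets[bucket].append(record["cost_usd"])
    return {name: mean(costs) for name, costs in buckets.items()}


by_quality = mean_cost_by_quality(traces)
premium = by_quality["low"] / by_quality["ok"] - 1
print(f"low-scoring traces cost {premium:.0%} more on average")
```

The same grouping works for latency or LLM calls per trace; the point is that the eval score becomes one more dimension to filter and aggregate on.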
FAQ
What is an LLM agent trace?
A tree of spans capturing every step in an agent's execution — the initial LLM call, tool invocations, retrieval lookups, sub-agent delegations, and final response. Each span records duration, input, output, token cost, and parent-child relationship, letting you localize latency and cost to the responsible step.
Why do agent traces need multi-span structure?
Because one user turn triggers multiple LLM and tool calls in sequence or parallel. A flat log shows disconnected events; a multi-span trace captures the parent-child hierarchy (request → planning → tool calls → synthesis), making it possible to see which step dominated latency or caused a bad outcome.
Where does latency hide in agent pipelines?
Mostly in non-LLM steps: sequential tool calls that could be parallelized (biggest opportunity), slow vector searches, unnecessary re-planning calls, verbose system prompts inflating prefill, and synchronous external APIs without timeouts. Trace analysis typically shows the LLM call is only 30-50% of total latency.
LangSmith vs Langfuse vs OpenTelemetry for agent tracing?
LangSmith is the most polished option for LangChain-native agents, but it is a proprietary SaaS. Langfuse is open-source, framework-agnostic, and self-hostable, which makes it a good fit when vendor independence matters. OpenTelemetry with GenAI semantic conventions integrates with existing observability stacks (Datadog, Honeycomb) and is the natural choice if your team already runs OTel. See our platform comparison.
Related deep dives
- LangSmith vs Langfuse vs Phoenix — which tracing platform fits your stack
- Self-host Langfuse — deployment guide for the open-source tracing platform
- LLM-as-a-Judge — how to score traces for quality and connect evals to observability
- Hallucination Detection in Production — using traces to catch when agents fabricate
- Cost Monitoring Dashboards — per-span cost data rolls up into team-level spend visibility
- Prefix Caching — reduces the prefill latency that traces often surface
Sources
- OpenTelemetry, "GenAI Semantic Conventions," 2024-2025
- Langfuse documentation, "Tracing and Observability for LLM Applications," 2025
- LangSmith documentation, "Agent Tracing and Evaluation," 2025
- Arize AI, "Phoenix: AI Observability and Evaluation," 2025
- Databricks, "Observing Agent Workflows in Production," 2025
- LangChain blog, "Tracing Multi-Agent Systems," 2024-2025
Latency breakdowns are workload-dependent. The "LLM is 30-50% of latency" figure reflects typical tool-using agent pipelines; your mix will vary based on which tools you call and how.