Explore Distributed Trace Data

If you are running a user-facing software service, it is probably a distributed system. You might have a proxy, an application, and a database, or a more complicated microservice architecture. Regardless of the level of complexity, a distributed system means that multiple distinct services must work in concert.

Tracing helps tie together instrumentation from separate services, or from different methods within one service. This makes it easier to identify the source of errors, find performance problems, or understand how data flows through a large system.

At its core, tracing data describes a hierarchical call graph: it links together events that call each other.

This section discusses how to explore tracing data. To get data in, learn how to send tracing data to Honeycomb. If you encounter missing spans or other issues, refer to our troubleshooting guide. Our tracing Sandbox demo shows how to use Tracing and BubbleUp together to explore complex problems.

Querying Traces in the Query Builder 

Traces in a dataset are stored exactly like any other data in Honeycomb: we treat each span as an individual event. Therefore, you can group, filter, and aggregate by any field on a span.

This model has a subtlety: a query with an aggregate function, like COUNT or SUM, runs across all spans in a dataset. This is almost never what you expect: a COUNT across all events counts every span of every trace, and a P95(duration_ms) produces the 95th percentile across every span, not just trace roots.

Querying tracing data effectively requires some practice. Let us look at some “recipes” for accessing certain types of data in a tracing dataset.

Analysis at a High Level 

To analyze only root spans, query using:

  • WHERE trace.parent_id does-not-exist

Root spans do not have a parent defined. Thus, a VISUALIZE P95(duration_ms) with this WHERE set would show the 95th percentile latency for only root service calls, not for all spans. Likewise, a MAX(duration_ms) would show you the slowest of all traces per time-bucket.
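
If your Honeycomb plan includes the Query Data API, the same recipe can be expressed programmatically. The sketch below is a minimal, unofficial example: the dataset slug is a placeholder, and the query-spec field names should be verified against the Query Data API documentation before use.

```python
import os

import requests

# Minimal sketch, assuming the Query Data API's JSON query spec.
# DATASET is a placeholder slug; verify field names against the docs.
API_KEY = os.environ["HONEYCOMB_API_KEY"]
DATASET = "my-service"

root_span_latency = {
    "time_range": 7200,  # trailing two hours, in seconds
    "calculations": [{"op": "P95", "column": "duration_ms"}],
    # Root spans only: they have no parent defined.
    "filters": [{"column": "trace.parent_id", "op": "does-not-exist"}],
}

resp = requests.post(
    f"https://api.honeycomb.io/1/queries/{DATASET}",
    headers={"X-Honeycomb-Team": API_KEY},
    json=root_span_latency,
)
resp.raise_for_status()
print(resp.json()["id"])  # query ID, used to request query results
```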

Latency for Every Individual Trace 

Because Honeycomb supports high-cardinality data, you can group by fields with many distinct values, such as the ID of the associated trace.

Hence you could:

  • VISUALIZE MAX(duration_ms)
  • GROUP BY trace.trace_id

to gain a view into the system showing the slowest span per trace.

Grouping by trace ID to find the slowest traces
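
The same view, written as a query spec under the same Query Data API assumptions as the earlier sketch, with a descending order so the slowest traces appear first:

```python
# Sketch: slowest span per trace. Ordering descending on the
# aggregate puts the slowest traces at the top of the results.
slowest_per_trace = {
    "calculations": [{"op": "MAX", "column": "duration_ms"}],
    "breakdowns": ["trace.trace_id"],  # GROUP BY trace.trace_id
    "orders": [
        {"op": "MAX", "column": "duration_ms", "order": "descending"},
    ],
}
```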

Usually the slowest span will be the root span (because all other spans are nested within it), but asynchronous spans that outlast the root are also possible.

Technically, if a trace's spans do not all fall within the same time bucket used by the VISUALIZE aggregates, the query can return misleading results for that grouping. However, this is uncommon, as most traces complete within a window smaller than the auto-computed bucket interval.

Span Count Per Trace 

Similar to the above example, one could:

  • VISUALIZE COUNT_DISTINCT(trace.span_id)
  • GROUP BY trace.trace_id
  • ORDER BY COUNT_DISTINCT(trace.span_id) DESC

to obtain a list of groupings where span count is the aggregate calculation.

Counting the number of distinct spans per trace

The ORDER BY ensures that the traces with the most distinct span IDs end up at the top of the table view.
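
As a query spec (same assumptions as the earlier sketches), this recipe looks like:

```python
# Sketch: span count per trace, largest traces first.
spans_per_trace = {
    "calculations": [{"op": "COUNT_DISTINCT", "column": "trace.span_id"}],
    "breakdowns": ["trace.trace_id"],
    "orders": [
        {"op": "COUNT_DISTINCT", "column": "trace.span_id",
         "order": "descending"},
    ],
}
```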

Analyze Spans With a Particular Field 

Not all spans in a tracing dataset will contain the exact same fields. Some fields will be defined on some spans and not on others, especially user-defined custom fields. It is sometimes important to filter out the events where these fields are not defined; otherwise, you will end up with large undefined groups that dominate the presentation of the data.

Especially when using a GROUP BY, you will often want to set this as a WHERE condition. For example:

  • VISUALIZE COUNT
  • WHERE app.custom_field exists
  • GROUP BY app.custom_field

Querying only spans that have a particular field
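
In query-spec form (same assumptions as above; app.custom_field stands in for whatever field your own instrumentation adds):

```python
# Sketch: count only spans where the custom field is present,
# grouped by its value. Note that COUNT takes no column.
custom_field_counts = {
    "calculations": [{"op": "COUNT"}],
    "filters": [{"column": "app.custom_field", "op": "exists"}],
    "breakdowns": ["app.custom_field"],
}
```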

Determine Aggregate Time Spent in a Service 

Sometimes you want to know not just the aggregate statistics for individual spans, but how a given service is performing overall. You can see how much time is being spent in each service using an environment query like the following:

  • VISUALIZE SUM(duration_ms)
  • GROUP BY service_name

Querying total time spent in a service

This will show the sum of latency across all spans per service.
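
The equivalent sketch as a query spec, under the same assumptions:

```python
# Sketch: total time spent per service, summed across all spans.
time_per_service = {
    "calculations": [{"op": "SUM", "column": "duration_ms"}],
    "breakdowns": ["service_name"],
}
```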

Using the Traces Tab 

To view traces associated with your query, select the “Traces” tab. This will show you the slowest traces that match the query, calculated by root span duration.

trace summary table

The “Traces” tab will show a maximum of 10 traces, ordered from slowest to fastest. The span summary shows the distribution of spans.

If you want to see specific traces that correspond to a feature on the graph, such as a latency spike, you can select a region on the graph, and the list of trace summaries will update:

zoomed graph selection

Click the trace button, or one of the trace.trace_id links, to see the full waterfall diagram for that trace.

Other Entry Points to Traces 

You can get into traces in several different ways.

Any view that can show a trace.trace_id will show a button that links to the trace. This includes the Raw Data view and the Results view of a query. Adding a GROUP BY trace.trace_id to a query is one way to surface traces.

raw data mode table with clickable trace id

In addition, clicking on either the line graph or the heatmap will jump directly to a trace that corresponds to that point. Note that this picks one trace occurring at that time. To find a specific trace, the Traces tab, Raw Data tab, or the Query Results will be more effective.

Click line to trace

Click heatmap to trace

Note that, given a specific trace ID, it is also possible to construct a URL that links directly to the waterfall for that trace.
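
For illustration, such a link can be assembled from the trace ID and a time window around the trace. The URL shape below is a hypothetical sketch modeled on links copied from the UI; the team and environment slugs are placeholders, and you should verify the path and parameter names against a link from your own account.

```python
from urllib.parse import urlencode

# Hypothetical sketch: path segments and parameter names mirror links
# copied from the Honeycomb UI; verify against your own trace URLs.
TEAM = "my-team"            # placeholder team slug
ENVIRONMENT = "production"  # placeholder environment slug

def trace_url(trace_id: str, start_ts: int, end_ts: int) -> str:
    """Build a direct link to the waterfall for one trace.

    start_ts and end_ts are Unix timestamps bounding the trace.
    """
    params = urlencode({
        "trace_id": trace_id,
        "trace_start_ts": start_ts,
        "trace_end_ts": end_ts,
    })
    return (f"https://ui.honeycomb.io/{TEAM}/environments/"
            f"{ENVIRONMENT}/trace?{params}")

print(trace_url("a1b2c3d4e5f6", 1700000000, 1700000600))
```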

Using the Waterfall View 

The first row is the root span, the parent of the entire trace. Below it are the other spans in this trace. The bars on the right display how much time each operation took to execute. This type of graph is also called a waterfall diagram, as it shows how the order of operations cascades across the total execution time.

Just as query results in Honeycomb are permalinks that you will always be able to view even if the underlying data has aged out, the Waterfall View is also a query result and its URL is a permalink. Once you view a particular trace waterfall, you will always be able to load it at the same URL and see its span data, although you will not be able to run additional queries about that time period (from the trace sidebar, for example) if the underlying data has aged out.

In Honeycomb Classic, only events in your current dataset are available and visible in your waterfall views. Learn more about Honeycomb Classic.

Trace diagram

To see more detail about an individual span, click on a row to view that span’s raw data in the right sidebar:

Trace sidebar

The trace waterfall view supports keyboard navigation. Type ? to see a list of supported actions.

Collapsing Rows 

When a trace contains many spans, it can be useful to collapse the ones that are not of interest. Clicking the numbered box next to a parent span collapses its children. To show they are collapsed, the box changes from outlined to filled, and the number changes from the count of direct children to the total size of the subtree.

In this screenshot, the first FetchPartial is not collapsed, but the second one is, showing that it has 7 children. Below, both instances of merging are collapsed; one has 64 spans below it, while the other has 44 spans.

Collapsed spans

In a particularly large trace, it can be convenient to collapse all spans that share a name, or that are at the same level of indent. The expand/collapse menu, accessed from the triple-dots to the left of the trace, controls this.

Expand collapse menu

Error Highlighting 

The count of spans containing errors is displayed above the trace waterfall, and by default, spans that contain errors will be displayed in red. The error field will also be highlighted at the top of the field list in the Span Details sidebar. Error highlighting can be toggled on and off from the Fields menu. Span error detection is based on the Error field configured in the Dataset Definition.

Error highlighting

Showing, Hiding, Coloring by Columns 

You can show and hide additional columns with the trace field selector. By default, the tracing UI shows the name and service.name. Changing this can highlight errors or help you scan for interesting columns.

Field selector

By default, rows are colored by the service name if there is more than one service name defined; otherwise, they are colored by name. Changing this can help color the waterfall to emphasize different attributes. Each distinct value gets its own color.

Searching The Trace, Filtering Spans 

It can be useful to search across the entirety of a trace. For example, you might know that a full trace has a span with an error field somewhere within it. The “Search” box, located immediately above the trace, searches both span names and values for a given string. It also highlights any matching spans in the span sidebar.

Search a trace

The span sidebar also has a filter; this filters the sidebar (only) to show just those fields that match. The setting persists as you click between spans, which makes it useful for jumping between a set of spans to track a particular field.

Missing And Async Spans 

When collapsing the tree, it can be difficult to see the extent of the child spans of a parent. For example, an async process might have a parent that returns in a few milliseconds – but it starts a database query that runs for several minutes. Honeycomb draws a thin line to the right of the parent span to show the duration of the child process. For example, in this figure, the span with name head.startQuery runs 2.533 ms – but it has children that run much longer. (The three subprocesses called FetchPartial all run over 2.5 seconds).

Async spans

Clock skew, temporal anomalies, or time travel can cause a child to start before the parent span. Honeycomb renders these with similar lines to the left of the main body.

Sometimes, a trace has spans that claim to have a trace.parent_id, but no span has that trace.span_id. Honeycomb renders a missing span in that slot: a placeholder shown with all the children that point to that parent. Honeycomb also expects exactly one span to have no parent: the root span. If no such span exists, the root is understood to be missing and is drawn as a missing span.

Missing spans

The Minigraph 

The minigraph, at the top of the trace view, shows a heatmap view of the selected span relative to others with the same name and trace.systemName. In the example below, the minigraph shows that this is a long instance of run_trigger_summary, but not an unusually long one. The X axis runs from several minutes before this trace to several minutes afterward.

Trace minimap

Clicking on the minigraph re-enters the query builder with the query that corresponds to it.

Returning to The Query Builder 

Any value, on any span in the trace, can be an entry point back to the query builder. The options in the drop-down, including adding to WHERE or GROUP BY, create a new query that reuses the original query but adds these fields.

Trace click to query

Note that the new query may not return results. If the fields you choose are only found on leaves, but the original query has trace.parent_id does-not-exist, then the new query will be self-contradictory, and will not return any results. Remove any unneeded clauses to get a full result.

Example: Slow API Calls 

Here is an example of investigating a problem using tracing:

A few users are reporting slow API calls. The initial investigation shows one endpoint is periodically slow. One user is making a lot of calls, and that is slowing down responses for everyone.

Query results showing one user making lots of requests to /api/v2/tickets/export

Selecting a representative trace shows the full execution of the API request, helping us understand why these API requests are slow.

Example slow trace

The first part seems normal, but ticketBackend is making more calls than expected to mysql. Clicking on one of the mysql spans shows more information about it in the sidebar, including the specific database query.

Trace showing ticketBackend calling mysql many times sequentially

The information from this trace points to a possible next step. Is the user running a script gone wrong? Or maybe this endpoint needs tighter rate limiting. Traces show a level of detail that can be hard to see otherwise.

Working With Very Large Traces 

Honeycomb displays traces with a maximum of 64,000 spans. If you load a larger trace into the trace viewer, Honeycomb will display the 64,000 spans closest to the root, chosen using a breadth-first search, along with a warning message; the remaining spans are omitted. In addition, Honeycomb will display placeholder spans showing where the omitted spans would go.

To work with very large traces, use the CONCURRENCY aggregate to visualize the structure of the trace, as seen in the Trace Summary example.
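
As a sketch (same query-spec assumptions as the earlier examples; the trace ID is a placeholder), a CONCURRENCY query scoped to one trace might look like:

```python
# Sketch: visualize the shape of one very large trace without
# loading every span into the waterfall. CONCURRENCY takes no column.
trace_shape = {
    "calculations": [{"op": "CONCURRENCY"}],
    "filters": [
        {"column": "trace.trace_id", "op": "=", "value": "a1b2c3d4e5f6"},
    ],
    "breakdowns": ["service_name"],
}
```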

Do More With Tracing 

Tracing can help shine a light into background/batch jobs as well! Check out this blog post for some examples.
