If you’re running a user-facing software service, it’s probably a distributed system. You might have a proxy, an application and a database, or a more complicated microservice architecture. Regardless of the level of complexity, a distributed system means that multiple distinct services must work together in concert.
Tracing helps tie together instrumentation from separate services, or from different methods within one service. This makes it easier to identify the source of errors, find performance problems, or understand how data flows through a large system.
At its core, tracing data works on a hierarchical call graph: it links together events that call each other.
Trace datasets are exactly the same as any other dataset in Honeycomb. We treat each span as an individual event. Therefore, you can group, filter and aggregate by any field on the span.
This model has a subtlety: an aggregate function, like SUM, works across all spans in a dataset. This is almost never the expected behavior: a COUNT across all events counts all spans of all traces, and a P95(duration_ms) would produce the 95th percentile across all spans, including trace roots.
Querying tracing data effectively requires some practice. Let’s look at some “recipes” for accessing certain types of data in a tracing dataset.
To analyze only root spans, query using:
WHERE trace.parent_id does-not-exist
Root spans do not have a parent defined. Thus, a P95(duration_ms) with this WHERE clause would show the 95th percentile latency for only root service calls, not for all spans. Likewise, a MAX(duration_ms) would show you the slowest of all traces per time-bucket.
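As a rough sketch of how this filter changes an aggregate, here is the recipe in plain Python over a handful of invented spans. The trace.* field names follow the dataset's columns; every value is made up for illustration:

```python
# Toy span dataset: each dict is one event/span. Field names mirror the
# dataset's trace.* columns; trace IDs and durations are invented.
spans = [
    {"trace.trace_id": "t1", "trace.span_id": "a", "trace.parent_id": None, "duration_ms": 120.0},
    {"trace.trace_id": "t1", "trace.span_id": "b", "trace.parent_id": "a",  "duration_ms": 40.0},
    {"trace.trace_id": "t2", "trace.span_id": "c", "trace.parent_id": None, "duration_ms": 300.0},
    {"trace.trace_id": "t2", "trace.span_id": "d", "trace.parent_id": "c",  "duration_ms": 250.0},
    {"trace.trace_id": "t2", "trace.span_id": "e", "trace.parent_id": "c",  "duration_ms": 10.0},
]

# WHERE trace.parent_id does-not-exist: keep only root spans.
roots = [s for s in spans if s["trace.parent_id"] is None]

# MAX(duration_ms) over roots is the slowest whole trace...
slowest_trace = max(s["duration_ms"] for s in roots)

# ...whereas a COUNT over the unfiltered dataset counts every span.
print(len(roots), len(spans))   # 2 root spans (= 2 traces) vs. 5 spans
```

A real query engine computes this per time-bucket; the sketch collapses everything into a single bucket for clarity.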
Because Honeycomb supports high cardinality data, you can even group by very distinct event fields like the ID of the associated trace.
Hence you could:
GROUP BY trace.trace_id
to gain a view into the system showing the slowest spans per-trace.
Usually the slowest spans will be synonymous with the root span (because all other spans are nested within it), but asynchronous root spans are also possible.
Note: It is technically possible that, if a trace’s spans do not all fall within the same computed “time bucket” used by the Honeycomb aggregates accessed by
VISUALIZE, the query will return invalid results for this grouping.
However, this is uncommon, as most traces complete within a smaller window than
the auto-computed bucket interval.
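The per-trace grouping recipe can be sketched the same way. This is plain Python over invented data, standing in for GROUP BY trace.trace_id with a MAX(duration_ms) aggregate:

```python
# Invented spans; sketches GROUP BY trace.trace_id with MAX(duration_ms).
spans = [
    {"trace.trace_id": "t1", "duration_ms": 120.0},
    {"trace.trace_id": "t1", "duration_ms": 40.0},
    {"trace.trace_id": "t2", "duration_ms": 300.0},
    {"trace.trace_id": "t2", "duration_ms": 250.0},
]

# One result row per distinct trace.trace_id, holding the slowest span.
slowest_span_per_trace = {}
for s in spans:
    tid = s["trace.trace_id"]
    cur = slowest_span_per_trace.get(tid, 0.0)
    slowest_span_per_trace[tid] = max(cur, s["duration_ms"])

print(slowest_span_per_trace)   # {'t1': 120.0, 't2': 300.0}
```

High cardinality is exactly why this works: each distinct trace ID becomes its own group.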
Similar to the above example, one could:
GROUP BY trace.trace_id
ORDER BY COUNT_DISTINCT(trace.span_id) DESC
to obtain a list of groupings where span count is the aggregate calculation.
ORDER will ensure that the traces with the most distinct span IDs end up
at the top of the table view.
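The same idea, sketched in plain Python with invented data: count distinct span IDs per trace, then order descending, mirroring the COUNT_DISTINCT ordering above:

```python
from collections import defaultdict

# Invented spans; sketches GROUP BY trace.trace_id
# ORDER BY COUNT_DISTINCT(trace.span_id) DESC.
spans = [
    {"trace.trace_id": "t1", "trace.span_id": "a"},
    {"trace.trace_id": "t1", "trace.span_id": "b"},
    {"trace.trace_id": "t2", "trace.span_id": "c"},
    {"trace.trace_id": "t2", "trace.span_id": "d"},
    {"trace.trace_id": "t2", "trace.span_id": "e"},
]

# Collect the distinct span IDs seen in each trace.
distinct = defaultdict(set)
for s in spans:
    distinct[s["trace.trace_id"]].add(s["trace.span_id"])

# Traces with the most distinct span IDs come first.
ordered = sorted(distinct.items(), key=lambda kv: len(kv[1]), reverse=True)
print(ordered[0][0], len(ordered[0][1]))   # t2 3
```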
Not all spans in a tracing dataset will contain the exact same fields. Some will be defined and some will not be defined, especially for user-defined custom fields. It is sometimes important to filter out the events where these fields are not defined, or else you will end up with large undefined groups that dominate the presentation of the data.
Especially when using a
GROUP BY, you will often want a
WHERE condition that filters these undefined values out. For example:
WHERE app.custom_field exists
GROUP BY app.custom_field
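In plain Python, the effect of that exists filter looks like this (the events are invented; app.custom_field stands in for any user-defined field):

```python
# Invented events; one of them does not define app.custom_field.
events = [
    {"app.custom_field": "checkout", "duration_ms": 12.0},
    {"duration_ms": 7.0},                       # field undefined on this span
    {"app.custom_field": "search",   "duration_ms": 3.0},
]

# WHERE app.custom_field exists
defined = [e for e in events if "app.custom_field" in e]

# GROUP BY app.custom_field -- no catch-all "undefined" group appears.
groups = {}
for e in defined:
    groups.setdefault(e["app.custom_field"], []).append(e)

print(sorted(groups))   # ['checkout', 'search']
```

Without the filter, the second event would land in a large undefined group that dominates the results.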
Sometimes you might be interested to know not only the aggregate statistics for how individual spans are performing, but how a given service is performing in general. You can see how much time is being spent in individual services using a query like the following:
VISUALIZE SUM(duration_ms)
GROUP BY service_name
This will show the sum of latency across all spans per service.
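A minimal Python sketch of that aggregation, over invented spans (the service names are placeholders):

```python
# Invented spans; sketches SUM(duration_ms) grouped by service_name.
spans = [
    {"service_name": "api",   "duration_ms": 20.0},
    {"service_name": "mysql", "duration_ms": 55.0},
    {"service_name": "api",   "duration_ms": 5.0},
]

# Accumulate total duration per service.
totals = {}
for s in spans:
    totals[s["service_name"]] = totals.get(s["service_name"], 0.0) + s["duration_ms"]

print(totals)   # {'api': 25.0, 'mysql': 55.0}
```

Note that summed span durations can exceed wall-clock time when spans run concurrently, so read this as total time spent in a service, not elapsed time.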
To view traces associated with your query, select the “Traces” tab. This will show you the slowest traces that match the query, calculated by root span duration.
The “Traces” tab will show a maximum of 10 traces, ordered from slowest to fastest. The span summary shows the distribution of spans.
If you want to see specific traces that correspond to a feature on the graph, such as a latency spike, you can select a region on the graph, and the list of trace summaries will update:
Click on the trace button, or one of the
trace.trace_id links, to see the full waterfall diagram for that trace.
You can get into traces in several different ways.
Any view that can show a
trace.trace_id will show a button that connects into the trace. This includes the Raw Data view or the Results view from a query. Adding a
trace.trace_id to a query will find traces.
In addition, clicking on either the line graph or the heatmap will jump directly to a trace that corresponds to that point. Note that this will pick one trace that sits at that time. To find a specific trace, the Traces tab, Raw Data tab, or the Query Results will be more effective.
Note that, given a specific
TraceId, it is also possible to construct a URL that links directly to the waterfall for that trace.
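The document does not spell out that URL format, and the sketch below does not either: the base path and the query parameter name are hypothetical placeholders, used only to illustrate that such a link can be assembled programmatically from a TraceId. Inspect a real waterfall URL in your own environment for the actual shape:

```python
from urllib.parse import urlencode

def trace_url(base: str, trace_id: str) -> str:
    # Hypothetical URL shape -- "/trace" and "trace_id" are assumptions,
    # not the documented format.
    return f"{base}/trace?{urlencode({'trace_id': trace_id})}"

url = trace_url("https://ui.example.com/myteam/datasets/prod", "abc123")
print(url)
```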
The first row is the root span, the parent of the entire trace. Below it are each of the spans in this trace. The bars on the right display how much time each operation took to execute. This type of graph is also called a waterfall diagram, as it shows how the order of operations cascade across the total execution time.
Note: Just as query results in Honeycomb are permalinks that you’ll always be able to view even if the underlying data has aged out, the Waterfall View is also a query result and its URL is a permalink. Once you view a particular trace waterfall, you’ll always be able to load it at the same URL and see its span data, although you won’t be able to run additional queries about that time period (e.g. from the trace sidebar) if the underlying data has aged out.
To see more detail about an individual span, click on a row to view that span’s raw data in the right sidebar:
To display additional columns, open the column selection menu by clicking the plus icon in the header. Check the boxes for the columns you’d like to include. To hide a column, click on the column name and select “Hide column” from the menu. To collapse or expand a portion of the trace, click on the arrow on the left side of a span row.
When a trace contains many spans, it can be useful to collapse ones that aren’t of interest. Clicking the box on a parent span collapses that span’s children. To show they are collapsed, the box changes from outlined to filled, and the number it displays changes from the count of direct children to the total size of the subtree.
In this screenshot, the first
FetchPartial is not collapsed, but the second one is, showing that it has 7 children. Below, both instances of
merging are collapsed; one has 64 spans below it, while the other has 44 spans.
In a particularly large trace, it can be convenient to collapse all spans that share a name, or that are at the same level of indent. The expand/collapse menu, accessed from the triple-dots to the left of the trace, controls this.
You can show and hide additional columns with the trace field selector. By default, the tracing UI shows the
service_name. Changing this can highlight errors or help scan for interesting columns.
By default, rows are colored by the service name if there is more than one service name defined; otherwise, they are colored by
name. Changing this can help color the waterfall to emphasize different attributes. Each distinct value gets its own color.
It can be useful to search across the entirety of a trace. For example, you might know that a full trace has a span with an error field somewhere within it. The “Search” box, located immediately above the trace, searches both span names and values for a given string. It also highlights any matching spans in the span sidebar.
The span sidebar also has a filter; this will filter the sidebar (only) to show only those fields that match the filter. This setting remains as you click between spans; it can be useful to jump between a set of spans to track a particular field.
When collapsing the tree, it can be difficult to see the extent of the child spans of a parent. For example, an async process might have a parent that returns in a few milliseconds – but it starts a database query that runs for several minutes. Honeycomb draws a thin line to the right of the parent span to show the duration of the child process. For example, in this figure, the span with
head.startQuery runs 2.533 ms – but it has children that run much longer. (The three subprocesses called
FetchPartial all run over 2.5 seconds).
Clock skew, temporal anomalies, or time travel can cause a child to start before the parent span. Honeycomb renders these with similar lines to the left of the main body.
Sometimes, a trace has some spans that claim to have a
trace.parent_id – but no span claims to have that
trace.span_id. Honeycomb renders a missing span in that slot: a placeholder that shows the span with all the children that point to that parent. Honeycomb also expects exactly one span to have no parent: the root span. If no span meets that condition, the root itself is treated as missing and drawn accordingly.
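Detecting these gaps is a simple set computation. Here is a sketch in plain Python over invented spans: collect every span ID, then look for parent IDs that no span claims:

```python
# Invented spans for one trace; "ghost" is a parent that never arrived.
spans = [
    {"trace.span_id": "a", "trace.parent_id": None},     # root
    {"trace.span_id": "b", "trace.parent_id": "a"},
    {"trace.span_id": "c", "trace.parent_id": "ghost"},  # orphaned child
]

known = {s["trace.span_id"] for s in spans}

# Parent IDs referenced by some span but owned by none: missing spans.
missing_parents = {
    s["trace.parent_id"]
    for s in spans
    if s["trace.parent_id"] is not None and s["trace.parent_id"] not in known
}

# If no span lacks a parent, the root itself is missing.
has_root = any(s["trace.parent_id"] is None for s in spans)

print(missing_parents, has_root)   # {'ghost'} True
```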
The minigraph, at the top of the trace view, shows a heatmap view of the selected span relative to others with the same
trace.systemName. Below, the trace shows that this is a long instance of
run_trigger_summary, but not an unusually long one. The X axis runs from several minutes before this trace to several minutes afterward.
Clicking on the minigraph re-enters the query builder with the query that corresponds to it.
Any value, on any span in the trace, can be an entry point back to the query builder. The options in the drop down, including adding to
GROUP BY, will create a new query that re-uses the original query but adds these fields.
Note that the new query may not return results. If the fields you choose are only found on leaves, but the original query has
trace.parent_id does-not-exist, then the new query will be self-contradictory, and will not return any results. Remove any unneeded clauses to get a full result.
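To make the contradiction concrete, here is a plain-Python sketch with invented spans, where a custom field appears only on a leaf span while the filter keeps only roots:

```python
# Invented spans: the root carries no custom field; only a leaf does.
spans = [
    {"trace.parent_id": None, "duration_ms": 100.0},          # root
    {"trace.parent_id": "a",  "app.custom_field": "x",
     "duration_ms": 10.0},                                    # leaf with the field
]

# WHERE trace.parent_id does-not-exist AND app.custom_field exists
results = [
    s for s in spans
    if s["trace.parent_id"] is None and "app.custom_field" in s
]

print(results)   # [] -- the two clauses match disjoint sets of spans
```

Dropping either clause makes the query return rows again.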
Here is an example of investigating a problem using tracing:
A few users are reporting slow API calls. The initial investigation shows one endpoint is periodically slow. One user is making a lot of calls, and that is slowing down responses for everyone.
Selecting a representative trace shows the full execution of the API request, helping us understand why these API requests are slow.
The first part seems normal, but
ticketBackend is making more calls than
mysql. Clicking on one of the
mysql spans shows more information
about it in the sidebar, including the specific database query.
The information from this trace points to a possible next step. Is the user running a script gone wrong? Or maybe this endpoint needs tighter rate limiting. Traces show a level of detail that can be hard to see otherwise.
Tracing can help shine a light into background/batch jobs as well! Check out this blog post for some examples.