If you’re running a user-facing software service, it’s probably a distributed system. You might have a proxy, an application and a database, or a more complicated microservice architecture. Regardless of the level of complexity, a distributed system means that multiple distinct services must work together in concert.
Tracing helps tie together instrumentation from separate services, or from different methods within one service. This makes it easier to identify the source of errors, find performance problems, or understand how data flows through a large system.
A trace tells the story of a complete unit of work in your system.
For example, when a user loads a web page, their request might go to an edge proxy. That proxy talks to a frontend service, which calls out to an authorization and a rate-limiting service. There could be multiple backend services, each with its own data store. Finally, the frontend service returns a result to the client.
Each part of this story is told by a span. A span is a single piece of instrumentation from one location in your code, and it represents a single unit of work done by a service. Each tracing event, one per span, contains several key pieces of data, including fields like parentID and traceID that tie it to the rest of the trace.
A trace is made up of multiple spans. Honeycomb uses the metadata from each span to reconstruct the relationships between them and generate a trace diagram.
The image below is a portion of a trace diagram for an incoming API request:
In this example, the /api/v2/tickets/export endpoint first checks if the request is allowed by the rate limiter. Then it authenticates the requesting user, and finally fetches the requested tickets. Each of those calls also called a datastore.
You can see in the trace diagram the order these operations were executed, which service called which other service, and how long each call took.
Events are the fundamental unit of data in Honeycomb, and this doesn’t change when you use tracing. Spans are just regular Honeycomb events that have fields (like parentID and traceID) that describe their relationship to other spans (events) in the same unit of work (a trace). A trace only exists as a group of spans (events) with the same traceID.
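To make the event model concrete, here is a minimal sketch in Go. The field names and values are illustrative (real integrations vary), but the core idea is exactly as described above: a trace is nothing more than the group of events that share one traceID.

```go
package main

import "fmt"

// event is a sketch of a Honeycomb event that doubles as a span: the
// parentID and traceID fields relate it to other spans. Field names
// here are illustrative, not a fixed schema.
type event struct {
	name     string
	spanID   string
	parentID string // empty for the root span
	traceID  string
}

// groupByTrace reconstructs traces the way the text describes: a trace
// is simply the group of events sharing the same traceID.
func groupByTrace(events []event) map[string][]event {
	traces := make(map[string][]event)
	for _, e := range events {
		traces[e.traceID] = append(traces[e.traceID], e)
	}
	return traces
}

func main() {
	events := []event{
		{name: "fetch_tickets", spanID: "s1", parentID: "", traceID: "abc123"},
		{name: "db_query", spanID: "s2", parentID: "s1", traceID: "abc123"},
		{name: "unrelated", spanID: "s9", parentID: "", traceID: "zzz999"},
	}
	traces := groupByTrace(events)
	fmt.Println(len(traces["abc123"])) // prints "2": both spans belong to the same trace
}
```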
Sampling your data is a great way to increase your retention and get data volume to a manageable size.
Many folks are curious about how sampling works with tracing, given that simply sampling 1/N requests at random will not guarantee that you retain all of the spans for a given trace. The story of how sampling and tracing fit together with Honeycomb is still evolving, but here are some thoughts on how to approach it.
Traditionally, the way traces are sampled is head-based sampling: when the root span is being processed, a random sampling decision is made (e.g., if randint(10) == 0, the span will be sampled). If the root span is chosen for sampling, it gets sent and propagates that decision to its descendant spans, usually via an HTTP header (something like X-B3-Sampled: 1), and they follow suit. That way, all the spans for a particular trace are preserved. Our integrations do not support head-based sampling out of the box today, but you could implement such a system yourself.
Some of our integrations do support what we call deterministic sampling. In deterministic sampling, a hash is computed from a specific field in the event/span, such as the request ID, and the decision to sample is based on that hash and the intended sample rate. An approximately correct number of traces is selected, and the decision whether or not to sample a given trace does not need to be propagated around: every actor computes the same answer, so full traces are sampled whether the actors can communicate or not.
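The core of deterministic sampling fits in a few lines. This Go sketch hashes the request ID with FNV-1a; real integrations may use a different hash, but the property that matters is the same: identical inputs always produce identical decisions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample makes a deterministic sampling decision: hash a field
// that is the same for every span in a trace (here, the request ID) and
// keep roughly 1-in-rate traces. Every service computes the same answer
// for the same ID, so whole traces are kept or dropped together with no
// coordination. (Illustrative sketch; hash choice varies by SDK.)
func shouldSample(requestID string, rate uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(requestID))
	return h.Sum32()%rate == 0
}

func main() {
	for _, id := range []string{"req-1", "req-2", "req-3", "req-4"} {
		fmt.Println(id, shouldSample(id, 4))
	}
}
```

Any two services given the same request ID and rate will agree, which is exactly why the decision never needs to travel between them.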
There is another option: tail-based sampling, where the sampling decision is made only once the full trace information has been gathered. This ensures that if an error or slowness happens deep in the tree of service calls, the full set of events for that trace is more likely to be sampled in. To use this method, all spans must be buffered somewhere until the trace is complete.
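A toy version of that buffer-then-decide flow might look like the following Go sketch. The types and thresholds are invented for illustration; a production tail sampler also has to handle trace-completion detection and memory limits, which are omitted here.

```go
package main

import "fmt"

// span is a minimal stand-in for a tracing event.
type span struct {
	traceID string
	durMs   int
	err     bool
}

// tailSampler buffers all spans for each trace, then decides once the
// trace is complete: keep traces containing an error or a slow span,
// drop the rest. (Hypothetical sketch, not a production buffer.)
type tailSampler struct {
	buf map[string][]span
}

func newTailSampler() *tailSampler {
	return &tailSampler{buf: make(map[string][]span)}
}

func (t *tailSampler) add(s span) {
	t.buf[s.traceID] = append(t.buf[s.traceID], s)
}

// finish is called when the trace is known to be complete; it returns
// the buffered spans if the trace should be kept, or nil to drop it.
func (t *tailSampler) finish(traceID string, slowMs int) []span {
	spans := t.buf[traceID]
	delete(t.buf, traceID)
	for _, s := range spans {
		if s.err || s.durMs > slowMs {
			return spans // keep the full trace, not just the bad span
		}
	}
	return nil
}

func main() {
	ts := newTailSampler()
	ts.add(span{traceID: "t1", durMs: 12})
	ts.add(span{traceID: "t1", durMs: 900}) // one slow span deep in the tree
	fmt.Println("kept:", len(ts.finish("t1", 500))) // prints "kept: 2"
}
```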
There are many ways to create tracing data. The following is a comparison of two popular implementations: OpenCensus and OpenTracing. In the following examples, both were set up to send data to Honeycomb and to Zipkin.
OpenCensus: A vendor-agnostic tool that provides metrics collection and tracing for your services.
OpenTracing: A vendor-neutral standard for distributed tracing data.
Zipkin: A distributed tracing system. For any Zipkin implementation, you need to have a Zipkin server running to receive the data.
All of the tracing instrumentation APIs produce the same trace data. Here is an example of the output in Honeycomb:
Zipkin and Honeycomb have pretty similar waterfall diagrams. Instrumenting them both with OpenCensus was incredibly easy. In fact, switching from one to the other is just a matter of changing the initial configuration. Below, you can see a line-for-line comparison of the differences between the two setups:
We ran the same example app for both Zipkin and Honeycomb, but with different Docker processes running. The resulting output is the same as with OpenCensus.
Below is a comparison of instrumenting traces with OpenCensus versus OpenTracing:
OpenCensus’s StartSpan and OpenTracing’s StartSpanFromContext functions take a context as the first parameter. Under the hood, they look into the context to see if there is an existing span there. If not, a new root span is added to the context. If there is an existing span, a new child span is added, a bit like nesting dolls. In the example above, the root (or parent, as it’s sometimes called) span is added in the first line of readEvaluateProcess, and then child and sibling spans are added in processLine. Child spans can have their own child spans, creating very deep nesting, but in this simple example the root span has two sibling child spans.
The instrumentation of OpenTracing and OpenCensus is very similar, and either implementation can send data to Honeycomb.
Dive into the documentation below to get started tracing your own services: