Sampling is the concept of selecting a few elements from a large collection and learning about the entire collection by extrapolating from the selected set. This page covers terminology, reasons for sampling, and different sampling methods and tools, such as head sampling and tail sampling.
It’s important to use consistent terminology when discussing sampling. A trace or span is considered “sampled” or “not sampled”:

- Sampled: the trace or span is processed and exported.
- Not sampled: the trace or span is not processed or exported.
Sometimes, the definitions of these terms get mixed up in conversation or online. You may find someone stating that they are “sampling out data” or that data not processed or exported is considered “sampled”. While the behavior they describe may be the same, these are incorrect terms.
Primary reasons to sample data, all of which are typically related to one another, include controlling the cost of sending and storing telemetry, keeping data volume manageable for your infrastructure, and retaining enough resolution to see the traffic that matters to you.
The biggest question that comes up with sampling is: “How do I make sure I am not missing important data I might need?”. The answer depends on your data. If you do not have a lot of data in the first place, sampling is more trouble than it is worth. If you have a lot of data, but it is fairly uniform or it is not critical that you capture everything that may be interesting to you right now, you can often get away with a simple sampling strategy. If you have a lot of conditions that matter to you, or irregular traffic patterns across your services, you will need a more sophisticated sampling strategy.
Honeycomb offers tools to help you sample your data in a way that is tailored to your needs.
If your service receives more than 1000 requests per second, you should strongly consider sampling. Your sampling strategy should balance cost against the resolution you need to determine your optimal sample rate.
Sampling is an essential part of tracing at a large scale. Consider different kinds of traces:

- Traces that finish successfully with no errors
- Traces that contain an error
- Traces that are unusually slow
- Traces with otherwise rare or interesting attributes
The vast majority of the time, the first kind (traces that finish successfully with no errors) far outnumbers all the other kinds. Traces in the first category are still critical to have because they represent what “healthy” behavior looks like, and you will need them for comparison when you are looking at the other kinds of traces. However, you most likely do not need all of them. A sample of these traces will be enough to understand the overall health of your system.
The other three categories are much more interesting, and depending on your needs, you may want to take larger samples (or even 100%) of these traces.
There are many different techniques for sampling data, all with different tradeoffs. Each technique falls into one of two categories: head sampling and tail sampling.
Head sampling is when you sample traces without looking at the entire trace. The decision to sample or not sample a span in a trace is often made as early as possible. In OpenTelemetry, a head sampling decision is made during span creation; unsampled spans are never even created.
The most common form of head sampling is deterministic probability sampling. Given a constant sampling rate that represents a fixed percentage of traces to sample, the sampler will make a decision to sample or not sample spans based on using the trace ID as a random number. Using the trace ID allows disparate samplers to make consistent decisions for all of the spans in a trace.
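As a rough sketch of the idea (not the exact algorithm any particular SDK uses), a sampler can hash the trace ID into a number and compare it against the sample rate, so every service that sees the same trace ID reaches the same decision:

```python
import hashlib

# Illustrative sketch only; real SDK samplers use their own hashing schemes.
SAMPLE_RATE = 10  # keep roughly 1 in 10 traces

def should_sample(trace_id: str, sample_rate: int = SAMPLE_RATE) -> bool:
    # Derive a stable number from the trace ID so every sampler that
    # sees this trace makes the same keep/drop decision.
    digest = hashlib.sha1(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return value % sample_rate == 0

# All spans in trace "abcd1234" get the same decision in every process.
print(should_sample("abcd1234"))
```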
All of Honeycomb’s SDKs support deterministic probability sampling, and it is also supported by every other OpenTelemetry SDK.
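For example, with the OpenTelemetry Python SDK you can configure a trace-ID-ratio sampler when creating the tracer provider; the 0.1 ratio here (roughly 1 in 10 traces) is just an illustrative value:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio

# Head-sample about 10% of traces; child spans follow their parent's decision.
provider = TracerProvider(sampler=ParentBasedTraceIdRatio(0.1))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("sampling-example")
with tracer.start_as_current_span("handle-request"):
    pass  # spans in unsampled traces are not recorded or exported
```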
Head sampling is a blunt instrument. It is simple to configure and requires no additional infrastructure or operational overhead.
But what head sampling offers in simplicity, it loses in flexibility. Because the decision is made before a trace completes, head sampling cannot, for example:

- Sample based on whether a trace contains an error
- Sample based on the overall latency of a trace
- Sample based on fields, such as an HTTP status code, that are only set later in a trace
To accomplish the above, you need to use tail sampling instead.
Tail sampling is where the decision to sample a trace takes place by considering all or most of the spans within the trace. Honeycomb offers Refinery as a tail sampling solution to install in your environment. Because tail sampling is done by inspecting whole traces, it enables you to apply many different sampling techniques. Some of these techniques include:
For example, a sampler keyed on the `http.status_code` field will sample much less traffic for requests that return `200` than for requests that return `404`.

Tail sampling with Refinery lets you combine all of these techniques in arbitrary ways to create a sampling strategy that is tailored to your needs.
Tail sampling with Refinery lets you sample traces in just about any way you can imagine. How you configure tail sampling depends on your needs and the complexity of your system.
Most people tend to follow some common patterns, such as keying a dynamic sampler on a field like `http.status_code` to sample traces proportionally across all values of that key. The rules and key configuration will often have to take into account attributes that are unique to your system.
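As a loose illustration of that pattern (not how Refinery actually implements its dynamic samplers), a sampler can track how often each key occurs and assign higher sample rates to more common keys:

```python
import random
from collections import Counter

TARGET_RATE = 10   # overall goal: keep roughly 1 in 10 traces
seen = Counter()   # how many traces we have observed per key

def sample_rate_for(status_code: int) -> int:
    # Frequent keys (for example, 200) get a large sample rate (keep few);
    # rare keys (for example, 404) get a rate near 1 (keep most).
    seen[status_code] += 1
    share = seen[status_code] / sum(seen.values())
    return max(1, round(TARGET_RATE * share * len(seen)))

def should_keep(status_code: int) -> tuple[bool, int]:
    rate = sample_rate_for(status_code)
    return random.random() < 1.0 / rate, rate

# Mostly-200 traffic gets sampled heavily; the occasional 404 is mostly kept.
for code in [200] * 95 + [404] * 5:
    keep, rate = should_keep(code)
```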
The flexibility and sophistication of tail sampling comes at a price: it is more effort to configure and requires additional infrastructure and operational overhead to run. For extremely high-volume systems, you may also need to combine head sampling and tail sampling to protect your infrastructure from huge spikes of data.
When you sample your data with our sampling techniques, each span in a trace is given a `SampleRate` attribute that represents N when you only sample 1/N traces. This allows Honeycomb to weight counts to compensate for the fact that you are sampling your data.
Example: You are doing head sampling at a 10% sampling rate, meaning that only 10% of traces are exported to Honeycomb:
| Trace ID | Sample Rate (on each span) | duration_ms |
| --- | --- | --- |
| abcd1234 | 10 | 200 |
| 4321dcba | 10 | 1100 |
In this case, the `SampleRate` attribute is set to `10` because you are sampling 10% of traces, or 1 in 10 traces.
Because Honeycomb has this information, it can calculate accurate values for various aggregations:
- `COUNT` of traces: (2 * 10) = 20
- `AVG(duration_ms)`: ((200 * 10) + (1100 * 10)) / (10 + 10) = 650
In other words, you can send less data and yet still see usefully accurate data in Honeycomb. This is true for SUM and Percentile aggregations as well.
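A minimal sketch of the weighting arithmetic, using the two example traces above (this is not Honeycomb's query engine, just the same math):

```python
# Each stored span stands in for `sample_rate` original spans.
spans = [
    {"trace_id": "abcd1234", "sample_rate": 10, "duration_ms": 200},
    {"trace_id": "4321dcba", "sample_rate": 10, "duration_ms": 1100},
]

weighted_count = sum(s["sample_rate"] for s in spans)                    # 2 * 10 = 20
weighted_sum = sum(s["duration_ms"] * s["sample_rate"] for s in spans)   # 13000
weighted_avg = weighted_sum / weighted_count                             # 650.0

print(weighted_count, weighted_avg)
```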
The math done on the Honeycomb backend, combined with the flexibility of setting the `SampleRate` attribute, means that you can use sampling techniques as simple or sophisticated as you need, and Honeycomb will do the rest provided that the sampler sets the `SampleRate` attribute.
When you use Refinery, this is done automatically for its dynamic samplers.
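If you write your own sampler, the shape of the contract is roughly this (a hypothetical sketch using the OpenTelemetry Python API; exactly where and how your SDK or Refinery attaches the attribute may differ):

```python
from opentelemetry import trace

SAMPLE_RATE = 10  # this sampler keeps 1 in 10 traces

tracer = trace.get_tracer("sampling-example")

with tracer.start_as_current_span("handle-request") as span:
    # Record the rate this span was sampled at so Honeycomb can weight
    # counts by N when only 1/N traces are exported.
    span.set_attribute("SampleRate", SAMPLE_RATE)
```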
COUNT_DISTINCT and Sampling

Use the `COUNT_DISTINCT` operator in Query Builder with care when working with sampled data.
`COUNT_DISTINCT` uses the HyperLogLog algorithm, which is designed to work on an entire population of data. Therefore, it is only possible for `COUNT_DISTINCT` to count items that are actually present in the data. `COUNT_DISTINCT` cannot and does not compensate for sampling rate.
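A small illustration of why this matters (synthetic data, not a Honeycomb query): sampling drops some values entirely, so the observed distinct count is low, and multiplying it by the sample rate overcounts values that appear many times:

```python
import random

random.seed(0)
events = [f"user-{random.randint(1, 500)}" for _ in range(10_000)]   # ~500 distinct users
sampled = [e for e in events if random.random() < 0.1]               # keep about 1 in 10

true_distinct = len(set(events))         # close to 500
observed_distinct = len(set(sampled))    # undercounts: some users never make the sample
naive_weighted = observed_distinct * 10  # badly overcounts users seen more than once

print(true_distinct, observed_distinct, naive_weighted)
```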
However, when using `COUNT_DISTINCT` in a query, you can view the average sample rate for the query. Locate it in the metadata below the result summary table, alongside the elapsed query time and rows examined fields. The average sample rate displays the average sample rate across all underlying events included in the query result.
Until there’s a way to accurately count unique values in sampled sets, use `COUNT_DISTINCT` with caution when working with sampled data.
Use Honeycomb’s Usage Mode to examine data without compensating for sample rates.
Sometimes, you may want to run calculations on only the sampled data with no weighting of counts. You can use the Usage Center’s Usage Mode feature to do this.