When to Sample

We recommend the following guidelines for sampling your data.

Why Sample

The primary reasons to sample data, which are typically related to one another, include:

  • Reduce total data volume. A representative sample of your data will typically be much smaller than the entire volume of data produced.
  • Ensure you sample interesting traces. The question of representativeness can be nuanced if you have a wide variety of traffic, especially if it is irregular.
  • Filter out noise. For services with predictable traffic patterns, a small sample can often be enough to capture representative behavior of your services.

The biggest question that comes up with sampling is: “How do I make sure I am not missing important data I might need?” The answer depends on your data. If you do not have much data in the first place, sampling is more trouble than it is worth. If you have a lot of data, but it is fairly uniform, or it is not critical to capture everything that might interest you right now, a simple sampling strategy is often enough. If many different conditions matter to you, or your services have irregular traffic patterns, you will need a more sophisticated sampling strategy.
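As a sketch of what a simple sampling strategy can look like, a head-based probabilistic sampler keeps roughly 1 in N events and records the rate it used on each kept event. The event shape and `sample_rate` field here are illustrative, not a Honeycomb API:

```python
import random

def should_sample(rate):
    """Keep roughly 1 out of every `rate` events."""
    return random.random() < 1.0 / rate

def process(events, rate=10):
    """Illustrative pipeline stage: each kept event records the rate
    it was sampled at, so query results can be re-weighted later."""
    kept = []
    for event in events:
        if should_sample(rate):
            event["sample_rate"] = rate  # each kept event stands in for `rate` events
            kept.append(event)
    return kept
```

Recording the sample rate on every kept event is what lets a backend weight counts back up to estimates of the true totals.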

Honeycomb offers tools to help you sample your data in a way that is tailored to your needs.

When to Sample: 1000 requests per second 

If your service receives more than 1000 requests per second, you should strongly consider sampling. Let your cost and resolution requirements determine your optimal sample rate.
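To see how the cost side of that trade-off works, here is a rough back-of-the-envelope helper. The numbers and the function name are illustrative, not Honeycomb pricing or API:

```python
def required_sample_rate(requests_per_sec, target_events_per_sec):
    """Smallest integer 1-in-N sample rate that keeps stored volume
    at or below the target. Illustrative helper, not a Honeycomb API."""
    if requests_per_sec <= target_events_per_sec:
        return 1  # no sampling needed
    # Ceiling division: round the rate up so the target is never exceeded.
    return -(-requests_per_sec // target_events_per_sec)

# e.g. 5000 req/s with a budget of 200 stored events/s -> sample 1 in 25
```

Raising the rate lowers cost but also lowers resolution: at 1 in 25, a condition that occurs once per minute will only appear in the sampled data a couple of times per hour.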

Sampling and Tracing 

Sampling is an essential part of tracing at a large scale. Consider different kinds of traces:

  • Traces that finish successfully with no errors
  • Traces with specific attributes on them
  • Traces with high latency
  • Traces with errors on them

The large majority of the time, the first kind (traces that finish successfully with no errors) far outnumbers all the other kinds. Traces in this category are still critical to have because they represent what “healthy” behavior looks like, and you will need them for comparison when looking at the other kinds of traces. However, you most likely do not need all of them; a sample of these traces is enough to understand the overall health of your system.

The other three categories are much more interesting, and depending on your needs, you may want to take larger samples (or even 100%) of these traces.
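One common way to act on that distinction is to choose a per-trace sample rate once the trace has finished, keeping all of the interesting categories and thinning the healthy ones. A minimal sketch, assuming illustrative field names and thresholds:

```python
def trace_sample_rate(trace):
    """Pick a 1-in-N rate for a finished trace. A rate of 1 means
    keep every such trace. Field names and the latency threshold
    are illustrative, not a Honeycomb schema."""
    if trace.get("error"):
        return 1    # keep 100% of traces with errors
    if trace.get("duration_ms", 0) > 1000:
        return 1    # keep 100% of high-latency traces
    return 100      # keep 1 in 100 healthy traces
```

Because the decision depends on attributes only known when the trace completes (errors, total duration), this style of rule requires tail-based sampling, where whole traces are buffered before the keep/drop decision is made.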

COUNT_DISTINCT and Sampling 

Use the COUNT_DISTINCT operator in Query Builder with care when working with sampled data.

COUNT_DISTINCT uses the HyperLogLog algorithm, which is designed to work on an entire population of data. As a result, COUNT_DISTINCT can only count items that are actually present in the data it examines.

COUNT_DISTINCT cannot and does not compensate for sampling rate. However, when you use COUNT_DISTINCT in a query, you can view the average sample rate for that query. It appears in the metadata below the result summary table, alongside the elapsed query time and rows examined fields, and displays the average sample rate across all underlying events included in the query result.
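A small simulation makes the limitation concrete (synthetic data; `user_id` is an illustrative field): a plain count can be re-weighted by the sample rate, but distinct values whose events were all dropped by sampling are gone for good:

```python
import random

random.seed(7)

# Synthetic population: 1000 events drawn from up to 300 distinct users.
events = [{"user_id": random.randrange(300)} for _ in range(1000)]

# Sample 1 in 10, head-based.
rate = 10
sampled = [e for e in events if random.random() < 1.0 / rate]

# A plain COUNT can be corrected by weighting each event by its rate...
estimated_count = len(sampled) * rate  # roughly 1000

# ...but a distinct count over the sample only sees users whose events
# survived sampling; no amount of re-weighting recovers the dropped ones.
distinct_in_sample = len({e["user_id"] for e in sampled})
distinct_in_full = len({e["user_id"] for e in events})
```

Multiplying `distinct_in_sample` by the sample rate would not help either: it would overcount users whose events appear many times and still miss users whose events were all dropped.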

Until there is a way to accurately count unique values in sampled data sets, use COUNT_DISTINCT with caution on sampled data. To examine your data without compensating for sample rates, use Honeycomb’s Usage Mode.

Usage Mode 

Sometimes, you may want to run calculations on only the sampled data itself, with no weighting of counts by sample rate. Use the Usage Center’s Usage Mode feature to do this.