We recommend the following guidelines for sampling your data.
Primary reasons to sample data, all of which are typically related to one another, include:
The biggest question that comes up with sampling is: “How do I make sure I am not missing important data I might need?” The answer depends on your data. If you do not have much data in the first place, sampling is more trouble than it is worth. If you have a lot of data, but it is fairly uniform, or it is not critical that you capture everything that may interest you right now, you can often get away with a simple sampling strategy. If you have many conditions that matter to you, or irregular traffic patterns across your services, you will need a more sophisticated sampling strategy.
Honeycomb offers tools to help you sample your data in a way that is tailored to your needs.
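If your service receives more than 1000 requests per second, you should strongly consider sampling. Choose your sample rate by weighing cost against the resolution you need: a higher sample rate stores fewer events and costs less, but each stored event then stands in for more of your original traffic.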
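One way to make the cost side of that trade-off concrete is to work backward from an event budget. The sketch below assumes integer sample rates where a rate of N means one event in N is kept. The function name and the traffic figures are hypothetical and only illustrate the arithmetic.

```python
import math


def sample_rate_for_budget(incoming_per_second: float, budget_per_second: float) -> int:
    """Return the smallest integer sample rate N (keep 1 in N) that fits the budget."""
    if incoming_per_second <= 0 or budget_per_second <= 0:
        raise ValueError("rates must be positive")
    # A rate of 1 means "keep everything"; never go below it.
    return max(1, math.ceil(incoming_per_second / budget_per_second))


# Hypothetical numbers: 5,000 requests/s arriving, budget for 250 stored events/s.
rate = sample_rate_for_budget(5000, 250)
kept_per_second = 5000 / rate
```

A rate computed this way only covers cost. Whether the resulting resolution is acceptable still depends on how rare the events you care about are.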
Sampling is an essential part of tracing at a large scale. Consider different kinds of traces:
Most of the time, the first kind (traces that finish successfully with no errors) far outnumbers all the other kinds. Traces in the first category are still critical to have because they represent what “healthy” behavior looks like, and you will need them for comparison when you are looking at the other kinds of traces. However, you most likely do not need all of them. A sample of these traces will be enough to understand the overall health of your system.
The other three categories are much more interesting, and depending on your needs, you may want to take larger samples (or even 100%) of these traces.
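A minimal sketch of per-category sampling might look like the following. It is not Honeycomb's implementation: the category label "healthy", the rates, and the function name are all placeholders, and any category without an explicit rate is kept in full. The decision is derived from a hash of the trace ID so that every span in a trace gets the same answer.

```python
import hashlib

# Hypothetical rates: keep 1 in 50 healthy traces; every other category is kept in full.
SAMPLE_RATES = {"healthy": 50}
DEFAULT_RATE = 1


def sampling_decision(trace_id: str, category: str) -> tuple[bool, int]:
    """Return (keep, sample_rate) for a trace.

    Hashing the trace ID makes the decision deterministic, so all spans
    that share a trace ID are kept or dropped together.
    """
    rate = SAMPLE_RATES.get(category, DEFAULT_RATE)
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % rate == 0, rate
```

The returned sample rate should be recorded on each kept event so that later calculations can weight it correctly.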
COUNT_DISTINCT and Sampling

Use the COUNT_DISTINCT operator in Query Builder with care when working with sampled data. COUNT_DISTINCT uses the HyperLogLog algorithm, which is designed to work on an entire population of data. Therefore, it is only possible for COUNT_DISTINCT to count items that are actually present in the data. COUNT_DISTINCT cannot and does not compensate for sampling rate.

However, when using COUNT_DISTINCT in a query, users can view the average sample rate for the query. Locate it in the metadata below the result summary table, with the elapsed query time and rows examined fields. The average sample rate field displays the average sample rate across all underlying events included in the query result.

Until there’s a way to accurately count unique values in sampled sets, use COUNT_DISTINCT with caution when working with sampled data.
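To see why a distinct count cannot be corrected for sampling, consider the small simulation below. It is only an illustration: it uses an exact Python set rather than HyperLogLog, the traffic is randomly generated, and the field names user_id and sample_rate are assumptions. The average is taken as a plain mean over the stored events.

```python
import random

random.seed(7)  # fixed seed so the run is repeatable

SAMPLE_RATE = 10  # keep roughly 1 event in 10

# Hypothetical traffic: 20,000 events spread over 5,000 possible user IDs.
events = [{"user_id": random.randrange(5000)} for _ in range(20000)]

# Stored events carry the rate they were sampled at.
stored = [
    dict(e, sample_rate=SAMPLE_RATE)
    for e in events
    if random.randrange(SAMPLE_RATE) == 0
]

true_distinct = len({e["user_id"] for e in events})
stored_distinct = len({e["user_id"] for e in stored})  # all a distinct count can see

# Average sample rate across the stored events.
average_sample_rate = sum(e["sample_rate"] for e in stored) / len(stored)

# Multiplying by the sample rate is not a fix: users who appear in several
# stored events are counted once, so the scaled figure does not track the truth.
naive_scaled_distinct = stored_distinct * average_sample_rate
```

In this setup, stored_distinct falls well below true_distinct, while naive_scaled_distinct overshoots it. The average sample rate tells you how heavily the data was sampled, but it does not yield a reliable correction factor for unique values.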
Use Honeycomb’s Usage Mode to examine data without compensating for sample rates.
Sometimes, you may want to run calculations on only the sampled data with no weighting of counts. You can use the Usage Center’s Usage Mode feature to do this.
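The difference between weighted and unweighted counts can be sketched as follows. This is an assumption-laden illustration, not how Usage Mode is implemented: the events, the sample_rate field name, and the count helper are all hypothetical.

```python
# Hypothetical stored events, each tagged with the sample rate it was kept at.
stored_events = [
    {"status": 200, "sample_rate": 100},
    {"status": 200, "sample_rate": 100},
    {"status": 500, "sample_rate": 1},
    {"status": 500, "sample_rate": 1},
    {"status": 500, "sample_rate": 1},
]


def count(events, weighted=True):
    """Count events, optionally weighting each one by its sample rate."""
    return sum(e["sample_rate"] if weighted else 1 for e in events)


weighted_total = count(stored_events)                # 203: estimate of original traffic
stored_total = count(stored_events, weighted=False)  # 5: events actually stored
```

The weighted figure estimates what happened in production, while the unweighted figure reflects what was actually sent and stored, which is the number that matters when reasoning about usage.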