Honeycomb Refinery is currently in closed beta - please email firstname.lastname@example.org if you would like to know more.
Honeycomb Refinery (Beta) is a hosted service to sample traces after they have been ingested.
Sampling at the client has several benefits. Sending less traffic reduces the amount of Honeycomb ingestion you need to pay for, and reduces the CPU and network resources associated with sending events. It is limited, however, because sampling decisions cannot be made on the content of the overall trace. Here are some scenarios that could benefit from smarter sampling:
With Honeycomb Refinery (Beta), you can apply sampling strategies to entire traces rather than individual events.
Refinery is enabled on a dataset with one of two possible sampling strategies (see below). After enabling Refinery, Honeycomb servers begin buffering spans (events with a Trace ID). As events come in, they are grouped by Trace ID. When the root span arrives (or a timeout occurs), the complete trace is assembled from the available spans, and the selected strategy is applied. Refinery makes a sampling decision based on the data available at the time and the specified configuration. This decision is recorded for a period of time, so that spans arriving after the decision is made receive the same decision.
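The buffering flow described above can be sketched as follows. This is a simplified illustration, not Refinery's actual implementation: the field names (`trace_id`, `parent_id`), the timeout value, and the in-memory decision cache are all assumptions made for the sake of the example.

```python
import time

class TraceBuffer:
    """Toy sketch: group incoming spans by trace ID, apply a sampling
    strategy when the trace is complete (root span seen) or times out,
    and remember the decision so late-arriving spans get the same one."""

    def __init__(self, strategy, timeout_s=60.0):
        self.strategy = strategy   # callable: list of spans -> bool (keep?)
        self.timeout_s = timeout_s
        self.pending = {}          # trace_id -> (first_seen, [spans])
        self.decisions = {}        # trace_id -> keep?

    def add_span(self, span):
        tid = span["trace_id"]
        # Late-arriving span: reuse the recorded decision.
        if tid in self.decisions:
            return self.decisions[tid]
        first_seen, spans = self.pending.setdefault(tid, (time.time(), []))
        spans.append(span)
        is_root = span.get("parent_id") is None
        timed_out = time.time() - first_seen > self.timeout_s
        if is_root or timed_out:
            keep = self.strategy(spans)
            self.decisions[tid] = keep
            del self.pending[tid]
            return keep
        return None  # decision deferred until the trace completes
```

A child span buffered before the root arrives gets no decision yet; once the root span triggers the strategy, the recorded decision is applied to any stragglers.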
This strategy delivers a given overall sample rate by weighting rare traffic and frequent traffic differently, so that the average works out correctly. Frequent traffic is sampled more heavily, while rarer events are kept or sampled at a lower rate. This is the strategy to use if you want to keep high-resolution data about unusual events while maintaining a representative sample of your application’s behavior.
To see how this differs from random sampling in practice, consider a simple web service with the following characteristics:
If we sample events randomly, we can still see these characteristics. We can run aggregate analyses - for example, the average duration of an event, broken down by fields like status code, endpoint, or customer_id. At a high level, a completely random sample still tells us a lot about our data. But what about those 50x errors? Typically, we’d like to look at these in high resolution - they might all have different causes, or affect only a subset of customers. Discarding them at the same rate as events describing healthy traffic is unfortunate - they are much more interesting! Here’s where dynamic sampling can help.
With Dynamic Refinery, you set a target sample rate. Dynamic Refinery will try to maintain that rate, but will adjust the sample rate of individual traces and events based on their frequency. To achieve the target, it increases the sample rate on common events, while lowering it for less common events, all the way down to 1 (keeping every rare event).
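One simple way to picture this adjustment: scale the target rate by how over-represented each category of traffic is. This is an illustrative sketch, not the exact algorithm Refinery uses; the keys and counts are hypothetical.

```python
from collections import Counter

def compute_sample_rates(key_counts, target_rate):
    """Assign a per-key sample rate: frequent keys are sampled more
    heavily, rare keys fall to rate 1 (kept), and the overall rate
    stays near the target when traffic is spread across keys."""
    total = sum(key_counts.values())
    n_keys = len(key_counts)
    rates = {}
    for key, count in key_counts.items():
        # Scale the target rate by this key's share of traffic.
        rates[key] = max(1, round(target_rate * count * n_keys / total))
    return rates

# Traffic dominated by 200s, with a handful of 500s.
counts = Counter({"200": 9800, "404": 180, "500": 20})
rates = compute_sample_rates(counts, target_rate=10)
# The common 200s get a high sample rate; the rare 500s get rate 1
# and are all kept.
```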
Let’s look at how random sampling vs dynamic sampling compare for our hypothetical web app:
As the illustration above demonstrates, random sampling will result in the rarer 50x events getting tossed out at the same rate as the much more common 200 traffic. With dynamic sampling, we sample the common traffic at a higher rate, while preserving more of the 50x events. Effectively, we’re trading some high-resolution data for common events in exchange for high-resolution data on rarer events.
Now what happens when you look at them in aggregate?
Because the Honeycomb query engine adjusts for sample rate when rendering graphs, we’re able to do the same aggregate analyses of the dataset that we could do before.
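The adjustment works because each kept event carries its sample rate, so it can stand in for that many original events. A minimal sketch of a sample-rate-weighted average (field names are illustrative):

```python
def weighted_avg_duration(sampled_events):
    """Each kept event represents `sample_rate` original events, so
    weighting by sample rate keeps the aggregate unbiased."""
    total_weight = sum(e["sample_rate"] for e in sampled_events)
    weighted_sum = sum(e["duration_ms"] * e["sample_rate"]
                       for e in sampled_events)
    return weighted_sum / total_weight

events = [
    {"duration_ms": 100, "sample_rate": 10},  # common, heavily sampled
    {"duration_ms": 500, "sample_rate": 1},   # rare, kept in full
]
# Estimated average over the *original* traffic:
# (100*10 + 500*1) / (10 + 1) ~= 136.4 ms
```

A plain unweighted average of the kept events (300 ms) would badly overstate the rare slow requests; the weighted version recovers the true mix.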
To enable Dynamic Refinery, click on the Refinery tab in your Dataset’s Settings page.
Check the box labeled Enable Refinery for this dataset, then choose Dynamic Refinery. Next, you need to choose a target sample rate - the rate that the sampler will try to maintain. For example, if you choose 10, Refinery will try to adjust sample rates dynamically so that 1 in 10 traces is kept.
For Fields Sampled, you will need to think about your dataset’s schema, and which fields help categorize your traffic. In our example, we used HTTP status code, but you can supply multiple keys. HTTP status code is interesting, but what if you want to keep events describing the http status codes encountered by individual customers? You could add your customer ID field as well - Dynamic Refinery accepts up to 10 fields. Think about which dimensions make your traffic interesting: request path, build version, error code, etc.
Choosing continuous fields like duration_ms, or arbitrary unique values like a request ID, will effectively cause every event and trace to be kept, since all traffic will be considered rare. As a rule, if you wouldn’t group by or break down on a field, it is likely not a good candidate for the Dynamic Refinery strategy.
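To see why continuous or unique fields defeat the strategy, consider how they bucket traffic (a toy sketch; the events and field names are made up for illustration):

```python
events = [
    {"status_code": 200, "duration_ms": 12.31},
    {"status_code": 200, "duration_ms": 12.35},
    {"status_code": 500, "duration_ms": 980.02},
]

# Good key: a few distinct values, so common traffic shares buckets
# and can be sampled heavily.
status_keys = {str(e["status_code"]) for e in events}    # 2 buckets

# Bad key: continuous values put nearly every event in its own bucket,
# so everything looks "rare" and gets kept at rate 1.
duration_keys = {str(e["duration_ms"]) for e in events}  # 3 buckets
```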
A large proportion of your traffic is dominated by a few customers, but you also have many smaller customers that you’d like to keep data from. Simple random sampling causes data to be dropped for these smaller customers, and you don’t want to spend lots of money storing all event data just to keep high resolution data about the smaller customers. Dynamic sampling can help here by adjusting the sample rate to be proportional to the traffic coming in from each customer. Here, the field that differentiates traffic is the customer_id field, so we’ll set a dynamic sampling policy with that field.
After committing the configuration, we can look at Usage Mode to see the new sampling rates take effect. We can visualize HEATMAP(Sample Rate) and AVG(Sample Rate), breaking down on customer_id, to observe how sample rates are adjusted proportionally to traffic.
Later, you might think: I also care about individual customer errors! Sampling customer traffic randomly, even at different rates, could still mean missing data you care about, like a rare 500 error. Let’s add status_code as another field in our Dynamic Refinery key.
You may have noticed the Add Dynamic Sampling Key box we’ve checked. When checked, this adds a new field to your dataset - meta.dynamic_sampling_key - which shows you the key value used when sampling the traffic. This key is assembled from the fields you’ve selected in the Dynamic Sampling Strategy configuration. You can use it to understand which sample rates are being applied to each category of traffic.
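Conceptually, the key is just the selected field values combined per event. A hypothetical sketch (the separator and exact assembly are assumptions, not Refinery's documented format):

```python
def dynamic_sampling_key(event, fields):
    """Build the bucket key from the configured fields; events sharing
    a key share a sample rate. Separator choice is illustrative."""
    return ",".join(str(event.get(f, "")) for f in fields)

event = {"customer_id": "cust_42", "status_code": 500, "duration_ms": 87}
key = dynamic_sampling_key(event, ["customer_id", "status_code"])
# key == "cust_42,500"
```

Grouping by this field in a query shows exactly which traffic categories exist and how aggressively each is being sampled.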
The Rule-based Refinery strategy allows you to define sampling rates explicitly based on the contents of your traces. Using a filter language similar to the one you see when running queries, you can define conditions on fields across all spans in your trace. For instance, if your root span has a status_code field, and the span wrapping your database call has an error field, you can define a condition that must be met on both fields, even though the two fields live on separate events. You can supply a sample rate to use when a match is found, or optionally drop all events in that category. Some examples of rules you might want to specify:
Rules are evaluated in order, and the first match is used. For this reason, define more specific rules at the top of the list of rules, and broader rules at the bottom. If no rules match, a configurable default sampling rate is applied.
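The first-match-wins evaluation can be sketched as a simple ordered scan. The rules, field names, and rates below are hypothetical examples, not Refinery's configuration syntax:

```python
def evaluate_rules(trace_fields, rules, default_rate):
    """First matching rule wins; a rate of None means drop the trace.
    If nothing matches, the default sample rate applies."""
    for conditions, rate in rules:
        if all(trace_fields.get(f) == v for f, v in conditions.items()):
            return rate
    return default_rate

# Most specific rules first, broad rules last (field names illustrative).
rules = [
    ({"status_code": 500, "error": True}, 1),  # keep every errored 500
    ({"endpoint": "/healthz"}, None),          # drop health checks entirely
    ({"status_code": 200}, 100),               # heavily sample healthy traffic
]
# A trace matching no rule, e.g. a 404, falls through to the default rate.
```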
You have a large amount of event data, and want to reduce your ingestion by 90%. You also have some specific types of data that you don’t need at all, and some that you absolutely do not want to miss: