
Supported Sampling Methods

In rules.toml, you can specify different sampling methods and set options for each. You can use the same sampling method and rate for all your datasets, or you can define specific sampling strategies and rules for each dataset.

The default configuration uses the DeterministicSampler and a SampleRate of 1, meaning that no traffic will be dropped.
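
As a sketch, that default corresponds to the following top-level settings in rules.toml; the per-dataset section shown below it is hypothetical, illustrating how a single dataset can override the default:

    Sampler = "DeterministicSampler"
    SampleRate = 1

    # hypothetical override for one dataset
    [my-dataset]
        Sampler = "DeterministicSampler"
        SampleRate = 10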

Sampling methods and rules are configured in rules.toml. See GitHub for an example rules file.

After setting up or modifying sampling rules for your dataset(s), we recommend validating your configuration and doing a Dry Run before dropping your traffic.

Sampling Types  🔗

The available sampling methods are DeterministicSampler, DynamicSampler, EMADynamicSampler, and RulesBasedSampler.

EMADynamicSampler is recommended for most use cases.

Sampling Example  🔗

Here’s an example of how we sample events from Honeycomb’s ingest service. Since this is a high-volume service, we have chosen to use the EMA Dynamic Sampler with a goal sample rate of 1 in 50 traces.

Here’s what our rules.toml file looks like:

    [IngestService] # the name of the dataset we are sampling
        Sampler = "EMADynamicSampler"
        GoalSampleRate = 50
        FieldList = ["request.method","request.path","app.dataset.id","response.status_code","app.team.id"]
        UseTraceLength = false
        AddSampleRateKeyToTrace = true
        AddSampleRateKeyToTraceField = "meta.refinery.dynsampler_key"
        AdjustmentInterval = 60
        MaxKeys = 10000

The most important fields in this example are GoalSampleRate and FieldList. Our goal sample rate aims to keep 1 out of every 50 traces seen. This rate is used by the EMA Dynamic Sampler, which assigns a sample rate for each trace based on the sampling key generated from the fields in FieldList. A useful FieldList selection will therefore have consistent values for high-frequency boring traffic and unique values for outliers and interesting traffic. For example, we’ve included response.status_code in the field list in addition to the HTTP endpoint (represented here by request.method and request.path), because it allows us to clearly see when there is failing traffic to any endpoint.

We’ve chosen not to UseTraceLength, which adds the number of spans in the trace to the sampling key. For our ingest service, trace length is not a useful indicator of which types of events we’d like to see sampled.

The AddSampleRateKeyToTrace and AddSampleRateKeyToTraceField config fields we’ve enabled are convenience options that help us understand why the sampler made specific decisions. Examining these fields in your data in Honeycomb may help you decide which fields to add to your FieldList config option going forward.

The AdjustmentInterval field defaults to 15 seconds, and determines how often the moving average used by the sampler is adjusted. We’ve chosen to increase this value to 60 seconds, as it’s not necessary for us to evaluate changes more often.

By setting MaxKeys, we’ve chosen to limit the number of distinct keys tracked by the EMA Dynamic Sampler. We use this field to keep the sample rate map size from spiraling out of control.

You can read more about all the configuration options for the EMA Dynamic Sampler in the section below.

Dynamic Sampling  🔗

This strategy aims for the target sample rate, weighting rare traffic and frequent traffic differently so as to end up with the correct average. Frequent traffic is sampled more heavily, while rarer events are kept or sampled at a lower rate. Use this strategy to keep high-resolution data about unusual events while maintaining a representative sample of your application’s overall behavior.

Briefly described: you configure Refinery to examine the trace for a set of fields, for example request.status_code and request.method. It collects all the values found in those fields anywhere in the trace (for example, “200” and “GET”) into a key that it hands to the dynsampler. The dynsampler code looks at how frequently that key appeared during the previous 30 seconds (or whatever the ClearFrequencySec setting specifies) and uses that frequency to hand back a desired sample rate. More frequent keys are sampled more heavily, so that an even distribution of traffic across the keyspace is represented in Honeycomb.
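
As an informal sketch of that flow (the exact way Refinery joins field values into a key is an implementation detail; a comma is used here only for readability):

    # FieldList = ["request.method", "request.status_code"]
    # values found across the trace's spans:  "GET", "200"
    # resulting dynsampler key (informal):    "GET,200"
    # the key's frequency over ClearFrequencySec determines its sample rate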

By selecting fields well, you can drop significant amounts of traffic while still retaining good visibility into the areas of traffic that interest you. For example, if you want to make sure you have a complete list of all URL handlers invoked, you would add the URL, or a normalized form of it, as one of the fields to include. Be careful in your selection, though: if the combination of fields creates a unique key each time, you won’t sample out any traffic. For this reason, it is not effective to use fields that have unique values, like a UUID, as one of the sampling fields. Each field included should ideally have values that appear many times within any given 30-second window in order to effectively turn into a sample rate.
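
As a sketch of this guidance (the field names are illustrative):

    # Reasonable: bounded cardinality, but still distinguishes failures per endpoint
    FieldList = ["request.method", "request.path", "response.status_code"]

    # Ineffective: a UUID-like field makes every key unique, so nothing is sampled out
    # FieldList = ["request.id"]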

To see how this differs from random sampling in practice, consider a simple web service with the following characteristics: 90% of traffic is served correctly and returns a 200 response code, while the remaining 10% is a mix of 4xx and 5xx responses. If we sample events randomly, we can still see these characteristics. We can analyze aggregates, such as the average duration of an event, breaking down on fields like status code, endpoint, customer_id, and so on. At a high level, we can still learn a lot about our data from a completely random sample.

But what about those 5xx errors? Typically, we’d like to look at these errors in high resolution: they might all have different causes, or affect only a subset of customers. Discarding them at the same rate that we discard events describing healthy traffic is unfortunate, because the errors are much more interesting! Here’s where dynamic sampling can help.

Dynamic sampling adjusts the sample rate of traces and events based on their frequency. To achieve the target sample rate, it increases sampling on common events while lowering the sample rate for less common events, all the way down to 1, which keeps unique events.
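
The numbers below are purely hypothetical, but they sketch the intended effect for a dynamic sampler whose target rate is 1 in 50:

    # key (informal)        traces seen    assigned rate (hypothetical)
    # "GET,/home,200"       100,000        ~ 1 in 200
    # "GET,/home,500"       200            ~ 1 in 2
    # "POST,/export,503"    1              1 in 1 (kept)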

Dynamic Sampler Configuration  🔗

The dynamic sampler configuration has the following fields:

SampleRate
The goal rate at which to sample. It indicates a ratio, where one sample trace is kept for every n traces seen. For example, a SampleRate of 30 will keep 1 out of every 30 traces. This rate is handed to the dynamic sampler, which assigns a sample rate for each trace based on the fields selected from that trace. Eligible for live reload.
ClearFrequencySec
Determines the period, in seconds, over which the sample rate is calculated. This setting defaults to 30. Eligible for live reload.
FieldList
A list of all the field names to use to form the key that will be handed to the dynamic sampler. The cardinality of the combination of values from all of these keys should be reasonable in the face of the frequency of those keys. If the combination of fields in these keys essentially makes them unique, the dynamic sampler will do no sampling. If the keys have too few values, you won’t get samples of the most interesting traces. A good key selection will have consistent values for high frequency boring traffic and unique values for outliers and interesting traffic. Including an error field (or something like HTTP status code) is an excellent choice. Field names may come from any span in the trace. Eligible for live reload.
UseTraceLength
When set to true, this field adds the number of spans in the trace into the dynamic sampler as part of the key. The number of spans is exact, so if there are normally small variations in trace length, you may want to leave this off. If traces are consistent lengths and changes in trace length are a useful indicator of traces you’d like to see in Honeycomb, set this to true. Eligible for live reload.
AddSampleRateKeyToTrace
When set to true, the sampler will add a field to the root span of the trace containing the key used by the sampler to decide the sample rate. This can be helpful in understanding why the sampler is making certain decisions about sample rate and help you understand how to better choose the sample rate key (aka the FieldList setting above) to use.
AddSampleRateKeyToTraceField
The name of the field that the sampler will use when adding the sample rate key to the trace. This setting is only used when AddSampleRateKeyToTrace is set to true.
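
Taken together, a DynamicSampler configuration using only the fields described above might look like this, where my-dataset is a hypothetical dataset name:

    [my-dataset]
        Sampler = "DynamicSampler"
        SampleRate = 30
        ClearFrequencySec = 30
        FieldList = ["request.method", "response.status_code"]
        UseTraceLength = false
        AddSampleRateKeyToTrace = true
        AddSampleRateKeyToTraceField = "meta.refinery.dynsampler_key"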

EMA Dynamic Sampling  🔗

The Exponential Moving Average (EMA) Dynamic Sampler is an improvement upon DynamicSampler and is recommended for most use cases. Based on the DynamicSampler implementation, EMADynamicSampler differs in that rather than computing rates from a periodic snapshot of traffic, it maintains an Exponential Moving Average of counts seen per key and adjusts this average at regular intervals. The weight applied to more recent intervals is defined by the Weight setting, a number between 0 and 1. Larger values weight the average more toward recent observations. In other words, a larger Weight will cause sample rates to adapt more quickly to traffic patterns, while a smaller Weight will result in sample rates that are less sensitive to bursts or drops in traffic and thus more consistent over time.
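
Concretely, this description corresponds to the standard EMA update, with the Weight setting playing the role of alpha:

    # applied to each key's count every AdjustmentInterval:
    #   EMA_new = Weight * count_this_interval + (1 - Weight) * EMA_old
    # larger Weight: recent intervals dominate; smaller Weight: smoother average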

Keys that are not found in the Exponential Moving Average will always have a sample rate of 1. Keys that occur more frequently will be sampled on a logarithmic curve. In other words, every key will be represented at least once in any given window. More frequent keys will have their sample rate increased proportionally to wind up with the goal sample rate.

EMADynamicSampler Configuration  🔗

The EMADynamicSampler configuration has the following fields:

GoalSampleRate
The goal rate at which to sample. It indicates a ratio, where one sample trace is kept for every n traces seen. For example, a GoalSampleRate of 30 will keep 1 out of every 30 traces. This rate is handed to the dynamic sampler, which assigns a sample rate for each trace based on the fields selected from that trace. Eligible for live reload.
FieldList
A list of all the field names to use to form the key that will be handed to the dynamic sampler. The cardinality of the combination of values from all of these keys should be reasonable in the face of the frequency of those keys. If the combination of fields in these keys essentially makes them unique, the dynamic sampler will do no sampling. If the keys have too few values, you won’t get samples of the most interesting traces. A good key selection will have consistent values for high frequency boring traffic and unique values for outliers and interesting traffic. Including an error field (or something like HTTP status code) is an excellent choice. Field names may come from any span in the trace. Eligible for live reload.
UseTraceLength
When set to true, this field adds the number of spans in the trace into the dynamic sampler as part of the key. The number of spans is exact, so if there are normally small variations in trace length, you may want to leave this off or set it to false. If traces are consistent lengths and changes in trace length are a useful indicator of traces that you’d like to see in Honeycomb, set this to true. Eligible for live reload.
AddSampleRateKeyToTrace
When set to true, the sampler will add a field to the root span of the trace containing the key used by the sampler to decide the sample rate. This can be helpful in understanding why the sampler is making certain decisions about sample rate and help you understand how to better choose the sample rate key (aka the FieldList setting above) to use.
AddSampleRateKeyToTraceField
The name of the field the sampler will use when adding the sample rate key to the trace. This setting is only used when AddSampleRateKeyToTrace is set to true.
AdjustmentInterval
Defines how often (in seconds) we adjust the moving average from recent observations. The default is 15. Eligible for live reload.
Weight
A value between 0 and 1 indicating the weighting factor used to adjust the EMA. With larger values, newer data will influence the average more, and older values will be factored out more quickly. In mathematical literature concerning EMA, this is referred to as the alpha constant. The default is 0.5. Eligible for live reload.
MaxKeys
If set to a number greater than 0, this field limits the number of distinct keys tracked in the EMA. Once MaxKeys is reached, new keys will not be included in the sample rate map, but existing keys will continue to be counted. You can use this to keep the sample rate map size under control. Eligible for live reload.
AgeOutValue
Indicates the threshold for removing keys from the EMA. The EMA of any key will approach 0 if it is not repeatedly observed, but will never truly reach it, so we have to decide what constitutes “zero” with the AgeOutValue field. Keys with averages below this threshold will be removed from the EMA. The default is the same as the default for Weight, since this prevents a key with the smallest integer value (1) from being aged out immediately. This value should generally be less than or equal to (<=) Weight, unless you have very specific reasons to set it higher. Eligible for live reload.
BurstMultiple
If set, this field value is multiplied by the sum of the running average of counts to define the burst detection threshold. If total counts observed for a given interval exceed the threshold, EMA is updated immediately rather than waiting on the AdjustmentInterval. Using a negative value disables this field. With the default of 2, if your traffic suddenly doubles, burst detection will kick in. Eligible for live reload.
BurstDetectionDelay
Indicates the number of intervals to run after Start is called before burst detection kicks in. Defaults to 3. Eligible for live reload.
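
As a sketch, here is the earlier IngestService example extended with the tuning fields described above; the values shown are the documented defaults except where noted:

    [IngestService]
        Sampler = "EMADynamicSampler"
        GoalSampleRate = 50
        FieldList = ["request.method", "request.path", "response.status_code"]
        AdjustmentInterval = 60     # raised from the 15-second default
        Weight = 0.5                # default
        MaxKeys = 10000
        AgeOutValue = 0.5           # default: same as Weight
        BurstMultiple = 2           # default
        BurstDetectionDelay = 3     # default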

Rule-Based Sampling  🔗

This strategy allows you to define sampling rates explicitly based on the contents of your traces. Using a filter language similar to what you see when running queries, you can define conditions on fields across all spans in your trace. For instance, if your root span has a status_code field, and the span wrapping your database call has an error field, you can define a condition that must be met on both fields, even though the two fields technically live on separate events. You can supply a sample rate to use when a match is found, or optionally drop all events in that category. Some examples of rules you might want to specify:

  • Drop all traces for your load balancer’s health-check endpoint
  • Keep all traces where the status code was 50x (sample rate of 1)
  • Keep all traces where status code was 200 but database duration was greater than (>) 500ms
  • Keep all traces for a specific customer id while sampling the rest of your traffic at 1 in 100 traces

Rules are evaluated in order, and the first rule that matches is used. For this reason, define more specific rules at the top of the list of rules, and broader rules at the bottom. The conditions making up a rule are combined and must all evaluate to true for the rule to match. If no rules match, a configurable default sampling rate is applied.

Rule-Based Sampling Configuration  🔗

Here’s an example of a series of rules defined for a specific dataset:

    [[dataset4.rule]]
        name = "drop healthchecks"
        drop = true
        [[dataset4.rule.condition]]
            field = "http.route"
            operator = "="
            value = "/health-check"

    [[dataset4.rule]]
        name = "500 errors or slow"
        SampleRate = 1
        [[dataset4.rule.condition]]
            field = "status_code"
            operator = "="
            value = 500
        [[dataset4.rule.condition]]
            field = "duration_ms"
            operator = ">="
            value = 1000.789

    [[dataset4.rule]]
        name = "dynamic sample 200 responses"
        [[dataset4.rule.condition]]
            field = "status_code"
            operator = "="
            value = 200
        [dataset4.rule.sampler.EMADynamicSampler]
            Sampler = "EMADynamicSampler"
            GoalSampleRate = 15
            FieldList = ["request.method", "request.route"]
            AddSampleRateKeyToTrace = true
            AddSampleRateKeyToTraceField = "meta.refinery.dynsampler_key"

    [[dataset4.rule]]
        SampleRate = 10 # default when no other rule matches

Each rule has an optional name field, a SampleRate or sampler, and may include one or more conditions. Use SampleRate to apply a static sample rate to traces that qualify for the given rule. Use a secondary sampler to apply a dynamic sample rate to traces that qualify for the given rule.

The sampling rate is determined in the following order:

  1. Use a secondary sampler, if defined
  2. Use the SampleRate field, which must not be less than 1
  3. If drop = true is specified, then the trace will be omitted
  4. A default sample rate of 1

Each condition in a rule consists of the following:

  • the field within your spans that you would like to sample on
  • the value which you are comparing the field to
  • the operator which you are using to compare the field to the value
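
For example, a single condition combines those three parts (the values here are illustrative):

    [[dataset4.rule.condition]]
        field = "duration_ms"   # the field within your spans
        operator = ">"          # the comparison operator
        value = 500             # the value to compare against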

Here are a few examples of how sampling decisions would be made according to the rules in the configuration example above:

  • If a trace had a span with a http.route field that was equal to /health-check, then that trace would be dropped.
  • If a trace had a span with a status_code field that was equal to 500 and another span with a duration_ms field less than 1000.789, then that trace would fall through to the last configured rule, and thus would be sampled at a rate of 1 out of 10.
  • If a trace had a span with a status_code field that was equal to 500 and another span with a duration_ms field greater than 1000.789, then it would match the second rule and would be kept, because that rule has a SampleRate of 1.
  • If a trace had a span with a status_code field of 200, then that trace would match the third rule and be delegated to the secondary EMADynamicSampler sampler to determine the sample rate.
  • If a trace had a span with a status_code field of 400, then that trace would fall through to the last configured rule, and thus would be sampled at a rate of 1 out of 10.

Using a Secondary Sampler  🔗

A secondary sampler can be specified using the sampler option. You can use any DynamicSampler, EMADynamicSampler, or TotalThroughputSampler as a secondary sampler. Specify the desired sampler as part of the config option, then include configuration options for that sampler; all options for the chosen secondary sampler are available.

Using a secondary sampler combines the precision of rules-based sampling for capturing important events (for example, errors or long-running requests) with the flexibility of dynamic sampling for higher-volume traffic.

Deterministic Sampler  🔗

The Deterministic Sampler is the simplest sampling method. It applies a static sample rate, choosing traces effectively at random to keep or drop at the configured rate. It is not influenced by the contents of the trace.

Deterministic Sampler Configuration  🔗

For deterministic sampling, the only field to set is SampleRate in rules.toml. SampleRate indicates a ratio, where one sample trace is kept for every n traces seen. For example, a SampleRate of 30 will keep 1 out of every 30 traces. The choice of whether to keep any specific trace is random, so the rate is approximate. Eligible for live reload.
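
As a sketch, deterministically sampling a hypothetical dataset at 1 in 30 looks like this:

    [my-dataset]
        Sampler = "DeterministicSampler"
        SampleRate = 30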

Validate Sampling Rules  🔗

Run Refinery in Dry Run Mode  🔗

When getting started with Refinery, or when updating sampling rules, it may be helpful to verify that the rules are working as expected before you start dropping traffic. When dry run mode is enabled, all spans in each trace are marked with the sampling decision in a field called refinery_kept, and all traces are sent to Honeycomb regardless of the sampling decision. You can then run queries in Honeycomb on this field to check your results and verify that the rules are working as intended. Enable dry run mode by adding DryRun = true in your configuration, as noted in rules_complete.toml.
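
A minimal sketch, with the flag set at the top level of the rules file alongside your existing sampler settings:

    DryRun = true

    Sampler = "DeterministicSampler"
    SampleRate = 30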

When dry run mode is enabled, the metric trace_send_kept will increment for each trace, and the metric for trace_send_dropped will remain 0, reflecting that we are sending all traces to Honeycomb.

Use Usage Mode in the Query Builder  🔗

It may also be helpful to use the Usage Mode version of the Query Builder to assess your sampling strategy. Since calculations in this mode do not correct for sample rates, you can check how many actual events match each category for a dynamic sampler.