If using the dataset-only data model, refer to the Honeycomb Classic tab for instructions. Not sure? Learn more about Honeycomb versus Honeycomb Classic.
In rules.toml, you can specify different sampling methods and set options for each.
You can specify sampling methods in a hierarchical fashion: root-level settings act as defaults, and individual environments can override them with their own settings.
It is not possible to specify different types of samplers for different services within the same environment. This would imply sub-trace sampling, which Refinery does not support.
The default configuration uses the DeterministicSampler and a SampleRate of 1, meaning that no traffic will be dropped. These configurations are set through the root-level Sampler and SampleRate fields in the rules configuration. Sampler applies to all environments that do not specify their own Sampler. SampleRate applies to all environments that use an applicable Sampler type and do not specify their own SampleRate.
To avoid issues, we recommend that every installation specify the environment-specific Sampler and (if applicable) SampleRate fields.
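For illustration only, a minimal sketch of root-level defaults with an environment-specific override might look like the following; the environment name production is hypothetical.
# Root-level defaults apply to any environment that does not set its own values
Sampler = "DeterministicSampler"
SampleRate = 1

# Hypothetical environment-specific override
[production]
Sampler = "DeterministicSampler"
SampleRate = 20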
Sampling methods and rules are configured in rules.toml.
See GitHub for an example rules file.
After setting up or modifying sampling rules for your dataset(s), we recommend validating your configuration and doing a Dry Run before dropping your traffic.
In rules.toml, you can specify different sampling methods and set options for each.
You can use the same sampling method and rate for all your datasets, or you can define specific sampling strategies and rules for each dataset.
The default configuration uses the DeterministicSampler and a SampleRate of 1, meaning that no traffic will be dropped. These configurations are set through the root-level Sampler and SampleRate fields in the rules configuration. Sampler applies to all datasets that do not specify their own Sampler. SampleRate applies to all datasets that use an applicable Sampler type and do not specify their own SampleRate.
To avoid issues, we recommend that every installation specify the dataset-specific Sampler and (if applicable) SampleRate fields.
Sampling methods and rules are configured in rules.toml.
See GitHub for an example rules file.
After setting up or modifying sampling rules for your dataset(s), we recommend validating your configuration and doing a Dry Run before dropping your traffic.
Here is an example of how we sample events from Honeycomb’s ingest service. Since this is a high volume service, we have chosen to use the EMA Dynamic Sampler with a target rate of 1/50 traces.
Here is what our rules.toml file looks like:
If using the dataset-only data model, refer to the Honeycomb Classic tab for instructions. Not sure? Learn more about Honeycomb versus Honeycomb Classic.
# The name of the environment being sampled.
# Sampling decisions are applied to every dataset within this environment.
[prod]
Sampler = "EMADynamicSampler"
GoalSampleRate = 50
FieldList = ["request.method","request.path","response.status_code"]
UseTraceLength = false
AddSampleRateKeyToTrace = true
AddSampleRateKeyToTraceField = "meta.refinery.dynsampler_key"
AdjustmentInterval = 60
MaxKeys = 10000
Weight = 0.5
It is also possible to define rules that apply only to a single dataset within an environment.
However, it is not possible to define different sampling decisions for different datasets within the same environment. This would imply sub-trace sampling, which Refinery does not support.
See Rule-Based Sampling Configuration for an example.
[IngestService] # the name of the dataset we are sampling
Sampler = "EMADynamicSampler"
GoalSampleRate = 50
FieldList = ["request.method","request.path","response.status_code"]
UseTraceLength = false
AddSampleRateKeyToTrace = true
AddSampleRateKeyToTraceField = "meta.refinery.dynsampler_key"
AdjustmentInterval = 60
MaxKeys = 10000
Weight = 0.5
The most important fields in this example are GoalSampleRate and FieldList.
Our goal sample rate aims to keep 1 out of every 50 traces seen. This rate is used by the EMA Dynamic Sampler, which assigns a sample rate for each trace based on the sampling key generated by the fields in FieldList. A useful FieldList selection will therefore have consistent values for high-frequency boring traffic and unique values for outliers and interesting traffic.
For example, we have included response.status_code in the field list in addition to the HTTP endpoint (represented here by request.method and request.path), because it allows us to clearly see when there is failing traffic to any endpoint.
We have chosen not to UseTraceLength, which adds the number of spans in the trace to the sampling key. For our ingest service, trace length is not a useful indicator of which types of events we would like to see sampled.
The AddSampleRateKeyToTrace configuration fields we have enabled are convenience fields to help us understand why the sampler made specific decisions. Examining these fields in your data in Honeycomb may help you decide which fields to add to your FieldList configuration option going forward.
The AdjustmentInterval field defaults to 15 seconds, and determines how often the moving average used by the sampler is adjusted. We have chosen to increase this value to 60 seconds, as it is not necessary for us to evaluate changes more often.
By setting MaxKeys, we have chosen to limit the number of distinct keys tracked by the EMA Dynamic Sampler. We use this field to keep the sample rate map size from spiraling out of control.
Read more about all the configuration options for the EMA Dynamic Sampler.
The options available for sampling methods include DeterministicSampler, DynamicSampler, EMADynamicSampler, RulesBasedSampler, and TotalThroughputSampler. EMADynamicSampler is recommended for most use cases.
This strategy aims for the target sample rate, weighting rare traffic and frequent traffic differently so as to end up with the correct average. Frequent traffic is sampled at a higher rate, while rarer events are kept or sampled at a lower rate. Use this strategy to keep high-resolution data about unusual events while maintaining a representative sample of your application’s overall behavior.
Briefly described, you configure Refinery to examine the trace for a set of fields.
For example, request.status_code and request.method.
It collects all the values found in those fields anywhere in the trace - for example, “200” and “GET” - together into a key that it hands to the dynsampler.
The dynsampler code will look at the frequency that key appears during the previous 30 seconds (or other value set by the ClearFrequencySec setting) and use that to hand back a desired sample rate.
More frequent keys are sampled at a higher rate, so that an even distribution of traffic across the keyspace is represented in Honeycomb.
By selecting fields well, you can drop significant amounts of traffic while still retaining good visibility into the areas of traffic that interest you. For example, if you want to make sure you have a complete list of all URL handlers invoked, you would add the URL, or a normalized form, as one of the fields to include. Be careful in your selection, though, because if the combination of fields creates a unique key each time, you will not drop any traffic. Because of this, it is not effective to use fields that have unique values, like a UUID, as one of the sampling fields. Each field included should ideally have values that appear many times within any given 30 second window in order to effectively turn into a sample rate.
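To make this concrete, here is a sketch of a FieldList that samples well versus one that would defeat sampling; the field names are illustrative and should be replaced with fields from your own telemetry.
# Reasonable: values repeat frequently within a window, and errors
# produce distinct keys that are kept at higher resolution
FieldList = ["request.method", "request.path", "response.status_code"]

# Poor choice: a unique-per-request field (for example, a request ID)
# makes nearly every key unique, so almost nothing is dropped
# FieldList = ["request.id"]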
To see how this differs from random sampling in practice, consider a simple web service with the following characteristics: 90% of traffic is served correctly and returns a 200 response code. The remaining 10% of traffic is divided into a mix of 40x and 50x responses.
If we sample events randomly, we can see these characteristics.
We can do analysis of aggregates such as: what is the average duration of an event, breaking down on fields like status code, endpoint, customer_id, and so on.
At a high level, we can still learn a lot about our data from a completely random sample.
But what about those 50x errors?
Typically, we would like to look at these errors in high resolution - they might all have different causes, or affect only a subset of customers.
Discarding them at the same rate that we discard events describing healthy traffic is unfortunate - the errors are much more interesting!
Here is where dynamic sampling can help.
Dynamic sampling will adjust the sample rate of traces and events based on their frequency. To achieve the target sample rate, it increases the sample rate applied to common events while lowering it for less common events, all the way down to 1, so that rare and unique events are kept.
The dynamic sampler configuration has the following fields:
SampleRate: The goal rate at which to sample. For example, a SampleRate of 30 will keep 1 out of every 30 traces. This rate is handed to the dynamic sampler, which assigns a sample rate for each trace based on the fields selected from that trace. Eligible for live reload.
ClearFrequencySec: The length, in seconds, of the window over which the sampler counts how often each key is seen. The default is 30. Eligible for live reload.
FieldList: A list of the field names used to build the key that is handed to the dynamic sampler. The combination of values from these fields should reflect how interesting the traffic is; a combination that is unique for every trace, such as a pod id (k8s.pod.id), is a bad choice. If the combination of fields essentially makes them unique, the dynamic sampler will sample everything. If the combination of fields is not unique enough, you will not be guaranteed samples of the most interesting traces. As an example, consider a combination of HTTP endpoint (high-frequency and boring), HTTP method, and status code (normally boring but can become interesting when indicating an error) as a good set of fields, since it will allow proper sampling of all endpoints under normal traffic and call out when there is failing traffic to any endpoint. The configuration for this would look something like FieldList = ["request.method", "http.target", "response.status_code"]. In contrast, consider a combination of HTTP endpoint, status code, and pod id as a bad set of fields, since it would result in keys that are all unique, and therefore result in sampling 100% of traces. Using only the HTTP endpoint field would also be a bad choice, as it is not unique enough, and therefore interesting traces, like traces that experienced a 500, might not be sampled. Field names may come from any span in the trace. Eligible for live reload.
UseTraceLength: When set to true, this field adds the number of spans in the trace into the dynamic sampler as part of the key. The number of spans is exact, so if there are normally small variations in trace length you may want to leave this off. If traces are consistent lengths and changes in trace length is a useful indicator of traces you would like to see in Honeycomb, set this to true. Eligible for live reload.
AddSampleRateKeyToTrace: When set to true, the sampler will add a field to the root span of the trace containing the key used by the sampler to decide the sample rate. This can be helpful in understanding why the sampler is making certain decisions about sample rate and help you understand how to better choose the sample rate key (the FieldList setting above) to use.
AddSampleRateKeyToTraceField: The name of the field the sampler adds to the root span when AddSampleRateKeyToTrace is set to true.
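Putting these fields together, a minimal DynamicSampler configuration for a hypothetical environment named prod might look like the following sketch; the values shown are illustrative, not recommendations.
[prod]
Sampler = "DynamicSampler"
SampleRate = 20
ClearFrequencySec = 30
FieldList = ["request.method", "http.target", "response.status_code"]
UseTraceLength = false
AddSampleRateKeyToTrace = true
AddSampleRateKeyToTraceField = "meta.refinery.dynsampler_key"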
The Exponential Moving Average (EMA) Dynamic Sampler is an improvement upon DynamicSampler and is recommended for most use cases. Based on the DynamicSampler implementation, EMADynamicSampler differs in that rather than computing the rate from a periodic sample of traffic, it maintains an Exponential Moving Average of counts seen per key, and adjusts this average at regular intervals.
The weight applied to more recent intervals is defined by Weight, as a number between 0 and 1. Larger values weight the average more toward recent observations. In other words, a larger weight will cause sample rates to adapt more quickly to changing traffic patterns, while a smaller weight will result in sample rates that are less sensitive to bursts or drops in traffic and thus more consistent over time.
Keys that are not found in the Exponential Moving Average will always have a sample rate of 1. Keys that occur more frequently will be sampled on a logarithmic curve. In other words, every key will be represented at least once in any given window, and more frequent keys will have their sample rate increased proportionally to wind up with the goal sample rate.
The EMADynamicSampler configuration has the following fields:
GoalSampleRate: The goal rate at which to sample. For example, a GoalSampleRate of 30 will keep 1 out of every 30 traces. This rate is handed to the dynamic sampler, which assigns a sample rate for each trace based on the fields selected from that trace. Eligible for live reload.
FieldList: A list of the field names used to build the key that is handed to the dynamic sampler. The combination of values from these fields should reflect how interesting the traffic is; a combination that is unique for every trace, such as a pod id (k8s.pod.id), is a bad choice. If the combination of fields essentially makes them unique, the dynamic sampler will sample everything. If the combination of fields is not unique enough, you will not be guaranteed samples of the most interesting traces. As an example, consider a combination of HTTP endpoint (high-frequency and boring), HTTP method, and status code (normally boring but can become interesting when indicating an error) as a good set of fields, since it will allow proper sampling of all endpoints under normal traffic and call out when there is failing traffic to any endpoint. The configuration for this would look something like FieldList = ["request.method", "http.target", "response.status_code"]. In contrast, consider a combination of HTTP endpoint, status code, and pod id as a bad set of fields, since it would result in keys that are all unique, and therefore result in sampling 100% of traces. Using only the HTTP endpoint field would be a bad choice, as it is not unique enough, and therefore interesting traces, like traces that experienced a 500, might not be sampled. Field names may come from any span in the trace. Eligible for live reload.
UseTraceLength: When set to true, this field adds the number of spans in the trace into the dynamic sampler as part of the key. The number of spans is exact, so if there are normally small variations in trace length, you may want to leave this field off or set it to false. If traces are consistent lengths and changes in trace length is a useful indicator of traces that you would like to see in Honeycomb, set this to true. Eligible for live reload.
AddSampleRateKeyToTrace: When set to true, the sampler will add a field to the root span of the trace containing the key used by the sampler to decide the sample rate. This can be helpful in understanding why the sampler is making certain decisions about sample rate and help you understand how to better choose the sample rate key (the FieldList setting above) to use.
AddSampleRateKeyToTraceField: The name of the field the sampler adds to the root span when AddSampleRateKeyToTrace is set to true.
AdjustmentInterval: How often the moving average used by the sampler is adjusted. The default is 15s. Eligible for live reload.
Weight: A value between 0 and 1 indicating the weighting factor used to adjust the EMA. With larger values, newer data will influence the average more, and older values will be factored out more quickly. In mathematical literature concerning EMA, this is referred to as the alpha constant. The default is 0.5. Eligible for live reload.
MaxKeys: If set to a value greater than 0, this field limits the number of distinct keys tracked in the EMA. Once MaxKeys is reached, new keys will not be included in the sample rate map, but existing keys will continue to be counted. You can use this to keep the sample rate map size under control. Eligible for live reload.
AgeOutValue: The threshold below which keys are removed from the EMA. The moving average for a key will approach 0 if it is not repeatedly observed, but will never truly reach it, so this field decides what constitutes "zero". Keys with averages below this threshold will be removed from the EMA. The default for this value is the same default as Weight, since this prevents a key with the smallest integer value (1) from being aged out immediately. This value should generally be less than or equal to (<=) Weight, unless you have very specific reasons to set it higher. Eligible for live reload.
BurstMultiple: If set, burst detection is triggered when the number of events seen in the current interval exceeds this multiple of the moving average, and sample rates are adjusted immediately rather than waiting for the next AdjustmentInterval. Using a negative value disables this field. With the default of 2, if your traffic suddenly doubles, burst detection will kick in. Eligible for live reload.
BurstDetectionDelay: The number of intervals to wait after startup before burst detection kicks in. The default is 3. Eligible for live reload.
This strategy allows you to define sampling rates explicitly based on the contents of your traces.
Using a filter language that is similar to what you see when running queries, you can define conditions on fields across all spans in your trace.
For instance, if your root span has a status_code field, and the span wrapping your database call has an error field, you can define a condition that must be met on both fields, even though the two fields are technically separate events.
You can supply a sample rate to use when a match is found, or optionally drop all events in that category.
Some examples of rules you might want to specify: drop all health-check traffic, keep all slow 500 errors, or keep all traces where the status code was 200 but database duration was greater than (>) 500ms.
Rules are evaluated in order, and the first rule that matches is used. For this reason, define more specific rules at the top of the list of rules, and broader rules at the bottom. The conditions making up a rule are combined and must all evaluate to true for the rule to match. If no rules match, a configurable default sampling rate is applied.
If using the dataset-only data model, refer to the Honeycomb Classic tab for instructions. Not sure? Learn more about Honeycomb versus Honeycomb Classic.
Rules apply to all datasets within that environment. Here is an example that specifies several rules for different services in an environment.
# 'prod' is the name of the environment
[prod]
Sampler = "RulesBasedSampler"
# 'prod.rule' is how you specify an environment-wide rule
# This drops all healthchecks across an environment.
[[prod.rule]]
name = "drop healthchecks"
drop = true
[[prod.rule.condition]]
field = "http.route"
operator = "="
value = "/health-check"
# This keeps all slow 500 errors across an environment.
[[prod.rule]]
name = "keep slow 500 errors"
SampleRate = 1
[[prod.rule.condition]]
field = "status_code"
operator = "="
value = 500
[[prod.rule.condition]]
field = "duration_ms"
operator = ">="
value = 1000.789
# This dynamically samples all 200 responses across an environment.
[[prod.rule]]
name = "dynamically sample 200 responses"
[[prod.rule.condition]]
field = "status_code"
operator = "="
value = 200
[prod.rule.sampler.EMADynamicSampler]
Sampler = "EMADynamicSampler"
GoalSampleRate = 15
FieldList = ["request.method", "request.route"]
AddSampleRateKeyToTrace = true
AddSampleRateKeyToTraceField = "meta.refinery.dynsampler_key"
[[prod.rule]]
SampleRate = 10 # default when no rules match, if missing defaults to 10
It is possible to define rules scoped only to a single service dataset within an environment. Here is an example:
# 'prod' is the name of the environment
[prod]
Sampler = "RulesBasedSampler"
# This rule applies to a single service using two conditions: one to scope it to
# a specific dataset, and another to drop traffic from a specific route.
[[prod.rule]]
name = "drop healthchecks"
drop = true
[[prod.rule.condition]]
field = "http.route"
operator = "="
value = "/health-check"
[[prod.rule.condition]]
field = "service.name"
operator = "="
value = "/service1"
Here is an example of a series of rules defined for a specific dataset:
[dataset4]
Sampler = "RulesBasedSampler"
[[dataset4.rule]]
name = "drop healthchecks"
drop = true
[[dataset4.rule.condition]]
field = "http.route"
operator = "="
value = "/health-check"
[[dataset4.rule]]
name = "keep slow 500 errors"
SampleRate = 1
[[dataset4.rule.condition]]
field = "status_code"
operator = "="
value = 500
[[dataset4.rule.condition]]
field = "duration_ms"
operator = ">="
value = 1000.789
[[dataset4.rule]]
name = "dynamically sample 200 responses"
[[dataset4.rule.condition]]
field = "status_code"
operator = "="
value = 200
[dataset4.rule.sampler.EMADynamicSampler]
Sampler = "EMADynamicSampler"
GoalSampleRate = 15
FieldList = ["request.method", "request.route"]
AddSampleRateKeyToTrace = true
AddSampleRateKeyToTraceField = "meta.refinery.dynsampler_key"
[[dataset4.rule]]
SampleRate = 10 # default when no rules match, if missing defaults to 10
Each rule has an optional name field, a SampleRate or sampler, and may include one or more conditions.
Use SampleRate to apply a static sample rate to traces that qualify for the given rule. Use a secondary sampler to apply a dynamic sample rate to traces that qualify for the given rule.
The sampling rate is determined in the following order:
1. The sampler, if defined.
2. The SampleRate field, which must not be less than 1.
3. If drop = true is specified, then the trace will be omitted.
4. Otherwise, a default sample rate of 1.
Each condition in a rule consists of the following:
a field within your spans that you would like to sample on
a value to which you are comparing the field
an operator that you are using to compare the field to the value
a datatype parameter that coerces the field to match a specified type
The datatype parameter is optional and must be one of the following:
"int"
"float"
"string"
"bool"
The datatype parameter is helpful to let a rule handle multiple fields that come in as different data types. For example, it can be common that an http.status_code field comes in as either a string or an integer from different systems. Instead of writing the same rule twice, you can write it once and use the datatype parameter to coerce the field to the same type.
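For example, a sketch of a rule that uses datatype to coerce http.status_code to an integer before comparison; the rule name and environment are illustrative.
[[prod.rule]]
name = "keep 500 errors regardless of status code type"
SampleRate = 1
[[prod.rule.condition]]
field = "http.status_code"
operator = "="
value = 500
datatype = "int"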
Condition operators:
exists - does the field exist
not-exists - does the field not exist
!= - is the value of the field not equal to the value in the rule
= - is the value of the field equivalent to the value in the rule
> - is the value of the field greater than the value in the rule
>= - is the value of the field greater than or equal to the value in the rule
< - is the value of the field less than the value in the rule
<= - is the value of the field less than or equal to the value in the rule
starts-with - returns true if the field starts with the string defined in the value
contains - returns true if the field contains the string defined in the value
does-not-contain - returns true if the field does not contain the string defined in the value
Notes about operators: value comparisons are type-sensitive. For example, some services may send http.status_code as an Integer, and some might return the value as a String. To account for this, two rules would be required: one rule that compares String values and one rule that compares Integer values.
Here are a few examples of how sampling decisions would be made according to the rules in the above configuration example:
If a trace contained a span with an http.route field that was equal to /health-check, then that trace would be dropped.
If a trace contained a span with a status_code field that was equal to 500 and another span with a duration_ms field less than 1000.789, then that trace would fall through to the last configured rule, and thus would be sampled at a rate of 1 out of 10.
If a trace contained a span with a status_code field that was equal to 500 and another span with a duration_ms field greater than 1000.789, then it would match the second rule and would be kept, because that rule has a SampleRate of 1.
If a trace contained a span with a status_code field of 200, then that trace would match the third rule and be delegated to the secondary EMADynamicSampler sampler to determine the sample rate.
If a trace contained a span with a status_code field of 400, then that trace would fall through to the last configured rule, and thus would be sampled at a rate of 1 out of 10.
Rules comparisons take the datatype of the fields into account.
In particular, a rule that compares status_code to 200 (an integer) will fail if the status code is actually "200" (a string), and vice-versa. If you are working in a mixed environment where either one may be included in the telemetry, you should create a separate rule for each case.
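For instance, a sketch of handling both representations with two explicit rules, rather than with the datatype parameter described above; the rule names and environment are illustrative.
# status_code arrives as an integer
[[prod.rule]]
name = "keep 500 errors (integer status_code)"
SampleRate = 1
[[prod.rule.condition]]
field = "status_code"
operator = "="
value = 500
# status_code arrives as a string
[[prod.rule]]
name = "keep 500 errors (string status_code)"
SampleRate = 1
[[prod.rule.condition]]
field = "status_code"
operator = "="
value = "500"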
A secondary sampler can be specified using the sampler option. You can leverage any DynamicSampler, EMADynamicSampler, or TotalThroughputSampler as a secondary sampler. You need to specify the desired sampler as part of the configuration option, then include configuration options for the desired sampler. All options for the desired secondary sampler will be available.
Using a secondary sampler combines the precision of rules-based sampling for capturing important events (for example, errors or long requests) with the flexibility of dynamic sampling for higher-volume traffic.
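As a sketch of this pattern, mirroring the structure of the rules examples above but with a TotalThroughputSampler as the secondary sampler; the rule, condition, and values are illustrative.
[[prod.rule]]
name = "cap throughput of successful responses"
[[prod.rule.condition]]
field = "status_code"
operator = "="
value = 200
[prod.rule.sampler.TotalThroughputSampler]
Sampler = "TotalThroughputSampler"
GoalThroughputPerSec = 100
FieldList = ["request.method"]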
This strategy attempts to meet a goal throughput rate of a fixed number of spans, not traces, per second per Refinery node.
This strategy is most useful if you need to quickly get event volume under control, or if your traces are fairly uniform and a consistent volume of events is preferred.
It performs poorly when the active keyspace is very large, so ideally the number of active keys should be less than 10*GoalThroughputPerSec. Sample rates are still calculated and set on the spans, but they are a function of the number of events seen for a key in a given window, as defined by ClearFrequencySec.
The TotalThroughputSampler configuration has the following fields:
GoalThroughputPerSec: The desired throughput, in spans per second per Refinery node. The default is 100, and the value must be greater than 0. Eligible for live reload.
ClearFrequencySec: The length, in seconds, of the window over which the sampler counts events for each key. The default is 30. Eligible for live reload.
FieldList: A list of the field names used to build the key that is handed to the sampler, as described for the dynamic samplers above. Eligible for live reload.
UseTraceLength: When set to true, this field adds the number of spans in the trace into the dynamic sampler as part of the key. The number of spans is exact, so if there are normally small variations in trace length, you may want to leave this off or set it to false. If traces are consistent lengths and changes in trace length is a useful indicator of traces that you would like to see in Honeycomb, set this to true. Eligible for live reload.
AddSampleRateKeyToTrace: When set to true, the sampler will add a field to the root span of the trace containing the key used by the sampler to decide the sample rate. This field can be helpful in understanding why the sampler is making certain decisions about sample rate and help you understand how to better choose the sample rate key (the FieldList setting above) to use.
AddSampleRateKeyToTraceField: The name of the field the sampler adds to the root span when AddSampleRateKeyToTrace is set to true.
If using the dataset-only data model, refer to the Honeycomb Classic tab for instructions. Not sure? Learn more about Honeycomb versus Honeycomb Classic.
Here is an example TotalThroughputSampler configuration applied to an environment:
[prod]
Sampler = "TotalThroughputSampler"
GoalThroughputPerSec = 500
ClearFrequencySec = 30
FieldList = ["http.status_code"]
UseTraceLength = false
AddSampleRateKeyToTrace = true
AddSampleRateKeyToTraceField = "meta.refinery.dynsampler_key"
It is not possible to use this sampler with different configuration values for different datasets within the same environment.
Here is an example TotalThroughputSampler configuration for a given dataset:
[audit-service] # the name of the dataset we are sampling
Sampler = "TotalThroughputSampler"
GoalThroughputPerSec = 500
ClearFrequencySec = 30
FieldList = ["http.status_code"]
UseTraceLength = false
AddSampleRateKeyToTrace = true
AddSampleRateKeyToTraceField = "meta.refinery.dynsampler_key"
Deterministic Sampler is the simplest sampling method. It is a static sample rate, choosing traces randomly to either keep or send (at the appropriate rate). It is not influenced by the contents of the trace.
For deterministic sampling, the only field to set is SampleRate in rules.toml. SampleRate indicates a ratio, where one sample trace is kept for every n traces seen. For example, a SampleRate of 30 will keep 1 out of every 30 traces.
The choice on whether to keep any specific trace is random, so the rate is approximate. Eligible for live reload.
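For example, configuring a hypothetical prod environment to keep roughly 1 out of every 30 traces deterministically would look like this:
[prod]
Sampler = "DeterministicSampler"
SampleRate = 30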
When getting started with Refinery or when updating sampling rules, it may be helpful to verify that the rules are working as expected before you start dropping traffic.
By enabling dry run mode, all spans in each trace will be marked with the sampling decision in a field called refinery_kept.
All traces will be sent to Honeycomb regardless of the sampling decision.
You can then run queries in Honeycomb on this field to check your results and verify that the rules are working as intended.
Enable dry run mode by adding DryRun = true in your configuration, as noted in rules_complete.toml.
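Based on the description above, this is a single flag; a minimal sketch follows, and rules_complete.toml shows its exact placement in a full file.
# Mark sampling decisions but send all traces to Honeycomb
DryRun = true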
When dry run mode is enabled, Refinery will set the meta.dryrun.sample_rate attribute on spans.
This attribute allows you to inspect what the sample rate will be without sampling your data.
When dry run mode is enabled, the metric trace_send_kept will increment for each trace, and the metric for trace_send_dropped will remain 0, reflecting that we are sending all traces to Honeycomb.
Refinery can send telemetry that includes information that can help debug the sampling decisions that are made.
To enable it, in the config file, set AddRuleReasonToTrace to true. Traces sent to Honeycomb will then include the field meta.refinery.reason.
This field contains text that indicates the rule that caused the trace to be included.
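Sketched as a setting in the main Refinery config file (not rules.toml):
# Adds meta.refinery.reason to traces sent to Honeycomb
AddRuleReasonToTrace = true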
It may also be helpful to use the “Usage Mode” version of the Query Builder to assess your sampling strategy. Since calculations in this mode do not correct for sample rates, you can check how many actual events match each category for a dynamic sampler.