Sampling traces

Sampling your data is a great way to extend your retention and keep your data volume at a manageable size.

Many folks are curious about how sampling works with tracing, given that simply sampling 1/N requests at random will not guarantee that you retain all of the spans for a given trace. The story of how sampling and tracing fit together with Honeycomb is still evolving, but here are some thoughts on how to approach it.

Traditionally, the way traces are sampled is head-based sampling: when the root span is being processed, a random sampling decision is made (for example, if randint(10) == 0, the span will be sampled). If the root span is sampled, it gets sent and propagates that decision to its descendant spans, which follow suit, usually via something like an HTTP header (for example, X-B3-Sampled: 1). That way, all the spans for a particular trace are preserved. Our integrations do not support head-based sampling out of the box today, but you could implement such a system yourself.
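
To make the head-based mechanics concrete, here is a minimal Go sketch of that pattern, assuming a hypothetical downstream URL and the X-B3-Sampled header mentioned above; this is illustrative and not something the Beelines do for you out of the box:

    package main

    import (
        "fmt"
        "math/rand"
        "net/http"
    )

    const sampleRate = 10

    // callDownstream propagates the head-based decision to a child service via a
    // header, so the whole trace is either kept or dropped together.
    func callDownstream(sampled bool) {
        // "http://downstream.internal/work" is a placeholder URL for this sketch.
        req, _ := http.NewRequest("GET", "http://downstream.internal/work", nil)
        if sampled {
            req.Header.Set("X-B3-Sampled", "1")
        } else {
            req.Header.Set("X-B3-Sampled", "0")
        }
        // A real service would now send the request, e.g. http.DefaultClient.Do(req).
        fmt.Println("propagating X-B3-Sampled:", req.Header.Get("X-B3-Sampled"))
    }

    func main() {
        // The root service makes one random 1-in-10 decision for the whole trace.
        sampled := rand.Intn(sampleRate) == 0
        callDownstream(sampled)
    }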

Some of our integrations do support what we call deterministic sampling. In deterministic sampling, a hash is made of a specific field in the event/span, such as the request ID, and a decision to sample is made based on that hash and the intended sample rate. Hence, approximately the right number of traces will be selected, and the decision of whether to sample a given trace does not need to be propagated: actors can sample full traces whether they can communicate or not.
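
As a sketch of the idea (not the exact Beeline implementation), a deterministic sampler can hash the chosen field and keep the trace whenever the hash falls below a threshold derived from the sample rate. Any actor that sees the same trace ID computes the same answer, with no coordination required:

    package main

    import (
        "crypto/sha1"
        "encoding/binary"
        "fmt"
        "math"
    )

    // shouldSample keeps roughly 1 in sampleRate traces by hashing the trace ID.
    // Every service that sees the same trace ID reaches the same decision.
    func shouldSample(traceID string, sampleRate uint32) bool {
        if sampleRate <= 1 {
            return true
        }
        sum := sha1.Sum([]byte(traceID))
        // Treat the first 4 bytes of the hash as a uniformly distributed number.
        v := binary.BigEndian.Uint32(sum[:4])
        return v < math.MaxUint32/sampleRate
    }

    func main() {
        // Same trace ID, same answer, on every host.
        fmt.Println(shouldSample("trace-abc-123", 10))
    }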

There is another option: tail-based sampling, where sampling decisions are made once the full trace information has been gathered. This ensures that if an error or slowness happens far down the tree of service calls, the full set of events for that trace is more likely to get sampled in. To use this method of sampling, all spans must first be collected in a buffer.

Client-based Sampling (at send time)

Deterministic Sampling

The Honeycomb Beelines provide out-of-box support for deterministic sampling of traces. By default, when a sampling rate is configured, the trace ID of each trace-enabled event (span) will be used to compute whether the event should be kept. All events that share the same trace ID will receive the same sampling decision. If you propagate trace context to other services, events originating from those services will also receive the same sampling decision if they are instrumented with a Beeline.

    beeline.Init(beeline.Config{
        WriteKey:   "YOUR_API_KEY",
        Dataset:    "MyGoApp",
        Debug:      true,
        SampleRate: 10,
    })

For more information, see the Go Beeline docs.

require("honeycomb-beeline")({
  writeKey: "YOUR_API_KEY",
  dataset: "my-dataset-name",
  // deterministic sampling enabled at a rate of 10
  // i.e. keep 1 in every 10 traces
  sampleRate: 10
  /* ... additional optional configuration ... */
});

For more information, see the Node.js Beeline docs.

import io.honeycomb.beeline.tracing.Beeline;
import io.honeycomb.beeline.tracing.Span;
import io.honeycomb.beeline.tracing.SpanBuilderFactory;
import io.honeycomb.beeline.tracing.SpanPostProcessor;
import io.honeycomb.beeline.tracing.Tracer;
import io.honeycomb.beeline.tracing.Tracing;
import io.honeycomb.beeline.tracing.sampling.Sampling;
import io.honeycomb.libhoney.HoneyClient;
import io.honeycomb.libhoney.LibHoney;

public class TracerSpans {
    private static final String WRITE_KEY = "test-write-key";
    private static final String DATASET = "test-dataset";

    private static final HoneyClient client;
    private static final Beeline beeline;

    static {
        client = LibHoney.create(LibHoney.options().setDataset(DATASET).setWriteKey(WRITE_KEY).build());
        // deterministic sampling enabled at a rate of 10
        // i.e. keep one in 10 traces
        SpanPostProcessor postProcessor = Tracing.createSpanProcessor(client, Sampling.deterministicSampler(10));
        SpanBuilderFactory factory      = Tracing.createSpanBuilderFactory(postProcessor, Sampling.deterministicSampler(10));
        Tracer tracer                   = Tracing.createTracer(factory);
        beeline                         = Tracing.createBeeline(tracer, factory);
    }
}

For more information (for example, using the Beeline with Spring), see the Java Beeline docs.

beeline.init(
   writekey='YOUR_API_KEY',
   dataset='my-app',
   service_name='my-app',
   debug=True,
   # deterministic sampling enabled at a rate of 10
   # i.e. keep one in 10 traces
   sample_rate=10,
)

For more information, see the Python Beeline docs.

require 'honeycomb-beeline'

Honeycomb.init(
  # deterministic sampling enabled at a rate of 10
  # i.e. keep one in 10 traces
  sample_rate: 10
)

For more information, see the Ruby Beeline docs.

Honeycomb Refinery (at ingestion time)

Sampling at the client has several benefits. Sending less traffic reduces the amount of Honeycomb ingestion you need to pay for, and reduces the CPU and network resources spent sending events. It is limited, however, because sampling decisions cannot be made based on the content of the overall trace: for example, you cannot decide to keep a trace only after discovering that one of its spans recorded an error or unusually high latency. Scenarios like these benefit from smarter sampling.

With Honeycomb Refinery, you can apply sampling strategies to entire traces rather than individual events. This feature is currently in closed beta - please email solutions@honeycomb.io if you would like to know more.

How it works

Refinery is enabled on a dataset, with one of two possible refinery strategies (see below). After enabling Refinery, Honeycomb servers begin buffering spans (events with a Trace ID). As events come in, they are grouped by Trace ID. When the root span arrives (or a timeout occurs), the complete trace is assembled from the available spans, and the selected strategy is applied. Refinery makes a sampling decision based on the data available at the time and the specified configuration. This sampling decision is recorded for a period of time, allowing spans that arrive after the sampling decision is made to also receive the same sampling decision.
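
The Go sketch below illustrates that buffer-and-decide flow under simplified assumptions (no timeout handling, and a toy "keep any trace containing an error" strategy); it is not Refinery's actual implementation:

    package main

    import "fmt"

    // Span is a minimal stand-in for a tracing event in this sketch.
    type Span struct {
        TraceID  string
        ParentID string // empty for the root span
        Error    bool
    }

    // traceBuffer groups spans by trace ID, decides when the root span arrives,
    // and caches the decision so late-arriving spans are treated the same way.
    type traceBuffer struct {
        pending   map[string][]Span
        decisions map[string]bool
    }

    func newTraceBuffer() *traceBuffer {
        return &traceBuffer{pending: map[string][]Span{}, decisions: map[string]bool{}}
    }

    func (b *traceBuffer) Add(s Span) {
        if keep, decided := b.decisions[s.TraceID]; decided {
            fmt.Printf("late span for %s reuses decision keep=%v\n", s.TraceID, keep)
            return
        }
        b.pending[s.TraceID] = append(b.pending[s.TraceID], s)
        if s.ParentID == "" { // root span arrived: assemble the trace and decide
            b.decide(s.TraceID)
        }
    }

    func (b *traceBuffer) decide(traceID string) {
        trace := b.pending[traceID]
        keep := false
        for _, s := range trace {
            if s.Error { // toy strategy: keep any trace containing an error
                keep = true
            }
        }
        b.decisions[traceID] = keep
        delete(b.pending, traceID)
        fmt.Printf("trace %s decided: keep=%v (%d spans buffered)\n", traceID, keep, len(trace))
    }

    func main() {
        b := newTraceBuffer()
        b.Add(Span{TraceID: "t1", ParentID: "root", Error: true}) // child arrives first
        b.Add(Span{TraceID: "t1"})                                // root arrives, decision is made
        b.Add(Span{TraceID: "t1", ParentID: "root"})              // late span gets the same decision
    }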

Dynamic Refinery

This strategy delivers a given sample rate, weighting rare traffic and frequent traffic differently so as to end up with the correct average. Frequent traffic is sampled more heavily, while rarer events are kept or sampled at a lower rate. This is the strategy you want to use if you are concerned about keeping high-resolution data about unusual events while maintaining a representative sample of your application’s behavior.

To see how this differs from random sampling in practice, consider a simple web service whose traffic is overwhelmingly healthy 200 responses, with a much smaller number of 50x errors mixed in.

If we sample events randomly, we can still see these broad characteristics. We can do aggregate analysis, such as the average duration of an event, broken down on fields like status code, endpoint, customer_id, and so on. At a high level, we can still learn a lot about our data from a completely random sample. But what about those 50x errors? Typically, we’d like to look at these in high resolution: they might all have different causes, or affect only a subset of customers. Discarding them at the same rate that we discard events describing healthy traffic is unfortunate, because they are much more interesting! Here’s where dynamic sampling can help.

With Dynamic Refinery, you set a target sample rate. Dynamic Refinery will try to maintain that rate, but will adjust the sample rate of individual traces and events based on their frequency. To achieve the target sample rate, it will increase sampling on common events, while lowering the sample rate for less common events, all the way down to 1 (keeping unique events).
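
To make that adjustment concrete, here is an illustrative Go sketch (not Refinery’s actual algorithm) that assigns each category of traffic a sample rate proportional to its frequency, flooring rare categories at a rate of 1 so they are always kept:

    package main

    import "fmt"

    // computeRates gives frequent keys a high sample rate and rare keys a rate of 1,
    // trading some resolution on common traffic for full resolution on rare traffic.
    func computeRates(counts map[string]int, goalRate float64) map[string]int {
        total := 0
        for _, c := range counts {
            total += c
        }
        avg := float64(total) / float64(len(counts)) // average events per key this window
        rates := make(map[string]int, len(counts))
        for key, c := range counts {
            r := int(goalRate * float64(c) / avg) // frequent keys get higher rates
            if r < 1 {
                r = 1 // rare keys are kept in full
            }
            rates[key] = r
        }
        return rates
    }

    func main() {
        // Hypothetical per-key counts observed during one window.
        counts := map[string]int{
            "status_code=200": 9900, // common, healthy traffic: sampled heavily
            "status_code=404": 90,
            "status_code=503": 10, // rare errors: kept
        }
        for key, rate := range computeRates(counts, 10) {
            fmt.Printf("%s -> keep 1 in %d\n", key, rate)
        }
    }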

Let’s look at how random sampling vs dynamic sampling compare for our hypothetical web app:

Dynamic Sampling Comparison - Ingest

As the illustration above demonstrates, random sampling will result in the rarer 50x events getting tossed out at the same rate as the much more common 200 traffic. With dynamic sampling, we sample the common traffic at a higher rate, while preserving more of the 50x events. Effectively, we’re trading some high-resolution data for common events in exchange for high-resolution data on rarer events.

Now what happens when you look at them in aggregate?

Dynamic Sampling Comparison - Query

Because the Honeycomb query engine adjusts for sample rate when rendering graphs, we’re able to do the same aggregate analyses of the dataset that we could do before.

Configuring

To enable Dynamic Refinery, click on the Refinery tab in your Dataset’s Settings page.

Dataset Settings - Sampling Tab

Check the box labeled Enable Refinery for this dataset, then choose Dynamic Refinery. Next, choose a target sample rate, which is the rate the sampler will try to maintain. For example, if you choose 10, Refinery will adjust sample rates dynamically so that roughly 1 in 10 traces is kept.

Dynamic Refinery Configuration

For Fields Sampled, you will need to think about your dataset’s schema and which fields help categorize your traffic. In our example, we used HTTP status code, but you can supply multiple keys. HTTP status code is interesting, but what if you want to keep events describing the HTTP status codes encountered by individual customers? You could add your customer ID field as well - Dynamic Refinery accepts up to 10 fields. Think about which dimensions make your traffic interesting: request path, build version, error code, etc.

Choosing continuous fields like duration_ms, or arbitrary, unique values like a request ID, will effectively cause every event and trace to be kept, since all traffic will be considered rare. As a rule, if you wouldn’t do a group-by/break-down on it, it is likely not a good candidate for the Dynamic Refinery strategy.
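
To see why, it helps to picture how a sampling key might be assembled from the chosen fields; the sketch below is illustrative (the separator and field names are assumptions, not Refinery’s exact format):

    package main

    import (
        "fmt"
        "strings"
    )

    // samplingKey concatenates the values of the configured fields; each distinct
    // key gets its own dynamically adjusted sample rate.
    func samplingKey(event map[string]interface{}, fields []string) string {
        parts := make([]string, 0, len(fields))
        for _, f := range fields {
            parts = append(parts, fmt.Sprintf("%v", event[f]))
        }
        return strings.Join(parts, ",")
    }

    func main() {
        event := map[string]interface{}{
            "status_code": 500,
            "customer_id": "cust_42",
            "duration_ms": 231.7,
        }
        // Low-cardinality fields produce a manageable number of keys.
        fmt.Println(samplingKey(event, []string{"status_code", "customer_id"})) // "500,cust_42"
        // A continuous field like duration_ms makes nearly every key unique, so all
        // traffic looks rare and ends up being kept.
        fmt.Println(samplingKey(event, []string{"status_code", "duration_ms"})) // "500,231.7"
    }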

By Example

A large proportion of your traffic is dominated by a few customers, but you also have many smaller customers that you’d like to keep data from. Simple random sampling causes data to be dropped for these smaller customers, and you don’t want to spend lots of money storing all event data just to keep high resolution data about the smaller customers. Dynamic sampling can help here by adjusting the sample rate to be proportional to the traffic coming in from each customer. Here, the field that differentiates traffic is the customer_id field, so we’ll set a dynamic sampling policy with that field.

Dynamic Sampler Example 1

After committing the configuration, we can look at Usage Mode to see the new sampling rates take effect. We can visualize COUNT, HEATMAP(Sample Rate), and AVG(Sample Rate), breaking down on customer_id to observe how sample rates are adjusted proportionally to traffic.

Dynamic Sampler Example 2

Later, you might think: I also care about individual customer errors! Sampling customer traffic randomly, even if at different rates, could still result in you missing data you care about, like a rare 500 error. Let’s add status_code as another field in our Dynamic Refinery key.

Dynamic Sampler Example 3

You may have noticed the Add Dynamic Sampling Key box we’ve checked. When checked, this adds a new field to your dataset - meta.dynamic_sampling_key - which shows you the key value used when sampling the traffic. This key is assembled based on the fields you’ve selected in the Dynamic Sampling Strategy configuration. You can use this to understand which sampling rates are being used for each category of traffic.

Dynamic Sampler Example 4

Rule-based Refinery

The Rule-based Refinery strategy allows you to define sampling rates explicitly based on the contents of your traces. Using a filter language that is similar to what you see when running queries, you can define conditions on fields across all spans in your trace. For instance, if your root span has a status_code field, and the span wrapping your database call has an error field, you can define a condition that must be met on both fields, even though the two fields live on separate events. You can supply a sample rate to use when a match is found, or optionally drop all events in that category. The example at the end of this section shows the kinds of rules you might want to specify.

Rules are evaluated in order, and the first match is used. For this reason, define more specific rules at the top of the list of rules, and broader rules at the bottom. If no rules match, a configurable default sampling rate is applied.
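
The Go sketch below illustrates that first-match-wins evaluation with two hypothetical rules; the rule representation here is illustrative, not Refinery’s actual filter language or configuration format:

    package main

    import "fmt"

    // rule pairs a condition over the assembled trace with the sample rate to apply;
    // in this sketch a rate of 0 means "drop the trace entirely".
    type rule struct {
        name       string
        matches    func(trace []map[string]interface{}) bool
        sampleRate int
    }

    // pickRule walks the rules in order and returns the first match, falling back to
    // a default sample rate when nothing matches.
    func pickRule(trace []map[string]interface{}, rules []rule, defaultRate int) (string, int) {
        for _, r := range rules {
            if r.matches(trace) {
                return r.name, r.sampleRate
            }
        }
        return "default", defaultRate
    }

    func main() {
        rules := []rule{
            {
                name: "keep every trace containing a 5xx span",
                matches: func(trace []map[string]interface{}) bool {
                    for _, span := range trace {
                        if code, ok := span["status_code"].(int); ok && code >= 500 {
                            return true
                        }
                    }
                    return false
                },
                sampleRate: 1,
            },
            {
                name: "drop health checks",
                matches: func(trace []map[string]interface{}) bool {
                    for _, span := range trace {
                        if span["request.path"] == "/healthz" {
                            return true
                        }
                    }
                    return false
                },
                sampleRate: 0,
            },
        }
        trace := []map[string]interface{}{
            {"status_code": 503, "request.path": "/api/orders"},
        }
        name, rate := pickRule(trace, rules, 10)
        fmt.Printf("matched %q, sample rate %d\n", name, rate)
    }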

By Example

You have a large amount of event data, and want to reduce your ingestion by 90%. You also have some specific types of data that you don’t need at all, and some that you absolutely do not want to miss:

Rule-based Sampler Example 1

Limitations