Gatekeeper Tour

Our Gatekeeper Tour shows what it’s like to debug a real production incident with Honeycomb. In this guide, we’ll work through that incident step by step.

This tour recommends a specific series of queries to track down the source of the outage. Along the way, please explore at will: this is real production data, with the sorts of wide rows and plentiful metadata that we use internally to answer questions day to day.

Start with your service

The example applications below were written to illustrate instrumentation and context propagation in a production application using our Beelines.

Be sure to take a whirl around their READMEs, in which the authors highlight some particularly useful instrumentation patterns.

These repositories are also particularly useful as a reference while exploring the real-life JSON output or the live tour. grep for a particular string to track down custom attributes and see how context is pushed down to and made available in various helper functions.
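
If your service uses the Python Beeline, the pattern those repositories demonstrate looks roughly like the sketch below. This is a minimal, illustrative sketch rather than a copy of the example apps: the handler and its arguments are invented, though field names like app.team.ID and app.dataset.ID do appear in the tour data. See the Beeline README for the authoritative API.

import beeline

beeline.init(
    writekey="YOUR_API_KEY",
    dataset="gatekeeper",        # illustrative dataset name
    service_name="gatekeeper",
)

def get_schema(dataset_id):
    # A traced helper: the span timing ends up as a duration field on the
    # event, much like app.timer.get_schema_dur_ms in the tour data.
    with beeline.tracer(name="get_schema"):
        beeline.add_context_field("app.dataset.ID", dataset_id)
        # ... fetch the schema here ...

def handle_request(team_id, dataset_id):
    # Invented request handler; in the example apps this context comes from
    # real request middleware.
    with beeline.tracer(name="handle_request"):
        beeline.add_context_field("app.team.ID", team_id)
        get_schema(dataset_id)

beeline.close()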

Work with the JSON output

Download the JSON file here (160MB gzipped). It contains a selection of real production data from our Gatekeeper instances and may be helpful for understanding the output of our example applications as well as understanding the raw data behind our live tour.
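
If you’d rather poke at the export locally first, it can be read straight into a pandas DataFrame. This is a rough sketch that assumes the extracted file is newline-delimited JSON (one event per line) with flat, dot-separated field names such as duration_ms and meta.sample_rate; adjust it if the file you download is shaped differently.

import pandas as pd

# Assumption: the extracted export is newline-delimited JSON, one event per line.
events = pd.read_json("honeycomb-gatekeeper-tour.json", lines=True)

print(len(events), "events")
print(sorted(events.columns)[:20])   # peek at some of the field names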

If you’d like to load this JSON directly into your own Honeycomb account, keep reading; if not, jump ahead to the next section.

honeytail is our recommended utility for importing a JSON logfile into Honeycomb.

Install the latest version of honeytail by running:

wget -q https://honeycomb.io/download/honeytail/linux/honeytail_1.733_amd64.deb && \
      echo 'bd135df2accd04d37df31aa4f83bd70227666690143829e2866de8086a1491d2  honeytail_1.733_amd64.deb' | sha256sum -c && \
      sudo dpkg -i honeytail_1.733_amd64.deb

The package installs honeytail, its config file /etc/honeytail/honeytail.conf, and some start scripts. The binary itself is just honeytail, available in unpackaged form for ad-hoc use if you need it.

Once you’ve extracted the .json file, invoke honeytail with the --rebase_time flag, which “rebases” the file’s timestamps onto the current time window, and with --presampled=meta.sample_rate, which tells honeytail to interpret meta.sample_rate as each event’s sample rate.

Your --writekey, or API key, is accessible on your account page.

honeytail \
  --writekey=YOUR_API_KEY \
  --dataset="Gatekeeper Tour" \
  --file=./honeycomb-gatekeeper-tour.json \
  --parser=json \
  --backfill \
  --rebase_time \
  --presampled=meta.sample_rate

That’s it! The data should now live in Honeycomb and be available for querying.

Start exploring

Honeycomb can help you locate system issues. This demo walks through a real outage on Honeycomb’s own servers; the data is an anonymized selection of events captured during that incident.

Launch the Dataset

Your first query

Your pager goes off: your end-to-end tests, which continuously exercise a key user flow, are timing out! Let’s see what the slowest 5% of users are experiencing by looking at the 95th percentile (P95).

Click on the Calculate box, enter P95(duration_ms), and click “Run Query”.
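
If you’re following along against the raw JSON export instead of the UI, a rough pandas equivalent of this query might look like the sketch below. The name of the timestamp field is an assumption; substitute whatever the export actually uses.

import pandas as pd

events = pd.read_json("honeycomb-gatekeeper-tour.json", lines=True)
events["time"] = pd.to_datetime(events["time"])   # assumption: "time" holds the event timestamp

# P95(duration_ms) in one-minute buckets, roughly what the Calculate box draws.
p95 = events.set_index("time")["duration_ms"].resample("1min").quantile(0.95)
print(p95.head())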

Break down your graph into segments

This result looks terrible! Starting about 35 minutes into the data, and continuing for about half an hour, some requests are delayed by up to 20 minutes. That’s really not good. Does this affect everyone? Maybe it’s just one team that’s having the problem.

In the box labeled Break Down, enter app.team.ID and click “Run Query”.
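
Against the raw export, a break down is a groupby. A hedged pandas sketch, with the same timestamp-field assumption as before:

import pandas as pd

events = pd.read_json("honeycomb-gatekeeper-tour.json", lines=True)
events["time"] = pd.to_datetime(events["time"])   # assumption: "time" holds the event timestamp

# One P95(duration_ms) series per team, in one-minute buckets.
per_team = (
    events.set_index("time")
          .groupby("app.team.ID")["duration_ms"]
          .resample("1min")
          .quantile(0.95)
)
print(per_team.head())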

Zoom back out

Breaking down by app.team.ID, a high-cardinality field, draws one curve per team. It looks like all the teams are affected by this problem. How is this affecting the amount of work our system does?

Click in the box labeled Break Down and hit the blue [x] to remove the breakdown. Then, in the Calculate box, add COUNT.

Visualize the distribution

During this period, Gatekeeper throughput plummeted: these slow requests really messed up our system! One way to dive in a little further is to look at the distribution of the delay in more detail: are we seeing one population of “slow” and one of “fast,” or is it evenly spread?

In Calculate, replace the P95(duration_ms) with a HEATMAP(duration_ms).
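
To approximate a heatmap against the raw export, a hedged pandas sketch is to bucket events by time and by duration and count each cell:

import pandas as pd

events = pd.read_json("honeycomb-gatekeeper-tour.json", lines=True)
events["time"] = pd.to_datetime(events["time"])   # assumption: "time" holds the event timestamp

# Rows are one-minute time buckets, columns are duration buckets,
# and each cell counts the events that landed there.
heat = pd.crosstab(
    events["time"].dt.floor("1min"),
    pd.cut(events["duration_ms"], bins=50),
)
print(heat.iloc[:5, :5])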

Inspect the heatmap

A heatmap shows the distribution of events for each time interval. Each dot represents one or more events with that duration; brighter spots mean more events fell in that bucket. There are a few different explanations we could pursue from here.

Validate a hypothesis

By looking at the code, we know that a request’s duration_ms consists of the time necessary to 1) get the schema for a dataset, 2) determine the partition to write to, and 3) write the payload. Do any of these three steps drive the increased latency we’re seeing?

In Calculate, add heatmaps for app.timer.get_schema_dur_ms, app.timer.get_partition_dur_ms, and app.timer.kafka_write_dur_ms.
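
Against the raw export, a quick (hedged) way to make the same comparison is to compute P95 for all three timer fields at once and see which one tracks the overall spike:

import pandas as pd

events = pd.read_json("honeycomb-gatekeeper-tour.json", lines=True)
events["time"] = pd.to_datetime(events["time"])   # assumption: "time" holds the event timestamp

timers = [
    "app.timer.get_schema_dur_ms",
    "app.timer.get_partition_dur_ms",
    "app.timer.kafka_write_dur_ms",
]
# P95 of each timer in one-minute buckets.
p95_timers = events.set_index("time")[timers].resample("1min").quantile(0.95)
print(p95_timers.head())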

Start exploring elevated app.timer.get_schema_dur_ms times

It looks like the problem lies almost entirely within app.timer.get_schema_dur_ms! That’s useful news: it seems isolated to our database read. Our servers are hosted in several different zones: is there a chance that one of them is tied to delays in talking to our database?

In Calculate, remove all fields except COUNT and app.timer.get_schema_dur_ms. Add meta.availability_zone to BREAK DOWN.

Hypothesis: are AWS availability zones correlated with increased DB times?

Hover your mouse over the different rows of the results, and it’s pretty clear that all four zones are behaving the same way (and thus the zone is not the problem). Let’s check another angle: hits on the database cache should be fast, but misses should be slow.

Remove meta.availability_zone from BREAK DOWN, and instead put in app.hitSchemaCache.

Hypothesis: elevated get_schema_dur_ms times should be tied to cache misses

Cache hits are fast; the triangle is made of cache misses getting slower and slower over time. Let’s filter to only the cache misses.

Right click on the false row in the app.hitSchemaCache column to filter to only requests that missed the cache.
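
On the raw export, the hit/miss comparison and the filter look roughly like this. It assumes app.hitSchemaCache is stored as a boolean; if it is the string “false”, compare against that instead.

import pandas as pd

events = pd.read_json("honeycomb-gatekeeper-tour.json", lines=True)

# Compare schema-lookup latency for cache hits vs. misses...
by_cache = events.groupby("app.hitSchemaCache")["app.timer.get_schema_dur_ms"].quantile(0.95)
print(by_cache)

# ...then keep only the requests that missed the cache.
misses = events[events["app.hitSchemaCache"].eq(False)]
print(len(misses), "cache misses")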

Drill into a single high-cardinality value to isolate request patterns

When looking at this larger dataset, the data points can smear over each other. Sometimes it’s easier to tell what’s going on if we focus on a single dataset. There are a lot of ways you can proceed: Honeycomb rewards multiple approaches. We can break down, then filter on the dataset.

Add app.dataset.ID to BREAK DOWN, then click “Run Query”. Filter to just app.dataset.ID = 100011, which has an interesting HEATMAP(app.timer.get_schema_dur_ms) pattern.

Narrow the time window by dragging on the graph

There are visible diagonal stripes across the diagram. Weird: it seems like we might have a constant timeout somewhere, every five minutes or so. Reviewing the code, we see the cache falls through to the “database” every five minutes, and then seems to time out.

Select the time period surrounding the first large spike in the COUNT graph (approximately 38-40 minutes into the data) and press the magnifying glass to zoom in.
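
Against the raw export, zooming in is just a filter on the timestamp. A hedged sketch for a window roughly 38-40 minutes after the first event in the file:

import pandas as pd

events = pd.read_json("honeycomb-gatekeeper-tour.json", lines=True)
events["time"] = pd.to_datetime(events["time"])   # assumption: "time" holds the event timestamp

# Approximately 38-40 minutes after the first event in the file.
start = events["time"].min() + pd.Timedelta(minutes=38)
end = start + pd.Timedelta(minutes=2)
window = events[(events["time"] >= start) & (events["time"] < end)]
print(len(window), "events in the zoomed window")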

View the Raw Data underneath the graph

With this subset of data selected, we can look directly at individual rows; there is nothing like a sample of the underlying events for understanding what’s really going on.

Click “Raw Data” to explore them. (Look for the “Raw Data” button directly above the graph on the left.)

Check for errors

If we scroll over to the error and error_detail columns, we see that a few rows are showing error 1040. Even though error_detail could be a very high-cardinality column, we can still break down by that string!

Add error and error_detail to the BREAK DOWN and click “Run Query”.
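
On the raw export, breaking down by the error columns is a groupby over the errored rows (field names taken from the tour data):

import pandas as pd

events = pd.read_json("honeycomb-gatekeeper-tour.json", lines=True)

# Count events per (error, error_detail) pair; rows with no error are dropped
# because groupby ignores missing keys by default.
errors = (
    events.groupby(["error", "error_detail"])
          .size()
          .sort_values(ascending=False)
)
print(errors.head(10))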

Conclusion

The error 1040s are sporadic but consistent with a database that has failed. It looks like we’ve learned everything we can from this system: at 5:38, the database started processing fewer transactions, and throughput slowed to a trickle. Caches started failing; queries timed out. The database came back after twenty minutes, and anything that hadn’t timed out was processed.

Go beyond the story

While the data above tells one specific story of ours, take this opportunity to really explore the dataset! Here are a few interesting questions to ask on our Gatekeeper Tour: