Our Gatekeeper Tour shows what it’s like to debug a real production incident with Honeycomb. In this guide, we’ll:
Gatekeeperis the name of our API server, which implements the
/1/events/endpoint of our public API. We have a few example repos (in Go, Ruby, and Python) that show off a real application with custom instrumentation.
Gatekeeperlogs, then see for yourself how easy it is to ingest JSON logs into Honeycomb.
This tour recommends a specific series of interesting queries to use in order to track down the source of the outage. Along the way, please explore at will—this is real production data, with the sorts of wide rows and plentiful metadata that we use internally to answer questions day-to-day.
The example applications below were written to illustrate instrumentation and context propagation in a production application using our Beelines.
Be sure to take a whirl around their READMEs, in which the authors highlight some particularly useful instrumentation patterns.
These repositories are also particularly useful as a reference while exploring the real-life JSON output or the live tour.
grep for a particular string to track down custom attributes and see how context is pushed down to and made available in various helper functions.
Download the JSON file here (160MB gzipped). It contains a selection of real production data from our
Gatekeeper instances and may be helpful for understanding the output of our example applications as well as understanding the raw data behind our live tour.
If you’d like to load this JSON directly into your own Honeycomb account, keep reading; if not, jump ahead to the next section.
honeytail is our recommended utility for importing a JSON logfile into Honeycomb.
Install the latest version of
honeytail by running:
wget -q https://honeycomb.io/download/honeytail/linux/honeytail_1.762_amd64.deb && \ echo 'd7bed8a005cbc6a34b232c54f0f84b945f0bb90905c67f85cceaedee9bbbad1e honeytail_1.762_amd64.deb' | sha256sum -c && \ sudo dpkg -i honeytail_1.762_amd64.deb
wget -q https://honeycomb.io/download/honeytail/linux/honeytail-1.762-1.x86_64.rpm && \ echo 'd8069b9d344fdabc722efae0d5bff02112c442fa5bfc9fda6b7001093b5578dd honeytail-1.762-1.x86_64.rpm' | sha256sum -c && \ sudo rpm -i honeytail-1.762-1.x86_64.rpm
wget -q -O honeytail https://honeycomb.io/download/honeytail/linux/1.762 && \ echo '00e24441316c7ae24665b1aaea4cbb77e2ee52c83397bf67d70f3ffe14a1e341 honeytail' | sha256sum -c && \ chmod 755 ./honeytail
The packages install
honeytail, its config file
/etc/honeytail/honeytail.conf, and some start scripts. The binary is just
honeytail, available if you need it in an unpackaged form or for ad-hoc use.
Once you’ve extracted the
.json file, invoke
honeytail with the
--rebase_time argument in order to “rebase” timestamps to load data into the current time window.
--writekey, or API key, is accessible on your account page.
honeytail \ --writekey=YOUR_API_KEY \ --dataset="Gatekeeper Tour" \ --file=./honeycomb-gatekeeper-tour.json \ --parser=json \ --backfill \ --rebase_time \ # "Rebase" the JSON file's timestamps onto the current time --presampled=meta.sample_rate # Interpret `meta.sample_rate` as the sample rate
That’s it! The data should now live in Honeycomb and be available for querying.
Honeycomb can help locate system issues. This demo walks through a real outage on the Honeycomb servers. This data is an anonymized selection of real events from our servers from that incident.
Your first query
Your pager goes off: your end-to-end tests, which automate and check for a key user flow, are timing out! Let’s see what the slowest 5% of users are experiencing by looking at the 95th percentile (P95).
Click on the
Calculate box, enter
P95(duration_ms), and click “Run Query”.
Break down your graph into segments
This result looks terrible! Starting about 35 minutes into the data, and continuing for about half an hour, some queries are delayed by up to 20 minutes. That’s really not good. Does this affect everyone? Maybe it’s just one team that’s having the problem.
In the box labelled
Break Down, enter
app.team.ID and click “Run Query”.
Zoom back out
Breaking down by
app.team.ID draws one curve per team. That’s a high cardinality field.
It looks like all the teams are affected by this problem. How is this affecting the amount of work our system does?
Click in the box labeled
Break Down and hit the blue [x] to remove the breakdown. Then, in the
Calculate box , add
Visualize the distribution
During this period,
Gatekeeper throughput plummeted: these slow requests really messed up our system! One way to dive in a little further is to look at the distribution of the delay in more detail: are we seeing one population of “slow” and one of “fast,” or is it evenly spread?
Calculate, replace the
P95(duration_ms) with a
Inspect the heatmap
A heatmap shows the distribution of events, for each time interval. Each dot represents one or more events with that duration; brighter spots mean more items fell in that bucket. We can pursue a few different explanations of why this might be.
Validate a hypothesis
By looking at the code, we know that a request’s
duration_ms consists of the time necessary to
1) get the schema for a dataset, 2) determine the partition to write to, and 3) write the payload. Do any of these three steps drive the increased latency we’re seeing?
Calculate, add heatmaps for
Start exploring elevated
It looks like the problem lies almost entirely within app.timer.get_schema_dur_ms! That’s useful news: it seems isolated to our database read. Our servers are hosted in several different zones: is there a chance that one of them is tied to delays in talking to our database?
Calculate, remove all fields except
Hypothesis: are AWS availability zones correlated with increased DB times?
By hovering your mouse over the different rows of the results, it’s pretty clear that all four zones are behaving the same way (and thus are not the problem). Let’s check another angle: hits on the database cache should be fast, but misses should be slow.
GROUP BY, and instead put in
get_schema_dur_ms times should be tied to cache misses
Cache hits are fast; the triangle is made of cache misses getting slower and slower over time. Let’s filter to only the cache misses.
Right click on the
false row in the
app.hitSchemaCache column to filter to only requests that missed the cache.
Drill into a single high-cardinality value to isolate request patterns
When looking at this larger dataset, the data points can smear over each other. Sometimes it’s easier to tell what’s going on if we focus on one dataset. There are a lot ways you can proceed: Honeycob rewards multiple approaches. We can group and filter on the dataset.
GROUP BY, then click “Run Query”. Filter to just
app.dataset.ID = 100011, which has an interesting
Narrow the time window by dragging on the graph
There are visible diagonal stripes across the diagram, weird—seems like we might have a constant timeout somewhere, every every five minutes or so. Reviewing the code, we see the cache falls through to the “database” every 5 minutes—and then seems to time out.
Select a time period in the data and press the magnifying glass to zoom in on a time span.
Select the time period surrounding the first large spike in the
COUNT graph (approximately 38-40 minutes into the data) and press the magnifying glass to zoom in.
View the Raw Data underneath the graph
With the subset of data selected, we can look directly at some rows to understand better what happens in them. There is nothing like looking at a sample of underlying events to understand what’s going on.
Click “Raw Data” to explore them. (Look for the “Raw Data” button directly above the graph on the left.)
If we scroll over to the
error_detail columns, we see a few rows are showing error 1040. Even though
error_detail could be a very high cardinality column, we can even group by that string!
Don’t see the
error field? “Raw Data” only shows a subset of all the data visualized by the graph. Try adding a Filter of
error exists to narrow in on just the data that have errors.
error_detail to the
GROUP BY and click “Run Query”.
The error 1040s are sporadic but consistent with a database that has failed. It looks like we know everything we can learn from this system. At 5:38, the database started processing fewer transactions, and throughput slowed to a trickle. Caches started failing; queries timed out. The database came back after twenty minutes, and anything that hadn’t timed out was processed.
While the accessible data tells a specific story of ours, take this opportunity to really explore this dataset! Here are a few different interesting questions to ask on our Gatekeeper Tour:
2xxresponses from our API?