Our Gatekeeper Tour shows what it’s like to debug a real production incident with Honeycomb. In this guide, we’ll introduce Gatekeeper, ingest its logs into Honeycomb, and follow a series of queries to track down the source of a real outage.
Gatekeeper is the name of our API server, which implements the /1/events/ endpoint of our public API. We have a few example repos (in Go, Ruby, and Python) that show off a real application with custom instrumentation.
Download the Gatekeeper logs, then see for yourself how easy it is to ingest JSON logs into Honeycomb. This tour recommends a specific series of interesting queries to track down the source of the outage. Along the way, please explore at will: this is real production data, with the sorts of wide rows and plentiful metadata that we use internally to answer questions day-to-day.
The example applications below were written to illustrate instrumentation and context propagation in a production application using our Beelines.
Be sure to take a whirl around their READMEs, in which the authors highlight some particularly useful instrumentation patterns.
- Gatekeeper (Go)
- Gatekeeper (Ruby)
- Gatekeeper (Python)
These repositories are also particularly useful as a reference while exploring the real-life JSON output or the live tour. grep for a particular string to track down custom attributes and see how context is pushed down to and made available in various helper functions.
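For example, here’s a rough sketch for hunting down where one of the timing fields from the tour data is set; it assumes you’ve cloned one of the example repos above and that the field name in the application code matches what you see in the events:
# Search a checked-out example repo for a custom timer field
grep -rn "get_schema_dur_ms" .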
Download the JSON file here (160MB gzipped). It contains a selection of real production data from our Gatekeeper instances and may be helpful for understanding the output of our example applications, as well as the raw data behind our live tour.
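If you just want a quick peek at the raw events before (or instead of) loading them, something like this works. It’s a sketch that assumes the extracted file is newline-delimited JSON named honeycomb-gatekeeper-tour.json (the name used in the honeytail command below) and that jq is installed:
# Pretty-print the first event in the extracted log file
head -n 1 honeycomb-gatekeeper-tour.json | jq .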
If you’d like to load this JSON directly into your own Honeycomb account, keep reading; if not, jump ahead to the next section.
honeytail is our recommended utility for importing a JSON logfile into Honeycomb. Install the latest version of honeytail by running:
# Download and install the AMD64 debian package
wget -q https://honeycomb.io/download/honeytail/v1.4.1/honeytail_1.4.1_amd64.deb && \
echo '3d77fd956b0990919b523f37ec98ac29f40bb0c6181db203678546d55bf4cf81 honeytail_1.4.1_amd64.deb' | sha256sum -c && \
sudo dpkg -i honeytail_1.4.1_amd64.deb
# Download and install the ARM64 debian package
wget -q https://honeycomb.io/download/honeytail/v1.4.1/honeytail_1.4.1_arm64.deb && \
echo '8e90d302724ad1f6f593ea3588e45a577c8d266b899912179627e17c8adc4a5f honeytail_1.4.1_arm64.deb' | sha256sum -c && \
sudo dpkg -i honeytail_1.4.1_arm64.deb
# Download and install the rpm package
wget -q https://honeycomb.io/download/honeytail/v1.4.1/honeytail-1.4.1-1.x86_64.rpm && \
echo '9ebc9ae106de13656e06ebfb2dc198ab6e8f0d525c5b16180aa6e80c5e5054c2 honeytail-1.4.1-1.x86_64.rpm' | sha256sum -c && \
sudo rpm -i honeytail-1.4.1-1.x86_64.rpm
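# Download the standalone Linux AMD64 binary and make it executable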
wget -q -O honeytail https://honeycomb.io/download/honeytail/v1.4.1/honeytail-linux-amd64 && \
echo '1998f4239e95fb423fc456667968c30d488a7c91a442fca6ac14b47f9fa70f1d honeytail' | sha256sum -c && \
chmod 755 ./honeytail
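# Download the standalone Linux ARM64 binary and make it executable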
wget -q -O honeytail https://honeycomb.io/download/honeytail/v1.4.1/honeytail-linux-arm64 && \
echo 'fad2346c7235837289dfaaa6d1797033e091684215c640cbe4ecf83d320347b0 honeytail' | sha256sum -c && \
chmod 755 ./honeytail
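# Download the standalone macOS (darwin) AMD64 binary and make it executable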
wget -q -O honeytail https://honeycomb.io/download/honeytail/v1.4.1/honeytail-darwin-amd64 && \
echo 'd5527d6f01fa8d5e8360b539d866a09615123f78178c89f431bd0d9f09cb3560 honeytail' | shasum -a 256 -c && \
chmod 755 ./honeytail
# Build from latest source after setting up go
git clone https://github.com/honeycombio/honeytail
cd honeytail; go install
The packages install honeytail, its config file /etc/honeytail/honeytail.conf, and some start scripts. Build honeytail from source if you need it in an unpackaged form or for ad-hoc use.
Once you’ve extracted the .json file, invoke honeytail with the --rebase_time argument to “rebase” its timestamps so that the data loads into the current time window. Your --writekey, or API key, is accessible on your account page.
# --rebase_time rebases the JSON file's timestamps onto the current time
# --presampled interprets the meta.sample_rate field as each event's sample rate
honeytail \
  --writekey=YOUR_API_KEY \
  --dataset="Gatekeeper Tour" \
  --file=./honeycomb-gatekeeper-tour.json \
  --parser=json \
  --backfill \
  --rebase_time \
  --presampled=meta.sample_rate
That’s it! The data should now live in Honeycomb and be available for querying.
Honeycomb can help locate system issues. This demo walks through a real outage on Honeycomb’s own servers; the data is an anonymized selection of real events from that incident.
Your first query
Your pager goes off: your end-to-end tests, which automatically exercise a key user flow, are timing out! Let’s see what the slowest 5% of users are experiencing by looking at the 95th percentile (P95) of request duration.
Click on the Visualize box, enter P95(duration_ms), and click “Run Query”.
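If you loaded the raw JSON yourself, you can sanity-check this number outside the UI. Here’s a rough jq sketch; it assumes the extracted file is newline-delimited JSON named honeycomb-gatekeeper-tour.json with duration_ms as a top-level field, it ignores sample rates, and it is slow and memory-hungry on the full file, so treat it as illustration rather than a faithful reproduction of Honeycomb’s result:
# Approximate, unweighted P95 of duration_ms across all events
jq -s '[.[].duration_ms | select(. != null)] | sort | .[(length * 0.95 | floor)]' honeycomb-gatekeeper-tour.json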
Group your graph into segments
This result looks terrible! Starting about 35 minutes into the data, and continuing for about half an hour, some queries are delayed by up to 20 minutes. That’s really not good. Does this affect everyone? Maybe it’s just one team that’s having the problem.
In the box labeled Group By, enter app.team.ID and click “Run Query”.
Zoom back out
Grouping by app.team.ID draws one curve per team; that’s a high-cardinality field. It looks like all the teams are affected by this problem. How is this affecting the amount of work our system does?
Click in the box labeled Group By and hit the blue [x] to remove the clause. Then, in the Visualize box, add COUNT.
Visualize the distribution
During this period, Gatekeeper throughput plummeted: these slow requests really messed up our system! One way to dive in a little further is to look at the distribution of the delay in more detail: are we seeing one population of “slow” and one of “fast,” or is it evenly spread?
In Visualize, replace the P95(duration_ms) with a HEATMAP(duration_ms).
Inspect the heatmap
A heatmap shows the distribution of events for each time interval. Each dot represents one or more events with that duration; brighter spots mean more events fell in that bucket. From here, we can pursue a few different explanations for the slowdown.
Validate a hypothesis
By looking at the code, we know that a request’s duration_ms consists of the time necessary to fetch the dataset’s schema, pick a Kafka partition, and write the event to Kafka. Each of those steps has its own timer.
In Visualize, add heatmaps for app.timer.get_schema_dur_ms, app.timer.get_partition_dur_ms, and app.timer.kafka_write_dur_ms.
Start exploring elevated app.timer.get_schema_dur_ms times
It looks like the problem lies almost entirely within app.timer.get_schema_dur_ms! That’s useful news: it seems isolated to our database read. Our servers are hosted in several different zones: is there a chance that one of them is tied to delays in talking to our database?
In Visualize, remove all fields except COUNT and app.timer.get_schema_dur_ms. Add meta.availability_zone to GROUP BY.
Hypothesis: are AWS availability zones correlated with increased DB times?
Hover your mouse over the different rows of the results: it’s pretty clear that all four zones are behaving the same way (and thus are not the problem). Let’s check another angle: hits on the database cache should be fast, but misses should be slow.
Remove meta.availability_zone from GROUP BY, and instead put in app.hitSchemaCache.
Hypothesis: elevated get_schema_dur_ms times should be tied to cache misses
Cache hits are fast; the triangle is made of cache misses getting slower and slower over time. Let’s filter to only the cache misses.
Click the context menu on the false row in the app.hitSchemaCache column to filter to only requests that missed the cache.
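You can see the same split in the raw data with a quick tally (a sketch that assumes app.hitSchemaCache is a top-level boolean in the newline-delimited JSON file):
# Count cache hits (true) versus misses (false)
jq -r '.["app.hitSchemaCache"]' honeycomb-gatekeeper-tour.json | sort | uniq -c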
Drill into a single high-cardinality value to isolate request patterns
When looking at the data as a whole, the points can smear over each other. Sometimes it’s easier to tell what’s going on if we focus on a single dataset. There are a lot of ways you can proceed: Honeycomb rewards multiple approaches. Here, we can group and filter on the dataset.
Add app.dataset.ID to GROUP BY, then click “Run Query”. Set the Where clause to just app.dataset.ID = 100011, which has an interesting HEATMAP(app.timer.get_schema_dur_ms) pattern.
Narrow the time window by dragging on the graph
There are visible diagonal stripes across the diagram. Weird: it seems like we might have a constant timeout somewhere, every five minutes or so. Reviewing the code, we see the cache falls through to the “database” every five minutes, and then seems to time out.
You can select a time period in the graph and press the magnifying glass to zoom in on that time span. Select the time period surrounding the first large spike in the COUNT graph (approximately 38-40 minutes into the data) and press the magnifying glass to zoom in.
View the Raw Data underneath the graph
With the subset of data selected, we can look directly at some rows to better understand what happens in them. There is nothing like looking at a sample of the underlying events to understand what’s going on.
Click “Raw Data” to explore them. (Look for the “Raw Data” button directly above the graph on the left.)
Check for errors
If we scroll over to the error and error_detail columns, we see a few rows showing error 1040. Even though error_detail could be a very high-cardinality column, we can still group by that string!
Don’t see the error field? “Raw Data” only shows a subset of all the data visualized by the graph. Try adding a Where clause of error exists to narrow in on just the data that have errors.
Add error and error_detail to the GROUP BY and click “Run Query”.
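You can reproduce this grouping against the raw file too, which is a nice way to convince yourself that grouping by a free-form string like error_detail is reasonable (a sketch assuming error and error_detail are top-level fields in the newline-delimited JSON file):
# Tally events by error code and detail string, most common first
jq -r 'select(.error != null) | "\(.error)\t\(.error_detail)"' honeycomb-gatekeeper-tour.json | sort | uniq -c | sort -rn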
Conclusion
The error 1040s are sporadic but consistent with a database that has failed. It looks like we’ve learned everything we can from this system. At 5:38, the database started processing fewer transactions and throughput slowed to a trickle. Caches started failing; queries timed out. The database came back after twenty minutes, and anything that hadn’t timed out was processed.
While the accessible data tells a specific story of ours, take this opportunity to really explore this dataset! Here are a few interesting questions to ask on our Gatekeeper Tour:
- … 2xx responses from our API?
- … “witty-sawfish”?