We recommend that you follow certain best practices when creating alerts.
Triggers
- Use the Name and Description fields effectively. The Name field should tell you what the alert is; the Description field should tell you what to do about the alert. Links to internal wikis or runbooks are best.
- Use filters to improve the quality of your signal. If you are interested in latency, but have a long poll endpoint, use a filter to remove that endpoint from the calculation rather than adjusting the values of the threshold.
- To detect spikes in latency metrics, combine a filter with your cutoff (for example, `> 100ms`) with a `COUNT`. Your result will be the number of events that exceed your threshold.
- To ignore spikes in latency and trigger on overall performance, use the `P95` or `P99` calculations. These will be more representative of the majority of traffic than `AVG`, which can be polluted by large outliers.
- When detecting errors, allow good values instead of looking for bad values. For example, instead of building a filter of HTTP status codes `== 500`, use several filters to look for events that do not have status codes `200`, `301`, `302`, or `404`.
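The filter-plus-`COUNT` and allow-good-values patterns above can be sketched as query specifications. The dictionaries below loosely follow the general shape of a Honeycomb query spec (`calculations`, `filters`, `time_range`); the column names `duration_ms` and `http.status_code` are illustrative assumptions, so substitute whatever your dataset actually uses.

```python
# Sketch of two trigger-style query specs. The structure loosely mirrors a
# Honeycomb query spec; column names are assumptions for illustration only.

# Count events slower than the 100 ms cutoff (spike detection).
slow_request_query = {
    "calculations": [{"op": "COUNT"}],
    "filters": [
        {"column": "duration_ms", "op": ">", "value": 100},
    ],
    "time_range": 900,  # evaluate over the trailing 15 minutes
}

# Allow good values: count events whose status code is NOT one of the
# expected codes, rather than filtering for a specific bad code like 500.
unexpected_status_query = {
    "calculations": [{"op": "COUNT"}],
    "filters": [
        {"column": "http.status_code", "op": "not-in",
         "value": [200, 301, 302, 404]},
    ],
    "time_range": 900,
}
```

A trigger built on either query fires when the resulting count crosses your chosen threshold, so the threshold expresses "how many bad events are too many" rather than a latency value directly.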
SLOs
Some of these are general guidelines, and some are specific to alert type.

General Guidelines

Regardless of the alert type:
- Iterate when creating Burn Alerts. Start by sending alerts to an internal recipient (either a team member’s email address or a private Slack channel) to monitor the frequency of alerts in your system. Use these Burn Alerts as a first step toward understanding how your service performs and what kinds of alerts are actionable and important to your team, and then iterate.
- Start with the shape of the signal that you care about:
- For slow SLO burn, you care about issues that occur over a prolonged time period.
- For fast SLO burn, you care about significant spikes over a shorter time period.
- Use alerts to refine any new SLOs that you create.
For new SLOs, start with a Budget Rate alert, which will notify you when system conditions impact your budget, to learn:
- If you are missing any criteria in your SLI.
- If you can historically sustain your SLO.
Exhaustion Time Burn Alerts
When choosing the length of time for a given Exhaustion Time burn alert, consider the context and goals of your organization. Ask questions to help frame the definition of some initial Exhaustion Time burn alerts. If you are X hours away from running out of budget:
- Who would need to know?
- Via what method?
- What would they need to do?

For example, a 24-hour exhaustion time alert may only warrant a Slack notification, so the issue can be addressed during business hours (well before the exhaustion time reaches 0).
Alternatively, a 4-hour exhaustion time alert may be more urgent and require a pager notification, such as from PagerDuty.
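To make the "X hours away" framing concrete, here is a minimal sketch (not Honeycomb's implementation) of how an exhaustion time can be estimated from the remaining budget and the recent burn rate:

```python
def exhaustion_time_hours(budget_remaining_pct: float,
                          burn_pct_per_hour: float) -> float:
    """Estimate hours until the error budget reaches 0, assuming the
    recent burn rate continues unchanged."""
    if burn_pct_per_hour <= 0:
        return float("inf")  # budget is not being consumed
    return budget_remaining_pct / burn_pct_per_hour

# With 20% of the budget left and a burn of 5% per hour, exhaustion is
# 4 hours away: urgent enough for a pager per the guidance above.
print(exhaustion_time_hours(20.0, 5.0))  # 4.0
```

The answer to "who needs to know, and how" then follows from how many hours this estimate gives you to react.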
Budget Rate Burn Alerts
When starting with a Budget Rate burn alert, consider whether you seek an alert for a smooth, slow burn or a fast, abrupt drop. Start with a less-sensitive alert and adjust as needed. Depending on the length of your SLO’s time period, try these values when creating Budget Rate burn alerts.

30 Day SLO Example

Use the following example to create a series of Budget Rate burn alerts for your SLO. Each row represents an alert and its values.

| Budget Decrease (%) | Time Window | Notification Type |
|---|---|---|
| 2% | 1 hour | PagerDuty |
| 5% | 6 hours | PagerDuty |
| 10% | 3 days | Slack |
7 Day SLO Example

Use the following example to create a series of Budget Rate burn alerts for your SLO. Each row represents an alert and its values.

| Budget Decrease (%) | Time Window | Notification Type |
|---|---|---|
| 8.5% | 1 hour | PagerDuty |
| 21.5% | 6 hours | PagerDuty |
| 43.2% | 3 days | Slack |
| 50% | 3.5 days | Slack |
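The thresholds in both tables correspond to roughly the same burn-rate multipliers, where a multiplier of 1 means spending the budget at exactly the pace that exhausts it over the full SLO period. This quick check uses only the values from the tables above:

```python
def burn_rate(budget_decrease_pct: float, window_hours: float,
              slo_period_hours: float) -> float:
    """Multiplier over the 'steady' pace that would spend exactly 100%
    of the budget across the full SLO period."""
    steady_pct = 100.0 * window_hours / slo_period_hours
    return budget_decrease_pct / steady_pct

# 30-day SLO (720 hours)
print(round(burn_rate(2, 1, 720), 1))    # 14.4
print(round(burn_rate(5, 6, 720), 1))    # 6.0
print(round(burn_rate(10, 72, 720), 1))  # 1.0

# 7-day SLO (168 hours): similar multipliers
print(round(burn_rate(8.5, 1, 168), 1))    # 14.3
print(round(burn_rate(21.5, 6, 168), 1))   # 6.0
print(round(burn_rate(43.2, 72, 168), 1))  # 1.0
```

This is why the pager rows pair a small budget decrease with a short window (a very fast burn) while the Slack rows pair a larger decrease with a long window (a sustained burn near the steady pace).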
Use the Time Window to Determine the Notification Method
A long Time Window, such as 24 hours, is useful in detecting long, slow burns that use up your SLO budget faster than expected, but not fast enough to wake someone out of bed. A short Time Window, such as one hour, is useful in detecting very fast SLO budget burns that need to be addressed quickly. Use the time window to determine the alert method. For example:
- For a long, slow SLO budget decrease, send a Slack message, so the issue can be addressed during business hours.
- For a short, fast SLO budget decrease, send a critical PagerDuty notification, so it can be acted on immediately.
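The routing rule above can be sketched as a tiny function. The channel names and the 4-hour cutoff between "fast" and "slow" are illustrative assumptions, not Honeycomb defaults:

```python
def notification_channel(window_hours: float) -> str:
    """Route by time window: short windows catch fast burns that warrant a
    page; long windows catch slow burns that can wait for business hours.
    The 4-hour cutoff is an assumption for illustration."""
    return "pagerduty" if window_hours <= 4 else "slack"

print(notification_channel(1))   # pagerduty
print(notification_channel(24))  # slack
```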
Although it may be counterintuitive, a Budget Rate alert with a long time window will also activate on a short, fast burn. For example, if you have two Budget Rate burn alerts with the parameters:
- Notify when the Budget Decrease exceeds 25% over a 24 hour period
- Notify when the Budget Decrease exceeds 25% over a 1 hour period

then a 25% budget drop within a single hour triggers both alerts, because that hour is also part of the trailing 24-hour window.
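This behavior can be sketched with a hypothetical minute-by-minute budget series in which 25% of the budget disappears within the final hour:

```python
def decrease_over_window(budget_pct_series, window_minutes):
    """Budget decrease (in percentage points) over the trailing window.
    budget_pct_series: remaining budget sampled once per minute."""
    start = max(0, len(budget_pct_series) - window_minutes - 1)
    return budget_pct_series[start] - budget_pct_series[-1]

# 24 hours of samples: flat at 90% budget, then a 25-point drop in the
# final hour.
series = [90.0] * (23 * 60) + [90.0 - 25.0 * i / 60 for i in range(61)]

one_hour = decrease_over_window(series, 60)
one_day = decrease_over_window(series, 24 * 60)

# The fast burn shows up in BOTH windows, so both alerts fire.
print(one_hour >= 25, one_day >= 25)  # True True
```

The converse does not hold: a slow burn of 25% spread evenly across 24 hours trips only the 24-hour alert, since no single hour loses more than about 1% of the budget.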
Use Budget Decrease to Control Alert Frequency
To control the frequency of your Burn Alert and calibrate its sensitivity:
- Increase the Budget Decrease value if the alert is too noisy.
- Decrease the Budget Decrease value if the alert is too quiet.
Guidelines for when to use SLOs and Trigger Alerts
In highly dynamic systems, attention is a scarce resource. There are more signals than you can process, and large systems often have ongoing issues. Even with a team, it is impossible to handle everything. You cannot grasp the entire system, and you should not be expected to. SLOs and trigger alerts can help, but to make them most efficient, you should implement them with some general rules in mind.

General Guidelines
To manage your scarce attention effectively:
- Design thoughtful alerts: Create alerts that do not compete aggressively for your attention. They should only interrupt you for genuinely important matters, just as you would only interrupt a busy colleague for something that matters.
- Prioritize interrupts: Use monitoring and alerting systems to direct your attention when needed, without distracting you unnecessarily.
Prioritize users who will be directly impacted
Prioritize anyone who has a vested interest and who will be directly impacted. This almost always includes customers, and for development and deployment flows, it also includes other engineers. If your on-call structure already has one or two people responding for many services, setting an SLO for a specific microservice will not page anyone new: the same people on the same rotation still respond.

Prioritize issues that may surface bigger conversations
The second-order goal of an SLO is to serve as a guide when discussing how to allocate engineering effort. If you are starting to slip on your ability to properly serve your users, you need to discuss what type of work you prioritize. The best SLO candidates are issues that, if unresolved, may surface bigger conversations throughout your organization.

Limit pager notifications
Rather than notifying via pager for all alerts, consider perceived urgency and time of day. For a discussion of pager notification for specific alert types, see the Pager Notifications section.

Example Errors and Alert Types
Because SLOs and Triggers work similarly (both repeatedly check whether a threshold has been crossed, then warn), choosing when to use each can be difficult. The choice becomes more difficult because SLOs can encompass a wide range of “users”, including customers, coworkers, and other services. In the following table, we discuss the most common types of errors and our recommendations.

| Type | Description | Example | What should be used | Comment |
|---|---|---|---|---|
| Error Rates | Identify whether customer-visible interactions succeed. | Could the user access their data? | SLOs | Ignore failures outside of your control since you cannot fix them. Often combined with performance indicators. |
| Performance | Identify whether customer-visible interactions happen within a delay you judge to be acceptable. | Did you retrieve account history within 10 seconds? Are you processing incoming transactions within N milliseconds? | SLOs | Try to find measures that adjust to expected cost (for example, “we expect some complex queries to take longer”). Often combined with error rates. |
| Assertions and pre/post conditions | Identify whether some internal operations are taking place correctly based on checks you put in place. | Are dangling lockfiles present? Has the CronJob run? | Triggers | More checks make sense when shipping something new; they act like a production test. Over time, you may want to remove some and keep only checks that indicate “things are messed up if this no longer happens”. |
| Normalcy | Define some order of magnitude where you consider things to be “normal” and want to know when you stray. | Cost of lambda queries? High login activity? Too little inbound traffic? | Triggers | These are limited because they come from a pre-determined normative idea of what a customer should do, or from trying to figure out what that should be. As such, they tend to be brittle. If you can find a good weighing mechanism (for example, “customers in category X should never exceed spend Y”), then you can turn these into SLOs. |
| Saturation Thresholds | Identify issues that require significant ramp-up time to address to keep things safe before getting back to a stable level and issues that have fixes that are difficult to automate. In these cases, if you wait for end-user failures to show up, it’s usually too late to address the issue properly. | Data retention duration and recovery procedures? Connection limits to a database? Expired certificates? | Triggers | Turning these into SLOs requires significant effort and tends to be closely associated with automation of contracts. For example, you could decide, “we guarantee 30% headroom on connection counts to all database users” and turn it into an SLO, but it’s much more straightforward to run a trigger that checks connection counts at regular intervals. |