Best Practices for Alerts

We recommend that you follow certain best practices when creating alerts.

Triggers 

  • Use the Name and Description fields effectively. The Name field should tell you what the alert is; the Description field should tell you what to do about the alert. Links to internal wikis or runbooks are best.
  • Use filters to improve the quality of your signal. If you are interested in latency, but have a long poll endpoint, use a filter to remove that endpoint from the calculation rather than adjusting the values of the threshold.
  • To detect spikes in latency metrics, combine a filter at your cutoff (for example, duration > 100ms) with a COUNT calculation. The result is the number of events that exceed your threshold (see the sketch after this list).
  • To ignore spikes in latency and trigger on overall performance, use the P95 or P99 calculations. These will be more representative of the majority of traffic than AVG, which can be polluted by large outliers.
  • When detecting errors, allow known-good values instead of looking for specific bad values. For example, instead of building a filter for HTTP status code == 500, use filters that look for events whose status code is not 200, 301, 302, or 404.
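
As a rough sketch of these points, the examples below express a few of the trigger queries described above as hypothetical definitions. The column names (duration_ms, endpoint, http.status_code), the dictionary structure, and the "not-in" operator are illustrative assumptions, not a specific product API.

```python
# Hypothetical trigger definitions illustrating the guidance above.
# Column names and the shape of these dictionaries are assumptions for illustration only.

# Detect latency spikes: filter at the cutoff, then COUNT the matching events.
latency_spike_trigger = {
    "name": "Checkout latency spikes",
    "description": "Runbook: https://wiki.example.com/runbooks/checkout-latency",
    "query": {
        "calculations": [{"op": "COUNT"}],
        "filters": [
            {"column": "duration_ms", "op": ">", "value": 100},   # the latency cutoff
            {"column": "endpoint", "op": "!=", "value": "/poll"},  # exclude the long-poll endpoint
        ],
    },
    "threshold": {"op": ">", "value": 50},  # fire when more than 50 events exceed the cutoff
}

# Alert on overall performance: P99 is less skewed by large outliers than AVG.
overall_latency_trigger = {
    "name": "Checkout latency (P99)",
    "query": {"calculations": [{"op": "P99", "column": "duration_ms"}]},
    "threshold": {"op": ">", "value": 250},  # milliseconds
}

# Detect errors by allowing known-good status codes rather than matching bad ones.
error_trigger = {
    "name": "Unexpected HTTP status codes",
    "query": {
        "calculations": [{"op": "COUNT"}],
        "filters": [
            {"column": "http.status_code", "op": "not-in", "value": [200, 301, 302, 404]},
        ],
    },
    "threshold": {"op": ">", "value": 0},
}
```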

SLOs 

Some of these are general guidelines, and some are specific to a particular alert type.

General Guidelines 

Regardless of the alert type:

  • Iterate when creating Burn Alerts. Start by sending alerts to an internal recipient (either a team member’s email address or a private Slack channel) to monitor the frequency of alerts in your system. Use these Burn Alerts as a first step toward understanding how your service performs and what kinds of alerts are actionable and important to your team, and then iterate.
  • Start with the shape of the signal that you care about:
    • For slow SLO burn, you care about issues that occur over a prolonged time period.
    • For fast SLO burn, you care about significant spikes over a shorter time period.
  • Use alerts to refine any new SLOs that you create. For new SLOs, start with a Budget Rate alert, which will notify you when system conditions impact your budget, to learn:
    • If you are missing any criteria in your SLI.
    • If you can historically sustain your SLO.

Exhaustion Time Burn Alerts 

When choosing the length of time for a given Exhaustion Time burn alert, consider the context and goals of your organization. The following questions can help you frame your initial Exhaustion Time burn alerts. If you are X hours away from running out of budget:

  • Who would need to know?
  • Via what method?
  • What would they need to do?

For example, a 24-hour exhaustion time alert can be useful if service quality is slowly degrading and a Slack-based notification allows the team to remediate the issue before the budget reaches zero (0). Alternatively, a 4-hour exhaustion time alert may be more urgent and require a pager notification, such as from PagerDuty.

Tip
We recommend creating at least one Exhaustion Time burn alert where the Exhaustion Time is 0. This will notify you when your SLO budget is completely exhausted.
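
As a small illustration of this progression, the sketch below lays out the 24-hour, 4-hour, and zero-hour Exhaustion Time burn alerts discussed above as hypothetical configuration for a single SLO. The field names and notification targets are assumptions for illustration, not a product schema.

```python
# Hypothetical Exhaustion Time burn alerts for one SLO, following the
# 24-hour / 4-hour / 0-hour progression described above.
exhaustion_time_alerts = [
    {"exhaustion_time_hours": 24, "notify": "slack:#service-health"},     # slow degradation: fix during business hours
    {"exhaustion_time_hours": 4,  "notify": "pagerduty:service-oncall"},  # urgent: page the on-call engineer
    {"exhaustion_time_hours": 0,  "notify": "pagerduty:service-oncall"},  # budget completely exhausted
]
```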

Budget Rate Burn Alerts 

When starting with a Budget Rate burn alert, consider whether you seek an alert for a smooth, slow burn or a fast, abrupt drop. Start with a less-sensitive alert and adjust as needed. Depending on the length of your SLO’s time period, try these values when creating Budget Rate burn alerts.

30 Day SLO Example 

Use the following example to create a series of Budget Rate burn alerts for your SLO. Each row represents an alert and its values.

Budget Decrease (%)   Time Window   Notification Type
2%                    1 hour        PagerDuty
5%                    6 hours       PagerDuty
10%                   3 days        Slack

7 Day SLO Example 

Use the following example to create a series of Budget Rate burn alerts for your SLO. Each row represents an alert and its values.

Budget Decrease (%)   Time Window   Notification Type
8.5%                  1 hour        PagerDuty
21.5%                 6 hours       PagerDuty
43.2%                 3 days        Slack
50%                   3.5 days      Slack
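
One informal way to compare rows like these (not a product setting) is to convert each into a burn-rate multiple: the fraction of budget consumed in the window, divided by the fraction that a steady, exactly-on-target burn would consume in that window. The helper below is a sketch of that arithmetic; it shows that the two tables give roughly equivalent sensitivity for their respective SLO periods.

```python
def burn_rate_multiple(budget_decrease_pct: float, window_hours: float, slo_period_hours: float) -> float:
    """How many times faster than a steady, exactly-on-target burn this alert corresponds to."""
    return (budget_decrease_pct / 100.0) * slo_period_hours / window_hours

# 30-day SLO (720 hours)
print(burn_rate_multiple(2, 1, 720))      # 14.4x -> fast burn, PagerDuty
print(burn_rate_multiple(5, 6, 720))      # 6.0x  -> fast burn, PagerDuty
print(burn_rate_multiple(10, 72, 720))    # 1.0x  -> slow burn, Slack

# 7-day SLO (168 hours)
print(burn_rate_multiple(8.5, 1, 168))    # ~14.3x
print(burn_rate_multiple(21.5, 6, 168))   # ~6.0x
print(burn_rate_multiple(43.2, 72, 168))  # ~1.0x
print(burn_rate_multiple(50, 84, 168))    # 1.0x
```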

Use the Time Window to Determine the Notification Method 

A long Time Window, such as 24 hours, is useful in detecting long, slow burns that use up your SLO budget faster than expected, but not fast enough to wake someone out of bed. A short Time Window, such as one hour, is useful in detecting very fast SLO budget burns that need to be addressed quickly.

Use the time window to determine the alert method. For example:

  • For a long, slow SLO budget decrease, send a Slack message, so the issue can be addressed during business hours.
  • For a short, fast SLO budget decrease, send a critical PagerDuty notification, so it can be acted on immediately.

Note
Although it may be counterintuitive, a Budget Rate alert with a long time window will also activate on a short, fast burn.

For example, if you have two Budget Rate burn alerts with the parameters:

  • Notify when the Budget Decrease exceeds 25% over a 24 hour period
  • Notify when the Budget Decrease exceeds 25% over a 1 hour period

If your environment encounters a large spike of errors and burns 25% of your SLO budget in the last hour, both the 24-hour and the 1-hour Budget Rate burn alerts will fire, because both windows include the last hour in their calculation.
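
A minimal sketch of why both alerts fire: both windows contain the most recent hour, so a decrease concentrated in that hour exceeds both thresholds. The budget numbers below are made up for illustration.

```python
# Made-up hourly budget decrease (%) over the last 24 hours, with a large
# error spike concentrated in the most recent hour.
hourly_budget_decrease = [0.0] * 23 + [30.0]

def budget_rate_alert_fires(threshold_pct: float, window_hours: int) -> bool:
    """True if the budget decrease over the trailing window exceeds the threshold."""
    return sum(hourly_budget_decrease[-window_hours:]) > threshold_pct

print(budget_rate_alert_fires(25.0, 24))  # True: the 24-hour window includes the spike
print(budget_rate_alert_fires(25.0, 1))   # True: so does the 1-hour window
```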

You might ask: since both Burn Alerts activated, why do you need both? You need both if you want to control where each Burn Alert sends its notification.

Use Budget Decrease to Control Alert Frequency 

To control the frequency of your Burn Alert and calibrate its sensitivity:

  • Increase the Budget Decrease value if the alert is too noisy.
  • Decrease the Budget Decrease value if the alert is too quiet.

Guidelines for when to use SLOs and Trigger Alerts 

In highly dynamic systems, attention is a scarce resource. There are more signals than you can process, and large systems often have ongoing issues. Even with a team, it is impossible to handle everything. You cannot grasp the entire system, and you should not be expected to.

SLOs and trigger alerts can help, but to make them most efficient, you should implement them with some general rules in mind.

General Guidelines 

To manage your scarce attention effectively:

  • Design thoughtful alerts: Create alerts that do not compete aggressively for your attention. They should interrupt you only for genuinely important matters, much as you would only interrupt a busy colleague for something that truly matters.
  • Prioritize interrupts: Use monitoring and alerting systems to direct your attention when needed, not to distract you unnecessarily.

Remember the principle: “Treat the patient, not the alarm.”

Prioritize users who will be directly impacted 

Prioritize anyone who has a vested interest and who will be directly impacted. This almost always includes customers, and for development and deployment flows, this also includes other engineers.

Your on-call structure likely means that one or two people already respond for many services. Setting an SLO for a specific microservice will not route pages to someone new; the same people on the same rotation will still be paged.

Prioritize issues that may surface bigger conversations 

The second-order goal of an SLO is to serve as a guide when discussing how to allocate engineering effort. If you are starting to slip on your ability to properly serve your users, you need to discuss what type of work you prioritize. The best SLO candidates are issues that, if unresolved, may surface bigger conversations throughout your organization.

Limit pager notifications 

Rather than notifying via pager for all alerts, consider perceived urgency and time of day. For a discussion of pager notification for specific alert types, see the Pager Notifications section.

Example Errors and Alert Types 

Because SLOs and Triggers work similarly (both repeatedly check whether a threshold of bad values has been crossed, then warn), choosing when to use each can be difficult. The choice becomes more difficult because SLOs can encompass a wide range of “users”, including customers, coworkers, and other services.

Below, we discuss the most common types of errors and our recommendations for each.

Error Rates
  Description: Identify whether customer-visible interactions succeed.
  Example: Could the user access their data?
  What should be used: SLOs
  Comment: Ignore failures outside of your control, since you cannot fix them. Often combined with performance indicators.

Performance
  Description: Identify whether customer-visible interactions happen within a delay you judge to be acceptable.
  Example: Did you retrieve account history within 10 seconds? Are you processing incoming transactions within N milliseconds?
  What should be used: SLOs
  Comment: Try to find measures that adjust to expected cost (for example, “we expect some complex queries to take longer”). Often combined with error rates.

Assertions and pre/post conditions
  Description: Identify whether some internal operations are taking place correctly, based on checks you put in place.
  Example: Are dangling lockfiles present? Has the CronJob run?
  What should be used: Triggers
  Comment: More checks make sense when shipping something new; they act like a production test. Over time, you may want to remove some and keep only the checks that indicate “things are messed up if this no longer happens”.

Normalcy
  Description: Define some order of magnitude where you consider things to be “normal” and want to know when you stray from it.
  Example: Cost of lambda queries? High login activity? Too little inbound traffic?
  What should be used: Triggers
  Comment: These are limited because they come from a pre-determined normative idea of what a customer should do, or from trying to figure out what that should be. As such, they tend to be brittle. If you can find a good weighting mechanism (for example, “customers in category X should never exceed spend Y”), then you can turn these into SLOs.

Saturation Thresholds
  Description: Identify issues that require significant ramp-up time to address in order to keep things safe before getting back to a stable level, as well as issues whose fixes are difficult to automate. In these cases, if you wait for end-user failures to show up, it’s usually too late to address the issue properly.
  Example: Data retention duration and recovery procedures? Connection limits to a database? Expired certificates?
  What should be used: Triggers
  Comment: Turning these into SLOs requires significant effort and tends to be closely associated with automation of contracts. For example, you could decide, “we guarantee 30% headroom on connection counts to all database users” and turn that into an SLO, but it’s much more straightforward to run a trigger that checks connection counts at regular intervals.

Pager Notifications 

When setting up SLOs and triggers, you will want to consider whether to notify your team via pager. Your decision should vary based on perceived urgency and the time of day.

Error Rates 

From time to time, an action that normally succeeds may begin to fail and drive up the error rate. If fixing the issue requires an hour of investigation, three to four hours of fixing, plus some time to deploy the fix, then you may want to be interrupted multiple hours ahead of time. Otherwise, there is no need to cancel a meeting or be awakened for something that can wait and be reasonably fixed the next day.

Pre/Post Conditions 

Pre/post conditions should identify issues that you would arguably want to investigate during gaps in your schedule or tasks that you could manually re-run the next day. For example, a pre/post condition could monitor connection limits to a database, which can point to sudden cascading failures.

In such cases, these should warn you early enough that they do not require immediate attention. You should only need to address them when they start to feel significantly broken.

Additionally, if you avoid addressing these issues immediately, you can track the rate of breakage. Though this is not a critical metric, tracking it can be valuable for assessing whether things are becoming more brittle over time.

Saturation Thresholds 

A saturation threshold represents a signal of future potential issues (for example, “in three months, an issue could arise”). The goal of a saturation threshold is to be visible enough, especially during slow times, for you to schedule corrective work. As such, a saturation threshold may not need to ever page you.

Normalcy 

When normalcy alerts involve small deviations from expectations (for example, “maybe we have a user going beyond normal limits to try something out right now”), they may resemble saturation thresholds. However, when there are huge deviations from expectations (for example, “that’s more than 10 times the very generous spend we expect; is someone actively abusing the platform?”), normalcy alerts may become full-blown incidents.

For both cases, consider creating staggered alerts: one alert that warns you via email or instant messaging, and one that pages you. If you plan to page for suspected abuse, you should also plan appropriate actions. For example, you may want to suspend the user or escalate to a team or department responsible for this type of issue.

Burn Rate 

For SLOs, your main tool to mediate and escalate alerting is the burn rate. If the budget is about to be empty in 24 hours, you should notify via email or instant messaging. If the budget is about to be empty in approximately 4 hours (a safe default), then you should page.
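
As a rough sketch of this escalation rule, the helper below projects time to budget exhaustion from the remaining budget and the recent burn rate, then picks a notification method accordingly. The function names, the 24-hour and 4-hour cutoffs taken from the guidance above, and the channel labels are illustrative assumptions, not a product API.

```python
def projected_hours_to_exhaustion(remaining_budget_pct: float, burn_pct_per_hour: float) -> float:
    """Project how long the remaining error budget will last at the recent burn rate."""
    if burn_pct_per_hour <= 0:
        return float("inf")  # budget is not being consumed
    return remaining_budget_pct / burn_pct_per_hour

def notification_method(hours_left: float) -> str:
    # Escalation rule from the guidance above: about 24 hours of budget left
    # -> email or instant messaging; about 4 hours left -> page.
    if hours_left <= 4:
        return "page"
    if hours_left <= 24:
        return "email or instant messaging"
    return "no notification"

hours_left = projected_hours_to_exhaustion(remaining_budget_pct=12.0, burn_pct_per_hour=3.0)
print(hours_left, notification_method(hours_left))  # 4.0 -> "page"
```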