We recommend that you follow certain best practices when creating alerts. Some of these are general guidelines, and some are specific to alert type.

Regardless of the alert type:

- When alerting on a duration threshold, filter for events above that threshold (for example, >100ms) with a COUNT. Your result will be the number of events that exceed your threshold.
- Use P95 or P99 calculations. These will be more representative of the majority of traffic than AVG, which can be polluted by large outliers (see the sketch after this list).
- Instead of filtering for a single error code (for example, == 500), use several filters to look for events that do not have status codes 200, 301, 302, or 404.
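To make the COUNT-over-threshold and percentile points concrete, here is a small Python sketch on synthetic latency data; the 100 ms threshold and the sample values are illustrative assumptions, not tied to any particular query tool.

```python
from statistics import mean, quantiles

# Synthetic latency sample: 98 "normal" requests plus 2 extreme outliers (milliseconds).
durations_ms = [20] * 60 + [45] * 38 + [5000, 8000]

threshold_ms = 100  # illustrative threshold; pick one that matters for your service
over_threshold = sum(1 for d in durations_ms if d > threshold_ms)

avg = mean(durations_ms)
p95 = quantiles(durations_ms, n=100)[94]  # 95th percentile

print(f"events over {threshold_ms} ms: {over_threshold}")  # 2
print(f"AVG: {avg:.1f} ms")  # 159.1 ms, inflated by the two outliers
print(f"P95: {p95:.1f} ms")  # 45.0 ms, representative of typical traffic
```

Even two extreme outliers drag the average far above typical latency, while the P95 and the over-threshold count stay faithful to what most users experience.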
When choosing the length of time for a given Exhaustion Time burn alert, consider the context and goals of your organization. To frame some initial Exhaustion Time burn alerts, ask: if you are X hours away from running out of budget, who needs to know, and how quickly do they need to act?
For example, a 24-hour exhaustion time alert can be useful if service quality is slowly degrading and a Slack-based notification allows the team to remediate the issue before the budget reaches zero (0).
Alternatively, a 4-hour exhaustion time alert may be more urgent and require a pager notification, such as from PagerDuty.
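To see how these exhaustion-time thresholds relate to a given burn rate, here is a minimal sketch, assuming a hypothetical helper that projects hours of budget remaining; the function name and numbers are illustrative.

```python
# Hypothetical helper: estimate how many hours remain until the SLO budget
# is exhausted, assuming the recent burn rate continues unchanged.
def hours_until_exhaustion(remaining_budget_pct: float,
                           burn_pct_per_hour: float) -> float:
    if burn_pct_per_hour <= 0:
        return float("inf")  # not currently burning budget
    return remaining_budget_pct / burn_pct_per_hour

# Example: 40% of the budget left, burning roughly 1.6% per hour.
eta = hours_until_exhaustion(remaining_budget_pct=40.0, burn_pct_per_hour=1.6)
print(f"budget exhausted in ~{eta:.0f} hours")  # ~25 hours

# At this pace, a 24-hour Exhaustion Time alert is about to fire (Slack-worthy),
# while a 4-hour alert would fire only much later, once the situation is urgent.
```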
Additionally, consider creating one Exhaustion Time burn alert with a value of 0. This will notify you when your SLO budget is completely exhausted.

When starting with a Budget Rate burn alert, consider whether you seek an alert for a smooth, slow burn or a fast, abrupt drop. Start with a less-sensitive alert and adjust as needed. Depending on the length of your SLO’s time period, try these values when creating Budget Rate burn alerts.
Use the following example to create a series of Budget Rate burn alerts for your SLO. Each row represents an alert and its values.
Budget Decrease (%) | Time Window | Notification Type |
---|---|---|
2% | 1 hour | PagerDuty |
5% | 6 hours | PagerDuty |
10% | 3 days | Slack |
For an SLO with a different time period, you might instead use values like the following. Each row represents an alert and its values.
Budget Decrease (%) | Time Window | Notification Type |
---|---|---|
8.5% | 1 hour | PagerDuty |
21.5% | 6 hours | PagerDuty |
43.2% | 3 days | Slack |
50% | 3.5 days | Slack |
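If you keep alert definitions in code or configuration, rows like those in the first table translate directly into a small data structure. The following Python sketch is a hypothetical representation; the field names are assumptions, not any product's configuration schema.

```python
# Hypothetical alert definitions mirroring the first table above;
# the field names are illustrative, not any product's configuration schema.
burn_alerts = [
    {"budget_decrease_pct": 2.0,  "time_window_hours": 1,      "notify": "pagerduty"},
    {"budget_decrease_pct": 5.0,  "time_window_hours": 6,      "notify": "pagerduty"},
    {"budget_decrease_pct": 10.0, "time_window_hours": 3 * 24, "notify": "slack"},
]

for alert in burn_alerts:
    print(f"fire if the budget drops {alert['budget_decrease_pct']}% "
          f"within {alert['time_window_hours']}h -> notify {alert['notify']}")
```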
A long Time Window, such as 24 hours, is useful in detecting long, slow burns that use up your SLO budget faster than expected, but not fast enough to wake someone out of bed. A short Time Window, such as one hour, is useful in detecting very fast SLO budget burns that need to be addressed quickly.
Use the time window to determine the notification method. For example, route short-window alerts to a paging service such as PagerDuty, and long-window alerts to a messaging channel such as Slack.
Although it may be counterintuitive, a Budget Rate alert with a long time window will also activate on a short, fast burn.
For example, suppose you have two Budget Rate burn alerts: one with a 24-hour time window and one with a 1-hour time window.
If your environment encounters a large spike of errors and burns 25% of your SLO budget in the last hour, both the 24-hour Budget Rate burn alert and the 1-hour Budget Rate burn alert will fire, because both include the last hour in their calculation.
You might ask: since both burn alerts activate, why do you need both? You need both if you want to control where each alert notifies, for example, paging for the short window and messaging for the long one.
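To illustrate why the long window also catches a fast burn, here is a minimal Python sketch over a hypothetical list of timestamped budget-burn events; the alert thresholds (5% per hour, 10% per 24 hours) are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)

# Hypothetical burn events: (timestamp, % of SLO budget burned at that time).
burn_events = [
    (now - timedelta(minutes=30), 25.0),  # large error spike within the last hour
    (now - timedelta(hours=20), 1.0),     # normal background burn earlier in the day
]

def budget_decrease(window: timedelta) -> float:
    """Total % of budget burned inside the trailing window."""
    cutoff = now - window
    return sum(pct for ts, pct in burn_events if ts >= cutoff)

# Assumed thresholds for the two alerts described above.
alerts = [
    {"name": "1-hour window",  "window": timedelta(hours=1),  "threshold_pct": 5.0},
    {"name": "24-hour window", "window": timedelta(hours=24), "threshold_pct": 10.0},
]

for alert in alerts:
    burned = budget_decrease(alert["window"])
    fired = burned >= alert["threshold_pct"]
    print(f"{alert['name']}: burned {burned:.1f}% -> fired={fired}")
# Both alerts fire, because the 25% spike falls inside both windows.
```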
To control the frequency of your Burn Alert and calibrate its sensitivity, adjust its Budget Decrease percentage and Time Window.
In highly dynamic systems, attention is a scarce resource. There are more signals than you can process, and large systems often have ongoing issues. Even with a team, it is impossible to handle everything. You cannot grasp the entire system, and you should not be expected to.
SLOs and trigger alerts can help, but to make them most efficient, you should implement them with some general rules in mind.
To manage your scarce attention effectively:
- Remember the principle: “Treat the patient, not the alarm.”
- Prioritize anyone who has a vested interest and who will be directly impacted. This almost always includes customers, and for development and deployment flows, it also includes other engineers.
- Keep your on-call structure in mind: one or two people already respond for many services, so setting an SLO for a specific microservice will not page a different person; it pages the same people on the same rotation.
- Remember that the second-order goal of an SLO is to serve as a guide when discussing how to allocate engineering effort. If you are starting to slip on your ability to properly serve your users, you need to discuss what type of work to prioritize. The best SLO candidates are issues that, if left unresolved, may surface bigger conversations throughout your organization.
- Rather than notifying via pager for all alerts, consider perceived urgency and time of day. For a discussion of pager notifications for specific alert types, see the Pager Notifications section.
Because SLOs and Triggers work similarly (both repeatedly check whether a value has crossed a bad threshold, then warn), choosing when to use each can be difficult. The choice becomes more difficult because SLOs can encompass a wide range of “users”, including customers, coworkers, and other services.
In the following table, we discuss the most common types of errors and our recommendations.
Type | Description | Example | What should be used | Comment |
---|---|---|---|---|
Error Rates | Identify whether customer-visible interactions succeed. | Could the user access their data? | SLOs | Ignore failures outside of your control since you cannot fix them. Often combined with performance indicators. |
Performance | Identify whether customer-visible interactions happen within a delay you judge to be acceptable. | Did you retrieve account history within 10 seconds? Are you processing incoming transactions within N milliseconds? | SLOs | Try to find measures that adjust to expected cost (for example, “we expect some complex queries to take longer”). Often combined with error rates. |
Assertions and pre/post conditions | Identify whether some internal operations are taking place correctly based on checks you put in place. | Are dangling lockfiles present? Has the CronJob run? | Triggers | More checks make sense when shipping something new; they act like a production test. Over time, you may want to remove some and keep only checks that indicate “things are messed up if this no longer happens”. |
Normalcy | Define some order of magnitude where you consider things to be “normal” and want to know when you stray. | Cost of lambda queries? High login activity? Too little inbound traffic? | Triggers | These are limited because they come from a pre-determined normative idea of what a customer should do, or from trying to figure out what that should be. As such, they tend to be brittle. If you can find a good weighting mechanism (for example, “customers in category X should never exceed spend Y”), then you can turn these into SLOs. |
Saturation Thresholds | Identify issues that require significant ramp-up time to address to keep things safe before getting back to a stable level and issues that have fixes that are difficult to automate. In these cases, if you wait for end-user failures to show up, it’s usually too late to address the issue properly. | Data retention duration and recovery procedures? Connection limits to a database? Expired certificates? | Triggers | Turning these into SLOs requires significant effort and tends to be closely associated with automation of contracts. For example, you could decide, “we guarantee 30% headroom on connection counts to all database users” and turn it into an SLO, but it’s much more straightforward to run a trigger that checks connection counts at regular intervals. |
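As a sketch of the last row's suggestion, the following Python snippet checks database connection headroom on an interval and warns when less than 30% remains free. The connection limit, stubbed count, and notification function are placeholders, not a real API.

```python
import time

MAX_CONNECTIONS = 500        # assumed database connection limit
HEADROOM_TARGET = 0.30       # keep at least 30% of connections free
CHECK_INTERVAL_SECONDS = 300

def current_connection_count() -> int:
    """Placeholder: query your database or metrics store for the live count."""
    return 420  # stub value for illustration

def notify(message: str) -> None:
    """Placeholder: send to Slack, email, or a paging service."""
    print(message)

def check_connection_headroom() -> None:
    used = current_connection_count()
    headroom = 1 - used / MAX_CONNECTIONS
    if headroom < HEADROOM_TARGET:
        notify(f"Connection headroom at {headroom:.0%} "
               f"({used}/{MAX_CONNECTIONS} used), below the {HEADROOM_TARGET:.0%} target")

if __name__ == "__main__":
    while True:  # a cron job or scheduler works just as well
        check_connection_headroom()
        time.sleep(CHECK_INTERVAL_SECONDS)
```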
When setting up SLOs and triggers, you will want to consider whether to notify your team via pager. Your decision should vary based on perceived urgency and the time of the day.
From time to time, the error rate on a normally successful action will start to climb. If fixing the issue requires an hour of investigation, three to four hours of fixing, plus some time to deploy the fix, then you may want to be interrupted multiple hours ahead of time. Otherwise, there is no need to cancel a meeting or be awakened for something that can wait and be reasonably fixed the next day.
Pre/post conditions should identify issues that you would arguably want to investigate during gaps in your schedule or tasks that you could manually re-run the next day. For example, a pre/post condition could monitor connection limits to a database, which can point to sudden cascading failures.
In such cases, these should warn you early enough to not require immediate attention. You should need to address them only when they start feeling significantly broken.
Additionally, if you avoid addressing these issues immediately, you can track the rate of breakage. Though this is not a critical metric, tracking it can be valuable to assess whether things are becoming more brittle over time.
A saturation threshold represents a signal of future potential issues (for example, “in three months, an issue could arise”). The goal of a saturation threshold is to be visible enough, especially during slow times, for you to schedule corrective work. As such, a saturation threshold may not need to ever page you.
When normalcy alerts involve small deviations from expectations (for example, “maybe we have a user going beyond normal limits to try something out right now”), they may resemble saturation thresholds. However, when there are huge deviations from expectations (for example, “that’s more than 10 times the very generous spend we expect; is someone actively abusing the platform?”), normalcy alerts may become full-blown incidents.
For both cases, consider creating staggered alerts–one alert that warns you via email or instant messaging, and one that pages you. If you plan to page for suspected abuse, you should also plan appropriate actions. For example, you may want to suspend the user or escalate to a team or department responsible for this type of issue.
For SLOs, your main tool to mediate and escalate alerting is the burn rate. If the budget is about to be empty in 24 hours, you should notify via email or instant messaging. If the budget is about to be empty in approximately 4 hours (a safe default), then you should page.
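As a minimal sketch of that escalation rule, assuming the 24-hour and 4-hour thresholds above and placeholder notification targets:

```python
# Hypothetical routing of burn alert notifications by projected exhaustion time;
# the 4-hour and 24-hour thresholds mirror the guidance above.
def route_burn_alert(hours_until_exhaustion: float) -> str:
    if hours_until_exhaustion <= 4:
        return "page"     # for example, PagerDuty: needs immediate attention
    if hours_until_exhaustion <= 24:
        return "message"  # for example, Slack or email: handle during working hours
    return "none"         # budget is healthy enough; no notification needed

print(route_burn_alert(3))   # page
print(route_burn_alert(18))  # message
print(route_burn_alert(72))  # none
```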