In highly dynamic systems, attention is a scarce resource. There are more signals than you can process, and large systems often have ongoing issues. Even with a team, it is impossible to handle everything. You cannot grasp the entire system, and you should not be expected to.
SLOs and trigger alerts can help, but they are most effective when you implement them with a few general rules in mind.
To manage your scarce attention effectively:
Remember the principle: “Treat the patient, not the alarm.”
Prioritize anyone who has a vested interest and who will be directly impacted. This almost always includes customers, and for development and deployment flows, this also includes other engineers.
Your on-call structure means that one or two people already respond for many services. Setting an SLO for a specific microservice will not cause someone else to be paged; it will page the same people on the same rotation.
The second-order goal of an SLO is to serve as a guide when discussing how to allocate engineering effort. If you are starting to slip on your ability to properly serve your users, you need to discuss what type of work you prioritize. The best SLO candidates are issues that, if unresolved, may surface bigger conversations throughout your organization.
Rather than notifying via pager for all alerts, consider perceived urgency and time of day. For a discussion of pager notification for specific alert types, see the Pager Notifications section.
Because SLOs and Triggers work similarly (both repeatedly check whether a value has crossed a bad threshold, then warn), choosing when to use each can be difficult. The choice is harder still because SLOs can encompass a wide range of “users”, including customers, coworkers, and other services.
In the following table, we discuss the most common types of alerts and our recommendation for each.
| Type | Description | Example | What should be used | Comment |
| --- | --- | --- | --- | --- |
| Error Rates | Identify whether customer-visible interactions succeed. | Could the user access their data? | SLOs | Ignore failures outside of your control since you cannot fix them. Often combined with performance indicators. |
| Performance | Identify whether customer-visible interactions happen within a delay you judge to be acceptable. | Did you retrieve account history within 10 seconds? Are you processing incoming transactions within N milliseconds? | SLOs | Try to find measures that adjust to expected cost (for example, “we expect some complex queries to take longer”). Often combined with error rates. |
| Assertions and pre/post conditions | Identify whether some internal operations are taking place correctly based on checks you put in place. | Are dangling lockfiles present? Has the CronJob run? | Triggers | More checks make sense when shipping something new; they act like a production test. Over time, you may want to remove some and keep only checks that indicate “things are messed up if this no longer happens”. |
| Normalcy | Define some order of magnitude where you consider things to be “normal” and want to know when you stray. | Cost of lambda queries? High login activity? Too little inbound traffic? | Triggers | These are limited because they come from a pre-determined normative idea of what a customer should do, or from trying to figure out what that should be. As such, they tend to be brittle. If you can find a good weighing mechanism (for example, “customers in category X should never break spend Y”), then you can turn these into SLOs. |
| Saturation Thresholds | Identify issues that require significant ramp-up time to address to keep things safe before getting back to a stable level and issues that have fixes that are difficult to automate. In these cases, if you wait for end-user failures to show up, it’s usually too late to address the issue properly. | Data retention duration and recovery procedures? Connection limits to a database? Expired certificates? | Triggers | Turning these into SLOs requires significant effort and tends to be closely associated with automation of contracts. For example, you could decide, “we guarantee 30% headroom on connection counts to all database users” and turn it into an SLO, but it’s much more straightforward to run a trigger that checks connection counts at regular intervals. |
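
The connection-count check described in the Saturation Thresholds row can be sketched as a small periodic check. This is a minimal illustration, not a product API: `ConnectionSample`, `headroom_fraction`, and `should_fire` are hypothetical names, and the 30% default is taken from the example in the table.

```python
from dataclasses import dataclass


@dataclass
class ConnectionSample:
    """One reading of database connection usage (hypothetical shape)."""
    active: int  # connections currently in use
    limit: int   # configured maximum for the database


def headroom_fraction(sample: ConnectionSample) -> float:
    """Return the fraction of the connection limit that is still free."""
    if sample.limit <= 0:
        raise ValueError("limit must be positive")
    return (sample.limit - sample.active) / sample.limit


def should_fire(sample: ConnectionSample, min_headroom: float = 0.30) -> bool:
    """Fire the trigger when free capacity drops below the required headroom."""
    return headroom_fraction(sample) < min_headroom
```

Run at a regular interval, `should_fire(ConnectionSample(active=80, limit=100))` would fire because only 20% of the limit remains, while a reading of 60 active connections would not.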
When setting up SLOs and triggers, you will want to consider whether to notify your team via pager. Your decision should vary based on perceived urgency and the time of the day.
From time to time, an action that normally succeeds may start to fail. If fixing the issue requires an hour of investigation, three to four hours of fixing, plus some time to deploy the fix, then you may want to be interrupted several hours ahead of time. Otherwise, there is no need to cancel a meeting or be awakened for something that can wait and be reasonably fixed the next day.
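The lead-time reasoning above can be written as simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the half-hour deploy time in the usage note is an assumed value, since the text only says "some time to deploy".

```python
def required_lead_time_hours(investigate: float, fix: float, deploy: float) -> float:
    """Total hours needed to investigate, fix, and deploy."""
    return investigate + fix + deploy


def should_interrupt_now(hours_until_impact: float, lead_time_hours: float) -> bool:
    """Interrupt only when waiting would leave less time than the fix needs."""
    return hours_until_impact <= lead_time_hours
```

With one hour of investigation, four hours of fixing, and an assumed half hour to deploy, the lead time is 5.5 hours. An issue expected to become critical in 5 hours warrants an interruption; one that is 20 hours away can wait until the next day.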
Pre/post conditions should identify issues that you would arguably want to investigate during gaps in your schedule or tasks that you could manually re-run the next day. For example, a pre/post condition could monitor connection limits to a database, which can point to sudden cascading failures.
In such cases, these checks should warn you early enough that they do not require immediate attention. You should need to address them only when they start feeling significantly broken.
Additionally, if you avoid addressing these issues immediately, you can track the rate of breakage. Though this is not a critical metric, tracking it can be valuable to assess whether things are becoming more brittle over time.
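One way to track the rate of breakage is to count how often a check fires per week. The sketch below assumes you can export the dates on which a check fired; `firings_per_week` is a hypothetical helper, and grouping by ISO week is an arbitrary choice.

```python
from collections import Counter
from datetime import date


def firings_per_week(firing_dates: list[date]) -> dict[tuple[int, int], int]:
    """Count check firings per (ISO year, ISO week)."""
    counts: Counter = Counter()
    for d in firing_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(counts)
```

A week-over-week increase in these counts suggests that the system is becoming more brittle, even if no single firing was urgent.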
A saturation threshold represents a signal of future potential issues (for example, “in three months, an issue could arise”). The goal of a saturation threshold is to be visible enough, especially during slow times, for you to schedule corrective work. As such, a saturation threshold may not need to ever page you.
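For a concrete case such as certificate expiry, a saturation check can report how much time is left and whether to schedule work, without ever paging. This is a minimal sketch: the function names and the returned labels are hypothetical, and the 90-day default mirrors the "three months" example above.

```python
from datetime import datetime, timezone


def days_until(expiry: datetime, now: datetime) -> float:
    """Days remaining before the expiry time (negative if already past)."""
    return (expiry - now).total_seconds() / 86400


def saturation_action(days_left: float, warn_days: float = 90.0) -> str:
    """Return 'schedule' when corrective work should be planned, else 'ok'."""
    return "schedule" if days_left <= warn_days else "ok"
```

For example, a certificate expiring on 1 March 2025, checked on 1 January 2025 (both in UTC), has 59 days left, so the check returns `"schedule"`.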
When normalcy alerts involve small deviations from expectations (for example, “maybe we have a user going beyond normal limits to try something out right now”), they may resemble saturation thresholds. However, when there are huge deviations from expectations (for example, “that’s more than 10 times the very generous spend we expect; is someone actively abusing the platform?”), normalcy alerts may become full-blown incidents.
For both cases, consider creating staggered alerts: one that warns you via email or instant messaging, and one that pages you. If you plan to page for suspected abuse, you should also plan appropriate actions. For example, you may want to suspend the user or escalate to a team or department responsible for this type of issue.
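A staggered normalcy alert can be sketched as two thresholds on the ratio of observed to expected values. In this illustration, `normalcy_alert` is a hypothetical function; the paging ratio of 10 comes from the example above, and the warning ratio of 2 is an assumed value you would tune.

```python
def normalcy_alert(
    observed: float,
    expected: float,
    warn_ratio: float = 2.0,   # assumed value, not from the text
    page_ratio: float = 10.0,  # "more than 10 times" the expected spend
) -> str:
    """Return 'page', 'notify', or 'none' based on deviation from expectations."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    ratio = observed / expected
    if ratio > page_ratio:
        return "page"
    if ratio >= warn_ratio:
        return "notify"
    return "none"
```

With an expected spend of 100, an observed spend of 300 produces a notification by email or instant messaging, while 1200 produces a page.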
For SLOs, your main tool to mediate and escalate alerting is the burn rate. If the budget is on track to be exhausted within 24 hours, you should notify via email or instant messaging. If the budget is on track to be exhausted within approximately 4 hours (a safe default), then you should page.
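The burn-rate escalation above can be sketched as a time-to-exhaustion calculation. This is an illustration under stated assumptions, not a product API: the function names are hypothetical, the budget is expressed as a fraction between 0 and 1, and the burn is assumed to continue at the rate seen over the last hour.

```python
def hours_until_exhausted(budget_remaining: float, burned_last_hour: float) -> float:
    """Estimate hours until the error budget is empty at the current burn rate.

    Both arguments are fractions of the total budget (for example, 0.5 = 50%).
    """
    if burned_last_hour <= 0:
        return float("inf")  # nothing is burning, so the budget never empties
    return budget_remaining / burned_last_hour


def burn_alert(hours_left: float, page_within: float = 4.0, notify_within: float = 24.0) -> str:
    """Return 'page', 'notify', or 'none' using the 4-hour and 24-hour defaults."""
    if hours_left <= page_within:
        return "page"
    if hours_left <= notify_within:
        return "notify"
    return "none"
```

With half of the budget remaining and a quarter of the budget burned in the last hour, the estimate is 2 hours, so the result is a page. The same remaining budget with 5% burned per hour leaves roughly 10 hours, which calls for a notification only.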