Honeycomb SLO allows you to define and monitor Service Level Objectives (SLOs) for your organization. An SLO lets you define and enforce an agreement between two parties regarding the delivery of a given service. SLOs are often defined between service providers and customers, but they are also useful within an organization to clarify agreed-upon priorities for service delivery and for feature, bug, and feature-debt work.
SLOs are not only a technical feature but also a philosophy of monitoring and managing systems, as articulated in the Google SRE book, among other sources. Using Honeycomb SLO, you can describe and implement SLOs and be alerted in a timely and appropriate way.
Note: SLOs you define apply to a single Honeycomb dataset.
An SLI is a service-level indicator. It’s a way of expressing, on a per-event level, whether your system is succeeding.
The SLO is the service level objective, which states how often the SLI must succeed over a given time period. An SLO is expressed as a percentage or ratio over a rolling time window, such as “99.9% for any given thirty days”.
The Error Budget is the total number of failures tolerated by your SLO. For example, if a million events come in over 30 days, then a 99.9% compliance level means you can have 1,000 failed events over those thirty days.
Another way to think about error budgets is in terms of time. If traffic is uniform and there are no brownouts or partial failures, a 1% error budget (a 99% target) means roughly 7 hours of downtime per month, and a 0.1% budget (a 99.9% target) means about 44 minutes of downtime per month.
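The arithmetic behind both views of the error budget can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the target is passed as a fraction (0.999 for 99.9%), and the time-based view assumes uniform traffic and a 30-day window.

```python
def error_budget_events(total_events: int, target: float) -> float:
    """Failed events tolerated when `target` (e.g. 0.999) must succeed."""
    return total_events * (1.0 - target)


def error_budget_minutes(window_days: float, target: float) -> float:
    """Equivalent full-outage minutes, assuming perfectly uniform traffic."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - target)


if __name__ == "__main__":
    # One million events at 99.9%: about 1,000 tolerated failures.
    print(round(error_budget_events(1_000_000, 0.999)))
    # 30 days at 99.9%: about 43 minutes; at 99%: about 7.2 hours.
    print(round(error_budget_minutes(30, 0.999), 1))
    print(round(error_budget_minutes(30, 0.99) / 60, 1))
```

For a 30-day window the 99.9% figure comes out slightly under the per-month number quoted above, because an average calendar month is a little longer than 30 days.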
The Burn Down, or the remaining error budget, is the amount of unused error budget in the current time period. Consider again the service that gets a million hits in 30 days, maintaining a 99.9% level over the trailing 30 days. If you’ve seen 550 failing events in the last 30 days, then you’ve got 45% of your budget remaining.
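As a rough check of the burn down example, the remaining budget can be computed as the unused share of the tolerated failures. The function name below is hypothetical, and the sketch simply restates the arithmetic from the text.

```python
def remaining_budget_fraction(total_events: int, failed_events: int, target: float) -> float:
    """Fraction of the error budget still unused; negative if overspent."""
    budget = total_events * (1.0 - target)  # tolerated failures
    return (budget - failed_events) / budget


if __name__ == "__main__":
    # 550 failures against a budget of about 1,000 leaves about 45%.
    fraction = remaining_budget_fraction(1_000_000, 550, 0.999)
    print(f"{fraction:.0%} of the budget remains")
```

A negative result means more failures occurred than the SLO tolerates for the window.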
A Burn Alert is an alert that signals that the error budget is being burned down rapidly.
In addition to establishing acceptable/desired service levels overall, an SLO provides context that allows different members of the team to make good decisions about things related to the service and its availability/performance/etc. For example, the on-call team will be able to tell if an error is important enough to get out of bed for, and management will be able to report with precision on just how degraded a service is. SLOs can also provide an agreed-upon set of priorities for the organization to use when making decisions to develop new functionality or fix bugs vs invest in infrastructure upgrades, and so on.
Having SLOs means you can make clear and accurate statements about the impact of both production incidents and development activities on your overall quality of service.
To define and measure your SLO in Honeycomb, you will do the following:
An SLI reports whether an event is “successful” or not in terms of the goals of the SLO. Before you configure the SLO, you must define the indicator that it uses to evaluate your level of success. To do this, you create a derived column in Honeycomb that evaluates ‘success’ as you’ve defined it and returns True (for successful), False (for failed) or null (for not applicable) for each event in the dataset.
Identify qualified events – which events contain information about the SLI. In this example, these are events where request.path = “/home”.
For those events, the criterion for your SLI determines which are considered “successful”. In this case, success means duration_ms < 100.
Now, create a derived column that reflects this qualifier and criterion. To create the SLI derived column, go to the dataset containing the data from which this SLI and associated SLO will be calculated and click the Derived Columns tab under Dataset Settings | Schema. For more detailed documentation, refer to the documentation for creating derived columns.
Honeycomb’s two-argument “IF” command can be convenient for this: IF($a, $b) returns $b only if $a is true; otherwise, it returns null. Therefore, most SLIs are written as IF(qualifier, criterion).
Continuing with the previous example, the derived column for this SLI would look similar to:
IF(EQUALS($request.path, "/home"), LT($duration_ms, 100))
Refer to “Creating an SLI Derived Column” in the SLI Cookbook for more examples.
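To make the three-valued result concrete, here is a small Python emulation of the qualifier/criterion pattern. It is not Honeycomb syntax. The event is modeled as a plain dict, the field names request.path and duration_ms follow the example criterion above, and the function name is hypothetical.

```python
from typing import Optional


def sli(event: dict) -> Optional[bool]:
    """Return True (success), False (failure), or None (not qualified)."""
    # Qualifier: only requests to /home count toward this SLI.
    if event.get("request.path") != "/home":
        return None
    # Criterion: a qualified event succeeds if it took under 100 ms.
    # This sketch assumes every qualified event carries duration_ms.
    return event["duration_ms"] < 100


if __name__ == "__main__":
    print(sli({"request.path": "/home", "duration_ms": 42}))    # True
    print(sli({"request.path": "/home", "duration_ms": 250}))   # False
    print(sli({"request.path": "/login", "duration_ms": 42}))   # None
```

Returning None rather than False for unqualified events matters: those events should not count for or against the SLO.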
Query the associated dataset for a COUNT and a HEATMAP(duration_ms), broken down by the SLI derived column.
Confirm that you see three groups: true, false, and blank. (Blank events are those that are not qualified.) Your current level is approximately #true / (#true + #false).
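The current level can be estimated from the three group counts. The sketch below assumes you have the SLI values as a list of True, False, and None (blank); the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def current_compliance(sli_values: Iterable[Optional[bool]]) -> Optional[float]:
    """#true / (#true + #false); blank (None) events are ignored."""
    counts = Counter(sli_values)
    qualified = counts[True] + counts[False]
    if qualified == 0:
        return None  # no qualified events, so the level is undefined
    return counts[True] / qualified


if __name__ == "__main__":
    values = [True, True, True, False, None, None]
    print(current_compliance(values))  # 0.75
```

Comparing this number with the target you plan to set shows whether the SLO would start out passing or already in deficit.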
Flip among the three groups and confirm that they look right for your use case and understanding of the dataset’s contents.
Create an SLO on the SLO tab of the Datasets page.
To define your SLO, answer the following question, where “qualified events” are as defined in your SLI:
Over what period of time do you expect what percentage of qualified events to pass the SLO?
For example, “I expect that 99% of qualified events will succeed over every 30 days.” As you select a level, note your current state, which you can find out by doing a count query grouped by your SLI derived column.
Access all SLOs for the current team from the left nav bar, below the triggers bell.
To see the details of a particular SLO, click on it in the SLO list.
An SLO display has four components:
REMAINING BUDGET: Looking at your current period: For every day in the time range, how much budget was used on that day? How much budget is left? If > 0, you’ve succeeded at your SLO for this period. If < 0, you’ve failed. This is computed as a rolling window, so every moment is based on the preceding period.
HISTORICAL COMPLIANCE: For each day in the past SLO period, if you started there and looked back one full SLO period, what percentage of successful events would you cumulatively see? How does it compare to your SLO target?
HEATMAP: This shows events that pass the SLI (in blue-green) and events that fail it (in yellow) on a heatmap of duration. The time axis runs over a much shorter period than the full SLO period, under the logic that it’s recent events that are most interesting. It uses a log10 scale, so that “2” means “100”, and “4” means “10,000”.
BUBBLEUP: The BubbleUp series at the bottom of the screen shows the dimensions where the events that pass the SLI (in blue) and those that fail it (yellow) are most different. This information can provide insight into the causes for current burn down activity.
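The rolling-window idea behind the historical compliance view can be sketched as follows. This is an illustration of the concept only, not a description of how Honeycomb computes the chart: the input is assumed to be a list of hypothetical daily (passed, failed) counts, and days before the start of the data are simply left out of the window.

```python
from typing import List, Optional, Tuple


def historical_compliance(
    daily_counts: List[Tuple[int, int]], window_days: int
) -> List[Optional[float]]:
    """For each day, compliance over the trailing window ending on that day."""
    results: List[Optional[float]] = []
    for day in range(len(daily_counts)):
        start = max(0, day - window_days + 1)
        window = daily_counts[start : day + 1]
        passed = sum(p for p, _ in window)
        failed = sum(f for _, f in window)
        total = passed + failed
        results.append(passed / total if total else None)
    return results


if __name__ == "__main__":
    daily = [(99, 1), (100, 0), (90, 10)]  # (passed, failed) per day
    print(historical_compliance(daily, window_days=2))  # [0.99, 0.995, 0.95]
```

Each value in the result can then be compared against the SLO target to see on which days the trailing window was in compliance.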
SLOs are especially useful when they warn you of upcoming issues. Honeycomb Burn Alerts warn you when your SLO budget will be exhausted in a certain amount of time.
Choose the length of time for a given burn alert based on the context and goals of your organization. A 24-hour burn alert can be useful to know if service quality is slowly degrading (and so might be best sent to Slack); a 4-hour alert can be useful to know if there is an urgent issue (and might go to PagerDuty).
Honeycomb computes burn alerts by extrapolating the current rate of budget burn, measured over the previous exhaustion_time / 4 period. For example, a 24-hour burn alert fires when the trend over the last 6 hours implies that the budget will be exhausted within 24 hours. A burn alert stays fired until the projected time to exhaustion rises back above the alert’s exhaustion time (plus a small buffer, to keep from flapping).
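One way to picture this calculation is a straight-line projection of the recent burn rate. The sketch below is an assumption about the general shape of the computation, not Honeycomb’s actual implementation: the function names are hypothetical, budget is measured in failed events, and the anti-flapping buffer is not modeled.

```python
def hours_until_exhaustion(
    budget_remaining: float, burned_in_lookback: float, lookback_hours: float
) -> float:
    """Project the lookback burn rate forward until the budget reaches zero."""
    if budget_remaining <= 0:
        return 0.0  # already exhausted
    if burned_in_lookback <= 0:
        return float("inf")  # nothing is burning, so no projected exhaustion
    burn_rate_per_hour = burned_in_lookback / lookback_hours
    return budget_remaining / burn_rate_per_hour


def burn_alert_fires(
    budget_remaining: float, burned_in_lookback: float, exhaustion_hours: float
) -> bool:
    """Fire when the projected exhaustion falls within the alert's horizon."""
    lookback_hours = exhaustion_hours / 4  # e.g. 6 hours for a 24-hour alert
    projected = hours_until_exhaustion(
        budget_remaining, burned_in_lookback, lookback_hours
    )
    return projected <= exhaustion_hours


if __name__ == "__main__":
    # 450 events of budget left, 150 burned in the last 6 hours:
    # 25 per hour, so 18 hours to exhaustion and a 24-hour alert fires.
    print(hours_until_exhaustion(450, 150, 6))  # 18.0
    print(burn_alert_fires(450, 150, 24))       # True
    # 60 burned in the last 6 hours projects to 45 hours: no alert.
    print(burn_alert_fires(450, 60, 24))        # False
```

A shorter exhaustion time uses a shorter lookback, which makes the alert react faster to sudden spikes but ignore slow degradation.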
For each SLO, click “Burn Alerts” to add a burn alert. The Burn Alert endpoints list is populated from the Triggers list. Learn how to add trigger integrations from Team Settings -> Integrations.
If you’re using Honeycomb Refinery and want to define an SLO in Honeycomb, contact customer support for assistance creating your SLI derived column to account for trace-based sampling.
The derived columns used to evaluate SLIs operate on data within your current retention period only. If you do not retain 30 days of data, you will not be able to develop a 30 day SLO/error budget. We’re exploring options to improve this.