Honeycomb SLO allows you to define and monitor Service Level Objectives (SLOs) for your organization. SLOs allow you to define and enforce an agreement between two parties regarding the delivery of a given service. In many cases, SLOs are defined between service providers and customers, but are also useful when used within an org to clarify agreed-upon priorities for service delivery and feature/bug/feature debt work.
SLOs are not only a technical feature but also a philosophy of monitoring and managing systems as articulated in the Google SRE book among other sources. Using Honeycomb SLO, you can describe and implement SLOs and be alerted in a timely and appropriate way.
Note: SLOs you define apply to a single Honeycomb dataset.
SLO features are available as part of the Honeycomb Enterprise plan. Please see Honeycomb pricing or contact us if you are not on an Enterprise plan.
An SLI is a service-level indicator. It’s a way of expressing, on a per-event level, whether your system is succeeding.
The SLO is the service level objective, which states how often the SLI must succeed over a given time period. An SLO is expressed as a percentage or ratio over a rolling time window, such as “99.9% for any given thirty days”.
The Error Budget is the total number of failures tolerated by your SLO. For example, if a million events come in over 30 days, then a 99.9% compliance level means you can have 1,000 failed events over those thirty days.
Another way to think about error budgets is in terms of time: if traffic is uniform, and there are no brownouts or partial failures, a 99% target (a 1% error budget) allows roughly 7 hours of downtime per month. A 99.9% target allows roughly 44 minutes of downtime per month.
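The arithmetic behind both views of the error budget can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not part of Honeycomb; the function names are hypothetical, and the time-based view assumes uniform traffic and full outages, as the text does.

```python
def error_budget_events(total_events: int, target: float) -> float:
    """Failed events tolerated for a target such as 0.999 (99.9%)."""
    return total_events * (1 - target)


def downtime_budget_minutes(window_days: float, target: float) -> float:
    """Minutes of full outage tolerated, assuming uniform traffic."""
    return window_days * 24 * 60 * (1 - target)


# One million events at 99.9% over 30 days: about 1,000 failures allowed.
print(round(error_budget_events(1_000_000, 0.999)))

# A 99% target over 30 days: about 7.2 hours of downtime.
print(round(downtime_budget_minutes(30, 0.99) / 60, 1))

# A 99.9% target over 30 days: about 43.2 minutes of downtime.
print(round(downtime_budget_minutes(30, 0.999), 1))
```

The exact figure depends on the length of the window: 30 days gives about 43.2 minutes at 99.9%, while an average calendar month of roughly 30.44 days gives closer to 44 minutes.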
The Burn Down, or the remaining error budget, is the amount of unused error budget in the current time period. Consider again the service that gets a million hits in 30 days, maintaining a 99.9% level over the trailing 30 days. The error budget is 1,000 failed events, so if you’ve seen 550 failing events in the last 30 days, then you’ve got 450 events, or 45% of your budget, remaining.
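As a rough sketch of the burn-down arithmetic, the remaining budget can be computed from the event count, the failure count, and the target. The function name and signature here are hypothetical and are not a Honeycomb API.

```python
def remaining_budget_fraction(total_events: int, failed_events: int,
                              target: float) -> float:
    """Fraction of the error budget still unused (negative if overspent)."""
    budget = total_events * (1 - target)  # failures the SLO tolerates
    if budget == 0:
        return 0.0  # a 100% target leaves no budget to spend
    return (budget - failed_events) / budget


# One million events, 99.9% target, 550 failures so far: 45% remains.
print(round(remaining_budget_fraction(1_000_000, 550, 0.999), 2))
```

A negative result means the budget is overspent and the SLO has been missed for the window.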
A Burn Alert is an alert that signals that the error budget is being burned down rapidly.
In addition to establishing acceptable or desired service levels overall, an SLO provides context that helps different members of the team make good decisions about the service and its availability and performance. For example, the on-call team will be able to tell if an error is important enough to get out of bed for, and management will be able to report with precision on just how degraded a service is. SLOs can also provide an agreed-upon set of priorities for the organization to use when deciding whether to develop new functionality, fix bugs, or invest in infrastructure upgrades.
Having SLOs means you can make clear and accurate statements about the impact of both production incidents and development activities on your overall quality of service.
To define and measure your SLO in Honeycomb, you will do the following:
An SLI reports whether an event is “successful” or not in terms of the goals of the SLO. Before you configure the SLO, you must define the indicator that it uses to evaluate your level of success. To do this, you create a derived column in Honeycomb that evaluates ‘success’ as you’ve defined it and returns True (for successful), False (for failed) or null (for not applicable) for each event in the dataset.
Identify qualified events: which events contain information about the SLI? In this example, these are events where request.path = "/home".
For those events, the criterion for your SLI determines which are considered “successful”. In this case, success means duration_ms < 100.
Now, create a derived column that reflects this qualifier and criterion. To create the SLI derived column, go to the dataset containing the data from which this SLI and associated SLO will be calculated and click the Derived Columns tab under Dataset Settings | Schema. For more detailed documentation, refer to the documentation for creating derived columns.
Honeycomb’s two-argument IF function can be convenient for this: IF($a, $b) returns $b only if $a is true; otherwise, it returns null. Therefore, most SLIs are written as:
IF(qualifier, criterion)
Continuing with the previous example, the derived column for this SLI would look similar to:
IF(EQUALS($request.path, "/home"), LT($duration_ms, 100))
Refer to “Creating an SLI Derived Column” in the SLI Cookbook for more examples.
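To make the three possible results concrete, here is a minimal Python sketch of the same qualifier-and-criterion logic applied to events represented as dictionaries. It only mirrors the behavior described above; it is not how Honeycomb evaluates derived columns, and it assumes every qualified event carries a duration_ms field.

```python
from typing import Optional


def sli(event: dict) -> Optional[bool]:
    """Return True (success), False (failure), or None (not qualified)."""
    # Qualifier: only requests to /home count toward this SLI.
    if event.get("request.path") != "/home":
        return None
    # Criterion: a qualified event succeeds if it took under 100 ms.
    return event["duration_ms"] < 100


events = [
    {"request.path": "/home", "duration_ms": 42},    # qualified, fast
    {"request.path": "/home", "duration_ms": 250},   # qualified, slow
    {"request.path": "/login", "duration_ms": 900},  # not qualified
]
print([sli(e) for e in events])  # [True, False, None]
```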
Query the associated dataset for a COUNT and a HEATMAP(duration_ms), broken down by the SLI derived column.
Confirm that you see three groups: true, false and blank. (Blank events are those that are not qualified.) Your current level is approximately #true / (#true + #false).
Flip among the three groups and confirm that they look right for your use case and understanding of the dataset’s contents.
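The current level from the grouped counts can also be worked out by hand. The sketch below shows the same ratio in Python, with unqualified (blank) events left out of both numerator and denominator; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def compliance(sli_values: Iterable[Optional[bool]]) -> Optional[float]:
    """#true / (#true + #false); None (blank) values are not counted."""
    counts = Counter(sli_values)
    qualified = counts[True] + counts[False]
    if qualified == 0:
        return None  # no qualified events, so no meaningful level
    return counts[True] / qualified


# Three successes, one failure, two unqualified events: 3 / 4 = 0.75
print(compliance([True, True, True, False, None, None]))
```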
This process is illustrated in our blog entry, Working Toward Service Level Objectives.
You can get to the list of all SLOs across all datasets by clicking on the SLO icon. Click “new SLO” to create a new SLO.
You can also create an SLO on the SLO tab of the Datasets page.
To define your SLO, answer the following question, where “qualified events” are as defined in your SLI:
Over what period of time do you expect what percentage of qualified events to pass the SLO?
For example, “I expect that 99% of qualified events will succeed over every 30 days.” As you select a level, note your current state, which you can find out by doing a count query grouped by your SLI derived column.
Access all SLOs for the current team from the left nav bar, below the triggers bell.
To see the details of a particular SLO, click on it in the SLO list.
An SLO display has four components:
BUDGET BURNDOWN: For the current period, how much budget was used on each day in the time range, and how much budget is left? If the remaining budget is greater than 0, you have met your SLO for this period. This is computed as a rolling window, so every point is based on the period preceding it. The graph is a burndown: it shows cumulative errors against the budget.
HISTORICAL COMPLIANCE: For each day in the past SLO period, if you started at that day and looked back one full SLO period, what percentage of events succeeded cumulatively? How does that compare to your SLO target?
HEATMAP: This shows events that pass the SLI (in blue-green) and events that fail it (in yellow) on a heatmap of duration. The time axis runs over a much shorter period than the full SLO period, on the logic that recent events are the most interesting. You can adjust which column is used with the dropdown, and you can adjust the time range with the time selector.
BUBBLEUP: The BubbleUp series at the bottom of the screen shows the dimensions where the events that pass the SLI (in blue) and those that fail it (yellow) are most different. This information can provide insight into the causes for current burn down activity. The BubbleUp looks at the same time period as the heatmap.
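The rolling-window idea behind the Historical Compliance graph can be sketched as follows. This is an illustrative calculation over per-day counts, not Honeycomb's implementation; the function name and the (passed, failed) input format are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def rolling_compliance(daily_counts: List[Tuple[int, int]],
                       window_days: int) -> List[Optional[float]]:
    """Compliance for each day over the trailing window ending on that day.

    daily_counts holds (passed, failed) qualified-event counts, oldest first.
    Early days use however much history is available.
    """
    results: List[Optional[float]] = []
    for end in range(len(daily_counts)):
        start = max(0, end - window_days + 1)
        window = daily_counts[start:end + 1]
        passed = sum(p for p, _ in window)
        failed = sum(f for _, f in window)
        total = passed + failed
        results.append(passed / total if total else None)
    return results


# Three days of counts with a 2-day window.
print(rolling_compliance([(99, 1), (100, 0), (90, 10)], window_days=2))
# [0.99, 0.995, 0.95]
```

Each value can then be compared against the SLO target to see on which days the trailing window was in compliance.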
SLOs are especially useful when they warn you of upcoming issues. Honeycomb Burn Alerts warn you when your SLO budget will be exhausted in a certain amount of time.
Choose the length of time for a given burn alert based on the context and goals of your organization. A 24-hour burn alert can be useful to know if service quality is slowly degrading (and so might be best sent to Slack); a 4-hour alert can be useful to know if there is an urgent issue (and might go to PagerDuty).
Honeycomb computes burn alerts by extrapolating the current rate of budget burn, measured over the previous exhaustion_time / 4 period. For example, a 24-hour burn alert will fire when the trend over the last 6 hours implies that the budget will be exhausted within 24 hours. A burn alert will stay fired until the projected time to exhaustion rises back above the configured exhaustion time (plus a small buffer, to keep the alert from flapping).
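The following Python sketch shows one way to read that description: take the budget burned during the lookback window (exhaustion_time / 4), convert it to an hourly rate, and check whether the remaining budget would run out within the exhaustion time. It is a simplified illustration under those assumptions, not Honeycomb's actual algorithm; the function and parameter names are hypothetical, and the anti-flapping buffer is omitted.

```python
def burn_alert_should_fire(budget_remaining: float,
                           burned_in_lookback: float,
                           exhaustion_hours: float) -> bool:
    """Would the budget run out within exhaustion_hours at the recent rate?

    burned_in_lookback is the budget consumed during the most recent
    exhaustion_hours / 4 hours, in the same units as budget_remaining.
    """
    if budget_remaining <= 0:
        return True  # already out of budget
    if exhaustion_hours == 0 or burned_in_lookback <= 0:
        return False  # a zero-hour alert fires only at exhaustion; no burn, no alert
    lookback_hours = exhaustion_hours / 4
    burn_rate = burned_in_lookback / lookback_hours  # budget units per hour
    hours_to_exhaustion = budget_remaining / burn_rate
    return hours_to_exhaustion <= exhaustion_hours


# 24-hour alert: 400 events of budget left, 120 burned in the last 6 hours.
# Rate is 20 per hour, so exhaustion is about 20 hours away: the alert fires.
print(burn_alert_should_fire(400, 120, 24))  # True

# Same budget, but only 60 burned in 6 hours: about 40 hours away, no alert.
print(burn_alert_should_fire(400, 60, 24))   # False
```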
For each SLO, click “Burn Alerts” to add a burn alert. The Burn Alert endpoints list is populated from the Triggers list. Learn how to add trigger integrations from Team Settings -> Integrations.
Burn Alerts are measured in hours. While it is possible to express fractional hours (0.25 corresponds to 15 minutes, for example), our experience is that burn alerts are most useful when set to zero (that is, they notify you when you are out of budget) or to a value ranging from an hour to a few days.
For periods less than an hour, there isn’t enough time to react in order to make the SLO actionable. Conversely, for periods more than a few days, it almost never merits notification – instead, it effectively becomes the current SLO measurement.
While Honeycomb will track SLO values past your retention period, this only works for the Budget Burndown and the Historical Compliance graphs. You cannot use BubbleUp or the heatmap to look at times beyond your retention period.
You may only have one SLO attached to any SLI derived column. For example, you may not have both a 30 day and a 60 day SLO attached to the same SLI column. (If you do find yourself needing this, please contact Honeycomb for support; we’d like to understand that scenario better!) You may have as many burn alerts attached to that SLO as you wish.
SLOs are most effective when you have a reasonably high volume of data: a small number of failures in an hour should not make a major dent in your reliability.
You should have fairly few SLOs for any dataset. Currently, the interface limits you to 15. SLOs should describe interfaces to a system rather than (say) customers. Customers should roughly have similar behavior to each other; if groups of customers have properties that set them apart from others, try to write SLOs against those properties instead.
If you’re using the Honeycomb Refinery beta and want to define an SLO in Honeycomb, contact customer support for assistance creating your SLI derived column to account for trace-based sampling.