This feature is available as part of the Honeycomb Enterprise and Pro plans.
An SLO is defined over a single Honeycomb dataset. To define and measure your SLO in Honeycomb, you first create a derived column that returns true, false, or null to represent your service level indicator (SLI). An SLI reports whether an event is “successful” or not in terms of the goals of the SLO.
Before you configure the SLO, you must define the indicator that it uses to evaluate your level of success.
To do this, you create a derived column in Honeycomb that evaluates “success” as you have defined it and returns true (for successful), false (for failed), or null (for not applicable) for each event in the dataset.
To identify a suitable SLI, first express it in terms of user goals, such as “a user should be able to load our home page and see a result quickly.”
Next, identify the qualified events: the events that contain information about the SLI.
In this example, the qualified events are events where request.path = "/home".
For those events, the criterion for your SLI determines which events are considered “successful”.
In this case, success means duration_ms < 100.
If an event is not qualified, then null is returned.
If an event is qualified, then the result of the criterion (true or false) is returned.
Now, create a derived column that reflects this qualifier and criterion. To create the SLI derived column, go to Dataset Settings for the dataset where this SLI and its associated SLO will be calculated. Under the Schema tab, select Derived Columns. For more detail, refer to the documentation for creating derived columns.
Honeycomb’s two-argument “IF” command can be convenient for your derived column creation: IF( $a, $b) returns $b only if $a is true; otherwise, it returns null.
Therefore, most SLIs are written as IF( qualifier, criterion).
Continuing with the previous example, the derived column for this SLI would look similar to:
IF( EQUALS( $request.path, "/home"), LT( $duration_ms, 100))
Refer to SLI Formulas for more examples.
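To make the qualifier and criterion logic concrete, here is a minimal Python sketch of the same three-valued evaluation. It is only an illustration of the semantics described above, not Honeycomb's implementation: the function name sli and the representation of an event as a dictionary are assumptions, and it assumes every qualified event carries a duration_ms field.

```python
from typing import Optional


def sli(event: dict) -> Optional[bool]:
    """Mirror IF(qualifier, criterion): None when the event is not qualified.

    Hypothetical helper for illustration; events are plain dictionaries here.
    """
    # Qualifier: only requests for the home page count toward this SLI.
    if event.get("request.path") != "/home":
        return None
    # Criterion: a qualified event succeeds when it takes less than 100 ms.
    return event["duration_ms"] < 100
```

For example, a "/home" request that takes 42 ms evaluates to True, a "/home" request that takes 250 ms evaluates to False, and a request to any other path evaluates to None.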
To test your SLI, query the associated dataset for a COUNT and a HEATMAP(duration_ms), broken down by the SLI derived column.
Confirm that you see three groups: true, false, and blank.
(Blank events are those that are not qualified.)
Your current level is approximately #true / (#true + #false).
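As a worked illustration of that ratio, the following Python sketch computes the current level from a list of SLI values. The function name current_level is hypothetical, and the input is assumed to be the per-event SLI results (True, False, or None for unqualified events) rather than anything exported by Honeycomb in this exact shape.

```python
from collections import Counter
from typing import Iterable, Optional


def current_level(sli_values: Iterable[Optional[bool]]) -> Optional[float]:
    """Return #true / (#true + #false), ignoring unqualified (None) events."""
    counts = Counter(sli_values)
    qualified = counts[True] + counts[False]
    if qualified == 0:
        # No qualified events, so the level is undefined.
        return None
    return counts[True] / qualified
```

With three true events, one false event, and any number of blank events, the current level is 3 / 4, or 75%.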
Confirm that the three groups look correct for your use case and understanding of the dataset’s contents.
This process is illustrated in our blog entry, Working Toward Service Level Objectives.
To define your SLO, answer the following question, where “qualified events” are as defined in your SLI:
Over what period of time do you expect what percentage of qualified events to pass the SLO?
For example, “I expect that 99% of qualified events will succeed over every 30 days.” As you select a level, base it on your current state, which you can find by running a count query grouped by your SLI derived column.
You can get to the list of all SLOs across all datasets by selecting the SLO icon. Select “New SLO” to create a new SLO.
You can also create an SLO on the SLOs tab of the Datasets page.
Complete the form to create your SLO.
Access all SLOs for the current team from the left navigation bar by using the Handshake icon. The SLO list view shows information for each SLO, including:
Current - Shows the current historical compliance for the SLO.
Budget - Shows the current budget burndown for the SLO.
Burn Status - Shows if a burn alert is currently triggered, indicating that the error budget is rapidly burning down. When triggered, the burn alert with the shortest configured exhaustion time also displays here.
To see the details of a particular SLO, select the SLO within the SLO list.
An SLO detailed display has four components:
Budget Burndown - Looking at your current period: For every day in the time range, how much budget was used on that day? How much budget is left? If the remaining budget is greater than 0, you have succeeded at your SLO for this period. This is computed as a rolling window, so every moment is based on the preceding time. It is a burndown, showing the cumulative error against the budget.
Historical SLO Compliance - For each day in the past SLO period, if you started there and looked back one full SLO period, what percentage of qualified events would have succeeded cumulatively? How does that compare to your SLO target?
Heatmap - This display shows events that pass the SLI (in blue-green) and that fail it (in yellow) on a heatmap of duration. The time axis runs over a much shorter period than the full SLO period, which allows you to focus on individual events and times that set off the SLO. You can adjust which column is used with the dropdown; you can adjust the time range with the time selector.
BubbleUp - The BubbleUp series at the bottom of the screen shows the dimensions where the events that pass the SLI (in blue) and those that fail it (yellow) are most different. This information can provide insight into the causes for current burndown activity. The BubbleUp looks at the same time period as the heatmap.
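The remaining budget shown in the Budget Burndown can be reasoned about with a small calculation. The sketch below uses the common event-based definition of an error budget (the share of qualified events that are allowed to fail under the target); it is an assumption for illustration and not a statement of exactly how Honeycomb computes the chart. The function name budget_remaining and its parameters are hypothetical.

```python
def budget_remaining(target: float, passed: int, failed: int) -> float:
    """Fraction of the error budget left; negative once the budget is overspent.

    target is the SLO target as a fraction (0.99 for 99%).
    passed and failed are counts of qualified events in the SLO period.
    """
    qualified = passed + failed
    # With a 99% target, 1% of qualified events are allowed to fail.
    allowed_failures = (1.0 - target) * qualified
    if allowed_failures <= 0:
        raise ValueError("no error budget: target is 100% or there are no qualified events")
    return 1.0 - failed / allowed_failures
```

For example, with a 99% target and 10,000 qualified events, 100 failures are allowed. If 40 events have failed so far, roughly 60% of the budget remains.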
SLOs are especially useful when they warn you of upcoming issues. Honeycomb Burn Alerts warn you when your SLO budget will be exhausted in a certain amount of time.
Choose the length of time for a given burn alert based on the context and goals of your organization. A 24 hour burn alert can be useful to know if service quality is slowly degrading (and so might be best sent to Slack); a 4 hour alert can be useful to know if there is an urgent issue (and might go to PagerDuty).
Honeycomb computes burn alerts by extrapolating the current rate of budget burn. The rate is measured over a lookback window equal to the exhaustion time divided by 4. For example, a 24 hour burn alert will fire when the trend over the last 6 hours implies that the budget will be exhausted within 24 hours. A burn alert will stay fired until the projected time to exhaustion rises back above the configured exhaustion time (plus a small buffer, to keep from flapping).
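The extrapolation can be sketched as a simple linear projection. This is a simplified model under stated assumptions, not Honeycomb's actual algorithm: it assumes the budget is expressed as a fraction remaining, that the burn rate over the lookback window continues unchanged, and it leaves out the anti-flapping buffer. The function name burn_alert_fires and its parameters are hypothetical, and only positive exhaustion times are handled.

```python
def burn_alert_fires(
    budget_now: float,
    budget_at_lookback_start: float,
    exhaustion_hours: float,
) -> bool:
    """Project the recent burn rate forward and compare with the exhaustion time.

    Budgets are fractions of the error budget remaining (1.0 means untouched).
    budget_at_lookback_start is the budget exhaustion_hours / 4 hours ago.
    """
    if exhaustion_hours <= 0:
        raise ValueError("this sketch only handles positive exhaustion times")
    if budget_now <= 0:
        # Nothing left to project: the budget is already spent.
        return False
    lookback_hours = exhaustion_hours / 4
    burned = budget_at_lookback_start - budget_now
    if burned <= 0:
        # The budget did not shrink during the lookback window.
        return False
    burn_rate_per_hour = burned / lookback_hours
    hours_until_exhausted = budget_now / burn_rate_per_hour
    return hours_until_exhausted <= exhaustion_hours
```

For a 24 hour alert the lookback window is 6 hours. If the budget fell from 40% to 30% in that window, the projection reaches zero in about 18 hours, so the alert would fire. If it fell from 60% to 50%, the projection is about 30 hours, so it would not.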
To add a burn alert, select the SLO’s name in the SLO page to view its details. Select Configure Alerts, which will display the SLO’s existing Burn Alerts (if any) in list view. Select New Burn Alert and complete the form to configure your new Burn Alert’s exhaustion time and notify option.
The list of notify options is populated from Trigger Recipients, as found under Team Settings > Integrations.
Burn Alerts are measured in hours. While it is possible to express fractional hours (0.25 corresponds to 15 minutes, for example), our experience is that burn alerts are most useful when set at zero (that is, notifying you when you are out of budget) or when ranging from an hour to a few days.
For periods less than an hour, there is not enough time to react in order to make the SLO actionable. Conversely, a burn alert of more than a few days almost never merits a notification; instead, it effectively becomes the current SLO measurement.
Burn Alerts will only trigger if you have budget remaining. If you have blown your error budget due to some issue and then fixed the problem, it is worth resetting your error budget so burn alerts will start working again.
You can reset your budget back to 100% by clicking the “Reset” button under the Budget Burndown chart.
Selecting it will erase all errors that have happened in the current SLO time period, up to and including the current hour. For example, if you have a 30 day SLO, you will be back to 100% for that 30 days. This will affect both your Budget Burndown and Historical SLO Compliance graphs on the Summary view, as well as the Current Percentage displayed in the SLO lists.
While Honeycomb will track SLO values past your retention period, this only works for the Budget Burndown and the Historical SLO Compliance graphs. You cannot use BubbleUp or the heatmap to look at times beyond your retention period.
You may only have one SLO attached to any SLI derived column. For example, you may not have both a 30 day and a 60 day SLO attached to the same SLI column. You may have as many burn alerts attached to that SLO as you wish. (If you do find yourself needing more than one SLO attached to any SLI derived column, please contact Honeycomb for support; we would like to understand that scenario better!)
SLOs are most effective when you have a reasonably high volume of data: a small number of failures in an hour should not make a major dent in your reliability.
You should have fairly few SLOs for any dataset. Currently, the interface limits you to 30. SLOs should describe interfaces to a system rather than (say) customers. Customers should roughly have similar behavior to each other; if groups of customers have properties that set them apart from others, try to write SLOs against those properties instead.