Define and Manage Service Level Objectives (SLOs)

This feature is available as part of the Honeycomb Enterprise plan.

Honeycomb SLO allows you to define and monitor Service Level Objectives (SLOs) for your organization. An SLO lets you define and enforce an agreement between two parties regarding the delivery of a given service. In many cases, SLOs are defined between service providers and customers, but they are also useful within an organization to clarify agreed-upon priorities for service delivery and for feature, bug, and technical debt work.

SLOs are not only a technical feature but also a philosophy of monitoring and managing systems as articulated in the Google SRE book among other sources. Using Honeycomb SLO, you can describe and implement SLOs and be alerted in a timely and appropriate way.

Note: SLOs that you define apply to a single Honeycomb dataset.

Definitions and Concepts

An SLI is a service-level indicator. It’s a way of expressing, on a per-event level, whether your system is succeeding.

The SLO is the service-level objective, which states how often the SLI must succeed over a given time period. An SLO is expressed as a percentage or ratio over a rolling time window, such as “99.9% for any given thirty days.”

The Error Budget is the total number of failures tolerated by your SLO. For example, if a million events come in over 30 days, then a 99.9% compliance level means you can have 1,000 failed events over those 30 days. Another way to think about error budgets is in terms of time: if traffic is uniform and there are no brownouts or partial failures, a 99% target allows roughly 7 hours of downtime over 30 days, and a 99.9% target allows roughly 43 minutes.
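The arithmetic above can be sketched in a few lines of Python. The function names are illustrative only and are not part of any Honeycomb API; the downtime figure assumes an exactly 30-day window with uniform traffic and no partial failures.

```python
def error_budget_events(total_events: int, target_pct: float) -> int:
    """Number of failed events tolerated at the given compliance target."""
    # round() guards against floating-point noise such as 999.9999999
    return round(total_events * (100 - target_pct) / 100)


def allowed_downtime_minutes(window_days: int, target_pct: float) -> float:
    """Equivalent full-outage time in minutes, assuming uniform traffic."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_pct) / 100


print(error_budget_events(1_000_000, 99.9))      # 1000
print(allowed_downtime_minutes(30, 99.0) / 60)   # about 7.2 (hours)
print(allowed_downtime_minutes(30, 99.9))        # about 43.2 (minutes)
```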

The Burn Down, or remaining error budget, is the amount of unused error budget in the current time period. Consider again the service that gets a million hits in 30 days and maintains a 99.9% level over the trailing 30 days. Its budget is 1,000 failed events, so if you’ve seen 550 failing events in the last 30 days, you have 450 events, or 45% of your budget, remaining.
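As a minimal sketch of the remaining-budget calculation, assuming you already know the event counts for the window (the function name is hypothetical, not a Honeycomb API):

```python
def budget_remaining_fraction(total_events: int, failed_events: int,
                              target_pct: float) -> float:
    """Fraction of the error budget still unused in the current window.

    A negative result means the budget has been overspent.
    """
    budget = total_events * (100 - target_pct) / 100
    if budget == 0:
        # A 100% target leaves no budget to spend.
        return 0.0
    return (budget - failed_events) / budget


# One million events, 99.9% target, 550 failures so far.
print(round(budget_remaining_fraction(1_000_000, 550, 99.9), 4))  # 0.45
```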

A Burn Alert is an alert that signals that the error budget is being burned down rapidly.

Why Implement SLOs?

In addition to establishing acceptable or desired service levels overall, an SLO provides context that allows different members of the team to make good decisions about the service and its availability, performance, and so on. For example, the on-call team will be able to tell whether an error is important enough to get out of bed for, and management will be able to report with precision on just how degraded a service is. SLOs can also provide an agreed-upon set of priorities for the organization to use when deciding whether to develop new functionality, fix bugs, or invest in infrastructure upgrades.

Having SLOs means you can make clear and accurate statements about the impact of both production incidents and development activities on your overall quality of service.

Overview of the SLO Process

To define and measure your SLO in Honeycomb, you will do the following:

  1. Create a derived column that returns true, false, or null to represent your SLI.
  2. Use that derived column to define your SLO in Dataset Settings > SLOs, or under the main SLO list.
  3. Monitor all the SLOs for your team from the SLOs page, which you open by clicking the handshake icon in the left-hand menu.

Define the SLI with a Derived Column

An SLI reports whether an event is “successful” or not in terms of the goals of the SLO. Before you configure the SLO, you must define the indicator that it uses to evaluate your level of success. To do this, you create a derived column in Honeycomb that evaluates “success” as you’ve defined it and returns true (for successful), false (for failed), or null (for not applicable) for each event in the dataset.

To identify a suitable SLI, first express it in terms of user goals, such as “a user should be able to load our home page and see a result quickly”.

Next, identify the qualified events, that is, the events that contain information about the SLI. In this example, these are events where request.path = "/home".

For those events, the criterion for your SLI determines which events are considered “successful”. In this case, success means duration_ms < 100.

Create a Derived Column to Measure the SLI

Now, create a derived column that reflects this qualifier and criterion. To create the SLI derived column, go to the dataset containing the data from which this SLI and associated SLO will be calculated and click the Derived Columns tab under Dataset Settings > Schema. For more detailed documentation, refer to the documentation for creating derived columns.

Honeycomb’s two-argument IF function is convenient for this: IF($a, $b) returns $b only if $a is true; otherwise, it returns null. Therefore, most SLIs are written as IF(qualifier, criterion).

Continuing with the previous example, the derived column for this SLI would look similar to:

IF(EQUALS($request.path, "/home"), LT($duration_ms, 100))

Refer to “Creating a SLI Derived Column” in the SLI Cookbook for more examples.
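To make the true/false/null behavior concrete, the following Python sketch models the qualifier-and-criterion pattern on plain dictionaries. It is only an illustration of the logic, not Honeycomb's derived column engine; the field names request.path and duration_ms follow the example in the text, and the sketch assumes every qualified event carries a duration_ms value.

```python
from typing import Optional


def sli(event: dict) -> Optional[bool]:
    """Model of IF(qualifier, criterion) for the home page example."""
    # Qualifier: only requests for the home page count toward the SLO.
    if event.get("request.path") != "/home":
        return None  # not applicable, like the null returned by two-argument IF
    # Criterion: a qualified event succeeds when it finishes in under 100 ms.
    return event["duration_ms"] < 100


events = [
    {"request.path": "/home", "duration_ms": 42},    # qualified, fast
    {"request.path": "/home", "duration_ms": 350},   # qualified, slow
    {"request.path": "/login", "duration_ms": 20},   # not qualified
]
print([sli(e) for e in events])  # [True, False, None]
```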

Test the SLI

Query the associated dataset for a COUNT and a HEATMAP(duration_ms), broken down by the SLI derived column.

Confirm that you see three groups: true, false and blank. (Blank events are those that are not qualified.) Your current level is approximately #true / (#true + #false).
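The ratio above can be sketched as follows, assuming you have exported the SLI value of each event as True, False, or None (the function name is hypothetical). Events with a blank SLI value are excluded from both the numerator and the denominator.

```python
from collections import Counter
from typing import Iterable, Optional


def current_compliance(sli_values: Iterable[Optional[bool]]) -> Optional[float]:
    """Return #true / (#true + #false), ignoring unqualified (None) events."""
    counts = Counter(sli_values)
    qualified = counts[True] + counts[False]
    if qualified == 0:
        return None  # no qualified events, so compliance is undefined
    return counts[True] / qualified


values = [True] * 98 + [False] * 2 + [None] * 50
print(current_compliance(values))  # 0.98
```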

Flip among the three groups and confirm that they look correct for your use case and understanding of the dataset’s contents.

This process is illustrated in our blog entry, Working Toward Service Level Objectives.

Define Your SLO

You can get to the list of all SLOs across all datasets by selecting the SLO icon. Select “New SLO” to create a new SLO.

[Figure: SLO from the menu]

You can also create an SLO on the SLO tab of the Datasets page.

[Figure: SLO from a dataset]

To define your SLO, answer the following question, where “qualified events” are as defined in your SLI:

Over what period of time do you expect what percentage of qualified events to pass the SLO?

For example, “I expect that 99% of qualified events will succeed over every 30 days.” As you select a level, note your current state, which you can find out by doing a count query grouped by your SLI derived column.

[Figure: SLO creation dialog]

Monitor Your SLOs

Access all SLOs for the current team from the left navigation bar, below the Triggers bell.

To see the details of a particular SLO, select it in the SLO list.

An SLO display has four components:

  1. Budget Burndown. This looks at your current period: for every day in the time range, how much budget was used on that day, and how much budget is left? If more than zero remains, you’ve succeeded at your SLO for this period. The chart is computed over a rolling window, so every point is based on the time preceding it. It is a burndown, showing the cumulative error against the budget.
  2. Historical Compliance. For each day in the past SLO period, if you started there and looked back another full SLO period, what percentage of successful events would you cumulatively see? How does it compare to your SLO target?
  3. Heatmap. This shows events that pass the SLI (in blue-green) and events that fail it (in yellow) on a heatmap of duration. The time axis runs over a much shorter period than the full SLO period, on the logic that recent events are the most interesting. You can change which column is used with the dropdown, and you can adjust the time range with the time selector.
  4. BubbleUp. The BubbleUp series at the bottom of the screen shows the dimensions where the events that pass the SLI (in blue) and those that fail it (in yellow) are most different. This information can provide insight into the causes of current burn down activity. The BubbleUp looks at the same time period as the heatmap.
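The trailing-window calculation described above (for each day, look back one full SLO period and compute the cumulative percentage of successful events) can be sketched as below. This is an illustration on daily pass/fail counts, not Honeycomb's implementation; the function name and input shape are assumptions.

```python
from typing import List, Optional, Tuple


def rolling_compliance(daily_counts: List[Tuple[int, int]],
                       window_days: int) -> List[Optional[float]]:
    """Trailing-window compliance for each day.

    daily_counts holds (passed, failed) per day, oldest first. Each result
    covers that day plus the preceding window_days - 1 days (fewer at the
    start of the series, where less history exists).
    """
    results: List[Optional[float]] = []
    for i in range(len(daily_counts)):
        window = daily_counts[max(0, i - window_days + 1): i + 1]
        passed = sum(p for p, _ in window)
        failed = sum(f for _, f in window)
        total = passed + failed
        results.append(passed / total if total else None)
    return results


# Three days of counts with a two-day window.
print(rolling_compliance([(99, 1), (100, 0), (90, 10)], window_days=2))
# [0.99, 0.995, 0.95]
```

Comparing each value with the SLO target shows on which days the trailing window would have met the objective.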

[Figure: SLO summary view]

Define Burn Alerts

SLOs are especially useful when they warn you of upcoming issues. Honeycomb Burn Alerts warn you when your SLO budget will be exhausted in a certain amount of time.

Choose the length of time for a given burn alert based on the context and goals of your organization. A 24-hour burn alert is useful for learning that service quality is slowly degrading (and so might best be sent to Slack); a 4-hour alert is useful for learning that there is an urgent issue (and might go to PagerDuty).

Honeycomb computes burn alerts by extrapolating the current rate of budget burn, measured over the previous exhaustion_time / 4 period. For example, a 24-hour burn alert fires when the trend over the last 6 hours implies that the budget will run out within 24 hours. A burn alert stays fired until the projected time to exhaustion rises back above the alert's exhaustion time (plus a small buffer, to keep it from flapping).
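A simplified model of this projection is sketched below. It assumes a straight-line extension of the burn seen in the lookback window and ignores the anti-flapping buffer; it is not Honeycomb's actual implementation, and all names are hypothetical.

```python
import math


def hours_until_exhaustion(budget_remaining: float,
                           burned_in_lookback: float,
                           lookback_hours: float) -> float:
    """Project how long the remaining budget lasts at the recent burn rate."""
    if burned_in_lookback <= 0:
        return math.inf  # nothing is burning, so the budget never runs out
    burn_rate = burned_in_lookback / lookback_hours  # budget units per hour
    return budget_remaining / burn_rate


def should_fire(budget_remaining: float,
                burned_in_lookback: float,
                exhaustion_hours: float) -> bool:
    """Fire when the projected exhaustion falls inside the alert window.

    burned_in_lookback is the budget consumed during the last
    exhaustion_hours / 4 hours. Assumes exhaustion_hours > 0.
    """
    if exhaustion_hours <= 0:
        raise ValueError("this sketch only models alerts with a positive window")
    lookback_hours = exhaustion_hours / 4
    projected = hours_until_exhaustion(
        budget_remaining, burned_in_lookback, lookback_hours)
    return projected <= exhaustion_hours


# 24-hour alert, 400 failed events of budget left.
print(should_fire(400, 120, 24))  # True: 20 events/hour exhausts it in 20 hours
print(should_fire(400, 60, 24))   # False: 10 events/hour lasts 40 hours
```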

For each SLO, click “Burn Alerts” to add a burn alert. The Burn Alert endpoints list is populated from the Triggers list. Learn how to add trigger integrations from Team Settings > Integrations.

[Figure: SLO Burn Alert fired]

Burn Alerts are measured in hours. While it is possible to express fractional hours (0.25 corresponds to 15 minutes, for example), our experience is that burn alerts are most useful when set either to zero (that is, you are notified when you are out of budget) or to a value ranging from an hour to a few days.

For periods of less than an hour, there isn’t enough time to react, so the alert is not actionable. Conversely, a period of more than a few days almost never merits a notification; it effectively becomes the current SLO measurement.

[Figure: SLO Burn Alert creation]

Reset Your Remaining Budget

Burn Alerts will only trigger if you have budget remaining. If you’ve blown your error budget due to some issue and then fixed the problem, it’s worth resetting your error budget so burn alerts will start working again.

You can reset your budget back to 100% by clicking the “Reset” button under the Budget Burndown chart:

[Figure: SLO Reset button]

Selecting it erases all errors that have happened in the current SLO time period, up to and including the current hour. For example, if you have a 30-day SLO, you will be back to 100% for those 30 days. This affects both the Budget Burndown and Historical Compliance graphs on the Summary view, as well as the Current Percentage displayed in the SLO lists.

Best Practices and Usage Notes