Define and manage Service Level Objectives (SLOs)


Honeycomb SLO allows you to define and monitor Service Level Objectives (SLOs) for your organization. An SLO expresses and enforces an agreement between two parties regarding the delivery of a given service. SLOs are most often defined between service providers and customers, but they are also useful within an organization to clarify agreed-upon priorities for service delivery and for balancing feature, bug, and tech debt work.

SLOs are not only a technical feature but also a philosophy of monitoring and managing systems as articulated in the Google SRE book among other sources. Using Honeycomb SLO, you can describe and implement SLOs and be alerted in a timely and appropriate way.

Note: SLOs you define apply to a single Honeycomb dataset.

SLO features are currently in closed beta. Please speak to Honeycomb sales or support to learn more about early access.

Definitions and concepts

An SLI is a service-level indicator. It’s a way of expressing, on a per-event level, whether your system is succeeding.

The SLO is the service level objective, which states how often the SLI must succeed over a given time period. An SLO is expressed as a percentage or ratio over a rolling time window, such as “99.9% for any given thirty days.”

The Error Budget is the total number of failures tolerated by your SLO. For example, if a million events come in over 30 days, then a 99.9% compliance level means you can have 1,000 failed events over those thirty days.

Another way to think about error budgets is in terms of time: if traffic is uniform and there are no brownouts or partial failures, a 1% error budget (a 99% target) corresponds to roughly 7 hours of downtime per 30 days, while a 99.9% target allows roughly 43 minutes.
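
Working that out for a rolling 30 day window:

1% budget: 0.01 × 30 days × 24 hours ≈ 7.2 hours of downtime
0.1% budget: 0.001 × 30 days × 24 hours × 60 minutes ≈ 43 minutes of downtime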

The Burn Down, or remaining error budget, is the amount of unused error budget in the current time period. Consider again the service that gets a million hits in 30 days while maintaining a 99.9% level over the trailing 30 days. If you’ve seen 550 failing events in the last 30 days, then you have 45% of your budget remaining.
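
In other words, for the example above:

error budget = (1 - 0.999) × 1,000,000 events = 1,000 allowed failures
budget used = 550 failed events
budget remaining = (1,000 - 550) / 1,000 = 45%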

A Burn Alert is an alert that signals that the error budget is being burned down rapidly.

Why implement SLOs?

In addition to establishing acceptable or desired service levels overall, an SLO provides context that allows different members of the team to make good decisions about the service and its availability, performance, and so on. For example, the on-call team will be able to tell whether an error is important enough to get out of bed for, and management will be able to report with precision on just how degraded a service is. SLOs also provide an agreed-upon set of priorities for the organization to use when deciding whether to develop new functionality, fix bugs, or invest in infrastructure upgrades.

Having SLOs means you can make clear and accurate statements about the impact of both production incidents and development activities on your overall quality of service.

Overview of the SLO process

To define and measure your SLO in Honeycomb, you will do the following:

  1. Create a derived column that returns true, false, or null to represent your SLI.
  2. Use that derived column to define your SLO in Dataset Settings | SLOs, or under the main SLO list.
  3. Monitor all the SLOs for your team from the SLOs page, which you can open by clicking the handshake icon in the left-hand menu.

Define the SLI with a derived column

An SLI reports whether an event is “successful” or not in terms of the goals of the SLO. Before you configure the SLO, you must define the indicator that it uses to evaluate your level of success. To do this, create a derived column in Honeycomb that evaluates “success” as you have defined it and returns true (successful), false (failed), or null (not applicable) for each event in the dataset.

To identify a suitable SLI, first express it in terms of user goals, such as “a user should be able to load our home page and see a result quickly”.

Next, identify qualified events: the events that contain information relevant to the SLI. In this example, these are events where request.path = “/home”.

For those events, the criterion for your SLI determines which are considered “successful”. In this case, success means duration_ms < 100.

Create a derived column to measure the SLI

Now, create a derived column that reflects this qualifier and criterion. Go to the dataset from which this SLI and its associated SLO will be calculated, and click the Derived Columns tab under Dataset Settings | Schema. For more detail, refer to the documentation on creating derived columns.

Honeycomb’s two-argument “IF” command can be convenient for this: IF( $a, $b) returns $b only if $a is true; otherwise, it returns null. Therefore, most SLIs are written as IF( qualifier, criterion).

Continuing with the previous example, the derived column for this SLI would look similar to:

IF( EQUALS( $request.path, "/home"), LT( $duration_ms, 100))
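
Purely as an illustration (the status_code field below is an assumed name, not part of the example dataset), a variant that also treats server errors as failures might look like:

IF( EQUALS( $request.path, "/home"), AND( LT( $duration_ms, 100), LT( $status_code, 500)))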

Refer to “Creating an SLI Derived Column” in the SLI Cookbook for more examples.

Test the SLI

Query the associated dataset for a COUNT and a HEATMAP(duration_ms), broken down by the SLI derived column.
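
For example, assuming you named the derived column sli_home_page_fast (the name is up to you), the query would look something like:

VISUALIZE  COUNT, HEATMAP(duration_ms)
GROUP BY   sli_home_page_fast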

Confirm that you see three groups: true, false and blank. (Blank events are those that are not qualified.) Your current level is approximately #true / (#true + #false).

Flip among the three groups and confirm that they look right for your use case and understanding of the dataset’s contents.

Define your SLO

Create an SLO on the SLO tab of the Datasets page.

To define your SLO, answer the following question, where “qualified events” are as defined in your SLI:

Over what period of time do you expect what percentage of qualified events to succeed?

For example, “I expect that 99% of qualified events will succeed over every 30 days.” As you select a level, note your current state, which you can find out by doing a count query grouped by your SLI derived column.
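
For instance, the example service described earlier, with 550 failed events out of a million qualified events over the last 30 days, is currently running at roughly (1,000,000 - 550) / 1,000,000 ≈ 99.945%, comfortably above a 99.9% target.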

Monitor your SLOs

Access all SLOs for the current team from the left nav bar, below the triggers bell.

To see the details of a particular SLO, click on it in the SLO list.

An SLO display has four components:

Define Burn Alerts

SLOs are especially useful when they warn you of upcoming issues. Honeycomb Burn Alerts warn you when your SLO budget will be exhausted in a certain amount of time.

Choose the length of time for a given burn alert based on the context and goals of your organization. A 24 hour burn alert can be useful for knowing whether service quality is slowly degrading (and so might be best sent to Slack); a 4 hour alert can be useful for knowing whether there is an urgent issue (and might go to PagerDuty).

Honeycomb computes burn alerts by extrapolating the current rate of budget burn over the previous exhaustion_time / 4 period. For example, a 24 hour burn alert fires when the trend over the last 6 hours implies that the budget will be exhausted within 24 hours. A burn alert stays triggered until the projected time to exhaustion moves back beyond the alert’s exhaustion time (plus a small buffer, to keep it from flapping).
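
As a rough sketch of that projection (an illustration of the idea, not Honeycomb’s exact implementation), assuming a linear burn rate:

lookback = exhaustion_time / 4 (6 hours for a 24 hour alert)
burn rate = budget spent during the lookback / lookback
projected time to exhaustion = remaining budget / burn rate
the alert fires when the projected time to exhaustion ≤ exhaustion_time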

For each SLO, click “Burn Alerts” to add a burn alert. The Burn Alert endpoints list is populated from the Triggers list. Learn how to add trigger integrations from Team Settings -> Integrations.

Burn Alerts are measured in hours. While it is possible to express fractional hours (0.25 corresponds to 15 minutes, for example), our experience is that burn alerts are most useful either set at zero (that is, notifying you when you are out of budget) or set somewhere between an hour and a few days.

For periods of less than an hour, there is not enough time to react, so the alert is rarely actionable. Conversely, a period of more than a few days almost never merits notification; instead, it effectively becomes the current SLO measurement.

Usage notes