
Set Service Level Objectives (SLOs)

Note
This feature is available as part of the Honeycomb Enterprise and Pro plans.

Honeycomb SLO allows you to define and monitor Service Level Objectives (SLOs) for your organization. An SLO defines and enforces an agreement between two parties regarding the delivery of a given service. SLOs are often defined between service providers and customers, but they are also useful within an organization to clarify agreed-upon priorities for service delivery and for balancing feature, bug, and technical-debt work.

SLOs are not only a technical feature, but also a philosophy of monitoring and managing systems as articulated in the Google SRE book among other sources. Using Honeycomb SLO, you can describe and implement SLOs and be alerted in a timely and appropriate way.

To learn more about guidelines for using SLOs and Triggers for alerting, visit Guidelines for SLOs and Trigger Alerts.

Definitions and Concepts 

An SLI is a service level indicator. It is a way of expressing, on a per-event level, whether your system is succeeding.
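
In Honeycomb, an SLI is typically expressed as a derived column that evaluates to true, false, or empty for each event. To illustrate the per-event idea only (this is not Honeycomb's derived column syntax), here is a minimal Python sketch; the field names, the healthcheck exclusion, and the 500 ms threshold are hypothetical:

```python
from typing import Optional

# Minimal per-event SLI sketch (illustration only; field names and thresholds
# are hypothetical).
def sli_fast_and_successful(event: dict) -> Optional[bool]:
    if event.get("request.path") == "/healthcheck":
        return None  # event does not qualify for this SLI; leave it out
    if event.get("response.status_code", 0) >= 500:
        return False  # server errors always fail the SLI
    return event.get("duration_ms", 0.0) < 500  # succeed only if fast enough
```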

The SLO is the service level objective, which states how often the SLI must succeed over a given time period. An SLO is expressed as a percentage or ratio over a rolling time window, such as “99.9% for any given thirty days.”

The Error Budget is the total number of failures tolerated by your SLO, whether measured by events or by time. For example, if a million events come in over 30 days, then a 99.9% compliance level means you can have 1,000 failed events over those thirty days. Another way to think about error budgets is in terms of time: if traffic is uniform, and there are no brownouts or partial failures, a 1% error budget (a 99% SLO) means roughly 7 hours of downtime per month, and a 99.9% SLO means roughly 44 minutes.
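
The arithmetic behind these figures is simple; here is a minimal sketch using the example numbers above (one million events over a 30-day window):

```python
# Error budget for a 99.9% SLO over a 30-day window.
target = 0.999
window_days = 30

total_events = 1_000_000
allowed_failures = total_events * (1 - target)  # ~1,000 failed events

window_minutes = window_days * 24 * 60                   # 43,200 minutes
downtime_budget_minutes = window_minutes * (1 - target)  # ~43 min (~44 for an average calendar month)
print(round(allowed_failures), round(downtime_budget_minutes, 1))
```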

The Budget Burndown, or the remaining error budget, is the amount of unused error budget in the current time period. Consider again the service that gets a million hits in 30 days, maintaining a 99.9% level over the trailing 30 days. If you have seen 550 failing events in the last 30 days, then you have 45% of your budget remaining.
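
Continuing the same example, the remaining budget is the unused fraction of the allowed failures:

```python
# Budget burndown: 99.9% over the trailing 30 days, 1,000,000 events, 550 failures so far.
allowed_failures = 1_000_000 * (1 - 0.999)  # ~1,000 failures allowed
failures_seen = 550
remaining = (allowed_failures - failures_seen) / allowed_failures
print(f"{remaining:.0%} of the error budget remains")  # 45%
```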

The Burn Rate is how fast you are consuming your error budget relative to your SLO. A burn rate of 1.0 means the error budget will be fully consumed exactly at the end of the SLO window: for a 30-day SLO, a consistent burn rate of 1.0 leaves you with zero (0) budget at 30 days, while a burn rate of 2.0 leaves you with zero (0) budget at 15 days. Burn rate matters when you want to understand the severity of issues affecting your SLO.

Note

Burn Rate is not the same as an error rate in an SLO. An error rate is the number of errors divided by the total number of events in the last time period. Burn rate is the ratio of the actual error rate to the expected error rate.

Burn rate helps gauge the impact that errors have on your services based on the agreed-upon reliability goals. If the burn rate is greater than 1.0, it indicates that the service is experiencing more errors than it should according to its SLOs.
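
In other words, burn rate compares the error rate you are actually observing with the error rate your SLO allows. A minimal sketch with hypothetical traffic numbers:

```python
# Burn rate = actual error rate / error rate allowed by the SLO.
# Values above 1.0 mean the budget will run out before the SLO window ends.
slo_target = 0.999
allowed_error_rate = 1 - slo_target  # 0.001

events_seen = 120_000   # hypothetical traffic in the lookback window
events_failed = 360
actual_error_rate = events_failed / events_seen  # 0.003

burn_rate = actual_error_rate / allowed_error_rate
print(round(burn_rate, 2))  # 3.0: a 30-day budget would be exhausted in about 10 days
```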

A Burn Alert is an alert that signals that the error budget is being burned down rapidly.

Why Implement SLOs? 

In addition to establishing acceptable or desired service levels overall, an SLO provides context that allows different members of the team to make good decisions about the service and its availability, performance, and so on. For example, the on-call team can tell whether an error is important enough to get out of bed for, and management can report with precision on just how degraded a service is. SLOs also provide an agreed-upon set of priorities that the organization can use when deciding whether to develop new functionality, fix bugs, or invest in infrastructure upgrades.

Having SLOs means you can make clear and accurate statements about the impact of both production incidents and development activities on your overall quality of service.

What Should SLOs Track? 

Deciding what to track using SLOs can be made easier if you keep the following tips in mind:

  • Be as close to the system’s edge as possible, then rely on tools like BubbleUp to pinpoint issues.

  • Prefer tracking based on user workflows rather than internal team structures. This makes SLOs more relevant to actual user concerns.

    • Strike a balance between user-centricity and manageability when designing SLOs. Be prepared to revisit and adjust as needed.
    • Remember that SLOs that track user workflows may pose challenges when triaging. Several teams and services may own a workflow, and determining the root cause of an SLO burn may be difficult if there are gaps in instrumentation or you rely on alerts being sent to the proper team.
  • Focus on actionable issues worth paging on. Filter out legitimate reasons for a user to see a failure (for example, invalid credentials or user disconnects).

    • Remember that missing rare edge cases in an SLO is a better trade-off than being constantly alerted or depleting your budget with non-actionable failures.
    • Consider using normalcy Triggers to detect unusual patterns (for example, sudden traffic drops to critical endpoints).
  • Identify load-dependent measures, and define them so they remain consistent as input increases.

    For example, if you have upload endpoints that may receive large files, making the SLO’s success independent of the payload size will make the SLO more reliable. In this case, consider using transfer speed instead of response time to normalize performance across payload sizes (see the sketch after this list).

  • Consider more comprehensive, broad SLOs for key performance events. This makes them self-contained and friendlier to use than creating multiple distinct SLOs or filtering various unrelated spans within a single derived column.

    For example, aim for 500 ms response times for “normal” interactive endpoints but allow a few more seconds on authentication endpoints that intentionally slow down when hashing passwords. This approach is beneficial because the load can still impact your infrastructure, and having both types of signal within the same service level indicator (SLI) can uncover weirder interactions.

  • Consider filtering out specific customers or bad actions (for example, pen-testing) as a temporary solution for better understanding. Use filtering when:

    • You are aware of the problem, and the customer is informed.
    • No immediate fix or SLO refinement is available.
    • You still want to be made aware of issues affecting the rest of the data.
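
The payload-size normalization mentioned above, judging uploads on transfer speed rather than raw response time, can be sketched as a per-event check. This is an illustration only; the field names and the 1 MB/s threshold are hypothetical:

```python
from typing import Optional

# SLI for an upload endpoint based on throughput rather than response time,
# so large uploads do not automatically fail the check.
MIN_THROUGHPUT_BYTES_PER_SEC = 1_000_000  # hypothetical 1 MB/s floor

def upload_sli(event: dict) -> Optional[bool]:
    if event.get("request.path") != "/upload":
        return None  # not an upload; this SLI does not apply
    duration_sec = event.get("duration_ms", 0) / 1000
    size_bytes = event.get("request.body_size_bytes", 0)
    if duration_sec <= 0 or size_bytes <= 0:
        return None  # cannot compute a meaningful throughput
    return size_bytes / duration_sec >= MIN_THROUGHPUT_BYTES_PER_SEC
```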

How Should You Structure and Approach SLOs? 

When you structure SLOs, follow these guidelines:

  • Develop SLOs incrementally, starting with an initial signal and gradually reducing noise for clearer signals.

  • Ideally, include new code paths for existing features in existing SLOs. If a new path’s volume is so low that 100% downtime would not alert anyone, consider creating a different SLO to track it separately.

  • In cases where your observability data does not align well with detecting both failures and performance issues within a single SLO, develop separate SLOs: one to monitor failures and one to monitor performance issues.

  • When determining a target budget and percentage, take an iterative approach.

    • Test and fine-tune SLOs by intentionally causing failures and faults to determine if they warrant alerting.
    • Adjust sensitivity as needed. If you initially set the sensitivity too high, feel free to lower it after discussing with others.
  • Whenever possible, organize observability around features rather than code structure. For additional insights, visit our blog post: Data Availability Isn’t Observability.

  • Document any exceptions or intricacies related to the SLI within the SLO description. Doing so lets you add more extended comments than the SLI page itself allows.