Honeycomb SLO allows you to define and monitor Service Level Objectives (SLOs) for your organization. An SLO lets you define and enforce an agreement between two parties regarding the delivery of a given service. SLOs are often defined between service providers and customers, but they are also useful within an organization to clarify agreed-upon priorities for service delivery and for balancing feature work, bug fixes, and technical debt.
SLOs are not only a technical feature, but also a philosophy of monitoring and managing systems as articulated in the Google SRE book among other sources. Using Honeycomb SLO, you can describe and implement SLOs and be alerted in a timely and appropriate way.
An SLI is a service level indicator. It is a way of expressing, on a per-event level, whether your system is succeeding.
The SLO is the service level objective, which states how often the SLI must succeed over a given time period. An SLO is expressed as a percentage or ratio over a rolling time window, such as “99.9% for any given thirty days.”
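As a rough illustration of how the two definitions fit together, the sketch below treats an SLI as a function that returns true or false for each event, and measures compliance as the fraction of events that succeeded. The `Event` fields and the success criteria (no server error, under one second) are hypothetical, chosen only for the example; they are not Honeycomb's API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical event fields, used only for this illustration.
    status_code: int
    duration_ms: float


def sli(event: Event) -> bool:
    """Return True when a single event counts as a success."""
    # Assumed criteria: no server error and a response under one second.
    return event.status_code < 500 and event.duration_ms < 1000


def compliance(events: list[Event]) -> float:
    """Fraction of events in the window for which the SLI succeeded."""
    if not events:
        return 1.0  # assumption: an empty window counts as fully compliant
    return sum(sli(e) for e in events) / len(events)


events = [
    Event(200, 120.0),   # success
    Event(200, 1500.0),  # too slow
    Event(503, 80.0),    # server error
    Event(200, 300.0),   # success
]
print(compliance(events))  # 0.5
```

An SLO is then a statement about this ratio, for example that it must stay at or above 0.999 over the trailing 30 days.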
The Error Budget is the total number of failures tolerated by your SLO, whether measured by events or by time. For example, if a million events come in over 30 days, then a 99.9% compliance level means you can have 1,000 failed events over those 30 days. Another way to think about error budgets is in terms of time: if traffic is uniform and there are no brownouts or partial failures, a 99% target (a 1% error budget) allows roughly 7 hours of downtime over 30 days, and a 99.9% target allows roughly 43 minutes.
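The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and the target is passed as a string so that `Fraction` can parse it exactly and avoid floating-point rounding.

```python
from datetime import timedelta
from fractions import Fraction


def error_budget_events(total_events: int, target_percent: str) -> int:
    """Number of failed events the SLO tolerates, rounded down."""
    allowed_failure = 1 - Fraction(target_percent) / 100
    return int(total_events * allowed_failure)


def error_budget_downtime(window: timedelta, target_percent: str) -> timedelta:
    """Tolerated downtime, assuming uniform traffic and only total outages."""
    allowed_failure = 1 - Fraction(target_percent) / 100
    return window * allowed_failure.numerator / allowed_failure.denominator


print(error_budget_events(1_000_000, "99.9"))               # 1000
print(error_budget_downtime(timedelta(days=30), "99"))      # 7:12:00
print(error_budget_downtime(timedelta(days=30), "99.9"))    # 0:43:12
```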
The Budget Burndown, or the remaining error budget, is the amount of unused error budget in the current time period. Consider again the service that gets a million hits in 30 days, with a 99.9% target over the trailing 30 days, which gives it a budget of 1,000 failed events. If you have seen 550 failing events in the last 30 days, then 450 of those 1,000 remain, so you have 45% of your budget remaining.
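The same calculation, written as a minimal sketch with a hypothetical helper name. A negative result would mean the budget has been overspent.

```python
from fractions import Fraction


def budget_remaining(total_events: int, failed_events: int, target_percent: str) -> float:
    """Fraction of the error budget still unused in the window."""
    budget = total_events * (1 - Fraction(target_percent) / 100)
    if budget == 0:
        raise ValueError("a 100% target or an empty window leaves no error budget")
    return float((budget - failed_events) / budget)


print(budget_remaining(1_000_000, 550, "99.9"))  # 0.45
```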
A Burn Alert is an alert that signals that the error budget is being burned down rapidly.
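One way to reason about "rapidly" is to extrapolate the recent burn rate and ask how long the remaining budget would last. The sketch below assumes a simple linear projection over a lookback window; the function names, the lookback, and the threshold are illustrative assumptions, not a description of how Honeycomb evaluates burn alerts.

```python
from datetime import timedelta
from typing import Optional


def time_to_exhaustion(
    remaining_budget: float,
    burned_in_lookback: float,
    lookback: timedelta,
) -> Optional[timedelta]:
    """Project when the budget runs out if the recent burn rate continues."""
    if remaining_budget <= 0:
        return timedelta(0)  # budget is already exhausted
    if burned_in_lookback <= 0:
        return None  # nothing is burning, so there is no projected exhaustion
    return lookback * (remaining_budget / burned_in_lookback)


def should_alert(projected: Optional[timedelta], threshold: timedelta) -> bool:
    """Alert when the projected exhaustion falls within the threshold."""
    return projected is not None and projected <= threshold


# 450 failures of budget left, 150 failures in the last hour.
projected = time_to_exhaustion(450, 150, timedelta(hours=1))
print(projected)                                       # 3:00:00
print(should_alert(projected, timedelta(hours=4)))     # True
```

A short threshold suits alerts that should wake someone up, while a longer one can feed a lower-urgency channel.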
In addition to establishing acceptable or desired service levels overall, an SLO provides context that allows different members of the team to make good decisions about the service and its availability and performance. For example, the on-call team will be able to tell if an error is important enough to get out of bed for, and management will be able to report with precision on just how degraded a service is. SLOs can also provide an agreed-upon set of priorities for the organization to use when deciding whether to develop new functionality, fix bugs, or invest in infrastructure upgrades.
Having SLOs means you can make clear and accurate statements about the impact of both production incidents and development activities on your overall quality of service.