Service Level Objectives (SLOs)


Define, monitor, and maintain your service reliability goals.

Important
This feature is available in the Honeycomb Enterprise and Pro plans.

What is an SLO? 

A Service Level Objective (SLO) defines the expected level of reliability of a service. Often it’s an agreement between a service provider and its customers, but SLOs can also be used internally to set priorities across teams.

SLOs combine both practice and philosophy. They bring discipline to how teams monitor and manage systems, following principles described in the Google SRE book.

For best practices when using SLOs and Triggers for alerting, visit Guidelines for SLOs and Trigger Alerts.

Learning Resource
For more structured learning, check out the Service Level Objectives course from Honeycomb Academy.

Key Concepts 

Key concepts help you understand the building blocks of Service Level Objectives (SLOs). These terms define how SLOs are structured and how they interact with one another.

  • Service Level Indicator (SLI): A per-event measurement that defines whether your system succeeded or failed.

  • Service Level Objective (SLO): The target proportion, expressed as a percentage or ratio, of successful SLIs over a rolling time window.
    Example: “99.9% for any given 30 days”.

  • Error Budget: The allowable amount of failure within the SLO window, measured by events or by time. Example: At 99.9% compliance with 1 million events in 30 days, you can tolerate 1,000 failed events. Viewed as downtime, with uniform traffic and no brownouts or partial failures, 99% availability allows ~7 hours of downtime in 30 days, while 99.9% allows ~44 minutes.

  • Budget Burndown: The remaining portion of the error budget within the current time window. Example: With a 99.9% target and 1 million events in 30 days (a budget of 1,000 failures), 550 failed events leave 45% of the budget remaining.

  • Burn Rate: How quickly you are consuming the error budget compared to the target. Burn rate helps you understand the severity of issues in your SLO. A burn rate of 1.0 consumes the budget evenly and depletes the error budget exactly within the SLO window. A burn rate of 2.0 depletes it twice as fast.

    Tip

    Burn Rate is not the same as an error rate in an SLO.

    • Error rate = number of errors / total events in the last time period
    • Burn rate = actual error rate / expected error rate

    If burn rate > 1.0, your service is burning through its error budget faster than expected. The sketch after this list works through the arithmetic.

  • Burn Alert: An alert triggered when the error budget is consumed unusually quickly.
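
To make these definitions concrete, here is a minimal Python sketch of the arithmetic, using the numbers from the examples above (a 99.9% target, 1 million events in the window, and 550 failures). It only illustrates the definitions; Honeycomb performs this accounting for you over the SLO's rolling window.

```python
# Minimal sketch of the error-budget arithmetic, using the numbers from the
# examples above. This only illustrates the definitions.

target = 0.999            # SLO: 99.9% of events succeed
window_days = 30          # rolling SLO window
total_events = 1_000_000  # events seen in the window
failed_events = 550       # failed events observed so far

# Error budget: how many failures the target tolerates in this window.
error_budget = total_events * (1 - target)             # 1,000 events

# Budget burndown: how much of that budget is left.
budget_remaining = 1 - failed_events / error_budget    # 0.45 -> 45% left

# Burn rate: observed error rate relative to the error rate the target allows.
actual_error_rate = failed_events / total_events       # 0.00055
allowed_error_rate = 1 - target                        # 0.001
burn_rate = actual_error_rate / allowed_error_rate     # 0.55 (slower than budgeted)

# Viewed as downtime with uniform traffic: 99.9% over 30 days ≈ 43.2 minutes
# (rounded to ~44 minutes in the example above).
allowed_downtime_minutes = (1 - target) * window_days * 24 * 60

print(f"budget remaining: {budget_remaining:.0%}, burn rate: {burn_rate:.2f}, "
      f"allowed downtime: ~{allowed_downtime_minutes:.0f} min")
```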

How SLOs Work 

SLOs let you define measurable reliability goals for your services and track them automatically in Honeycomb. Each SLO uses a Service Level Indicator (SLI) to measure success at the event level, and calculates:

  • Error Budget: How many failures your service can tolerate within the SLO window.
  • Budget Burndown: How much of the error budget remains at any point in time.
  • Burn Rate: How quickly the error budget is being consumed.

When the error budget is at risk, Honeycomb can alert your team so you can investigate and respond quickly.

Burn Alerts 

Burn alerts notify you when your error budget is depleting faster than expected, helping you prioritize incidents and prevent SLO violations.
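
To see why burn rate is a useful alerting signal, consider projecting how long the remaining budget would last if the recent burn rate continued. The sketch below is illustrative only; the 30-day window and four-day paging threshold are placeholder values, not Honeycomb defaults, and this is not how Honeycomb implements burn alerts.

```python
from datetime import timedelta

def projected_exhaustion(budget_remaining: float, recent_burn_rate: float,
                         slo_window: timedelta = timedelta(days=30)):
    """Estimate how long until the error budget is exhausted.

    A burn rate of 1.0 spends the whole budget in exactly one SLO window,
    so the remaining fraction lasts roughly
    (budget_remaining / burn_rate) * slo_window.
    """
    if recent_burn_rate <= 0:
        return None  # not burning budget at all
    return slo_window * (budget_remaining / recent_burn_rate)

# Example: 45% of the budget left, burning 4x faster than the target allows.
eta = projected_exhaustion(0.45, 4.0)
print(eta)  # ~3 days, 9 hours
if eta is not None and eta < timedelta(days=4):
    print("page the on-call: budget projected to run out within 4 days")
```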

Multiple Services 

You can apply a single SLO across multiple services, aggregating events from selected services to get a holistic view of system reliability. Multi-service SLOs work with environment-level Calculated Fields for consistent measurement.

Notifications 

When a burn alert or trigger fires, it notifies you via the configured notification method(s).

Tags 

Tags help you stay organized as your Team creates more SLOs. Use them to group related SLOs by team, project, service, or any other category that fits your workflow.

Tags make it easy to filter and find the SLOs you need, especially in shared or busy environments. Because tags are flexible and customizable, you can organize SLOs in the way that works best for you.

Why Implement SLOs? 

SLOs do more than set reliability targets. They provide shared context for decision-making across teams:

  • On-call engineers can prioritize incidents effectively.
  • Management can measure and report on service quality with precision.
  • Product and engineering teams can balance new feature work against infrastructure needs.

SLOs make reliability a measurable, shared concern.

Designing Effective SLOs 

Designing effective SLOs means choosing objectives that reflect what matters most to your users. This involves aligning reliability goals with business priorities, selecting clear SLIs, and setting thresholds that balance risk and user experience.

What to Track 

When deciding what to measure:

  • Measure close to the user: Track signals at the system’s edge, then use BubbleUp to pinpoint issues.

  • Design around user workflows: Prioritize user outcomes over team or service boundaries. Expect some triage challenges if ownership is spread across teams or instrumentation is incomplete, and be prepared to revisit and adjust as needed.

  • Alert only on actionable issues: Exclude known or expected failures (for example, invalid credentials or user disconnects) to avoid noise. Missing a few rare edge cases in an SLO is a better trade-off than depleting its budget with constant, non-actionable alerts. Use normalcy Triggers to detect unusual patterns, such as sudden traffic drops to critical endpoints.

  • Normalize load-dependent metrics: Choose measures whose success criteria remain meaningful as load or payload size grows. Example: For upload endpoints that may receive large files, the SLO is more reliable if success does not depend on payload size; measure transfer speed instead of raw response time so that large uploads are judged fairly (see the sketch after this list).

  • Favor broad, meaningful SLOs: A single comprehensive objective often provides more insight, and is easier to manage, than many fragmented ones or than filtering many unrelated spans within a single calculated field. Example: Aim for 500 ms response times on “normal” interactive endpoints, but allow a few extra seconds on authentication endpoints that intentionally slow down while hashing passwords. Load on either path still affects your infrastructure, and keeping both kinds of signal within the same service level indicator (SLI) can surface unexpected interactions.

  • Filter selectively: Exclude specific customers or known problematic traffic (for example, pen-testing) when it does not reflect true service health. Use filtering when:

    • You are aware of the problem, and the customer is informed.
    • No immediate fix or SLO refinement is available.
    • You still want to be made aware of issues affecting the rest of the data.
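
As a sketch of the load-normalization idea above, the following Python function classifies an upload event by transfer speed rather than by a fixed response-time ceiling. The field names (duration_ms, payload_bytes, status_code) and thresholds are hypothetical, and this is only an illustration; in Honeycomb you would express an equivalent rule when defining the SLI.

```python
# Hypothetical load-normalized SLI: large uploads are judged by throughput,
# not by a fixed latency ceiling that would unfairly penalize big payloads.

MIN_THROUGHPUT_BYTES_PER_SEC = 1_000_000  # illustrative threshold, not a default

def upload_sli(event: dict) -> bool:
    """Return True if this upload event counts as a success."""
    if event["status_code"] >= 500:
        return False
    seconds = event["duration_ms"] / 1000
    if seconds <= 0.5:
        return True  # fast enough regardless of payload size
    # For longer requests, judge by transfer speed instead of elapsed time.
    throughput = event["payload_bytes"] / seconds
    return throughput >= MIN_THROUGHPUT_BYTES_PER_SEC

print(upload_sli({"status_code": 200, "duration_ms": 800, "payload_bytes": 5_000_000}))  # True
print(upload_sli({"status_code": 200, "duration_ms": 800, "payload_bytes": 100_000}))    # False
```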

Structuring SLOs 

Keep these guidelines in mind:

  • Iterate: Start with an initial signal, reduce noise, and refine over time.

  • Include new code paths: Add them to existing SLOs when traffic volume is significant. Use separate SLOs only for low-volume but critical paths.

  • Separate concerns: Create distinct SLOs for performance vs. availability when needed.

  • Set and tune targets: Test by injecting failures and faults to ensure alerts trigger appropriately. Adjust sensitivity iteratively based on team feedback.

  • Organize around user features: Focus on user-facing outcomes, not code structure. For additional insights, visit our blog post: Data Availability Isn’t Observability.

  • Document details: Capture exceptions and SLI intricacies within the SLO description, which allows for more extended comments than the SLI page.

Multi-Service SLOs 

Note
Shared SLOs across datasets and services require the Environments and Services data model. To use this feature, migrate from Honeycomb Classic if needed.

Honeycomb supports SLOs that share a single error budget across multiple services.

How Multi-Service SLOs Work 

Multi-service SLOs let you define reliability targets that cover multiple services, capturing the combined user experience across related systems. This ensures that issues in any critical part of a workflow are reflected in the overall objective.

Key characteristics include:

  • Share a single error budget across up to 10 services.
  • Only events from included services are evaluated.
  • Traffic from all included services is weighted equally.
  • SLIs are defined as environment-level calculated fields.

To learn how to query on these SLIs in Query Builder, visit our example of Calculated Fields in multiple datasets.

Use Cases 

While most SLOs are best defined on a single edge service (the service that is closest to your end user), multi-service SLOs are useful for:

  • Multiple edge services: Users connect from many locations rather than from one centralized place (for example, service meshes or API gateways).
  • Migrating from a monolith to microservices: During a gradual migration, create SLOs that cover both legacy and new components.
  • Hot paths for critical flows: Define SLOs across services that form essential user workflows.

Evaluating Your Use Case 

Follow these guidelines to determine whether a multi-service SLO is appropriate for your scenario.

Can success/failure be defined from a single event? 

Honeycomb classifies individual events as successful or failed. SLOs that require relationships across multiple events are not supported.

  • Supported scenario: SLO includes frontend and cart services. SLI defines success as events with duration_ms < 50 ms.

    Independent SLO events for multiple services

    Because events can be categorized as successful or failed, Honeycomb supports this use case (see the sketch after these scenarios).

  • Unsupported scenario: SLO includes frontend and cart services, but success depends on a cart event being a child of a frontend event: events with duration_ms < 50 ms on cart events that are a child span of frontend.

    Dependent SLO events for multiple services

    This scenario requires combining two events to determine success or failure, so Honeycomb does not support this use case.
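
The difference between the two scenarios comes down to whether success is a function of a single event. A rough Python sketch, with illustrative field names (duration_ms, service.name, trace.parent_id, id), not a fixed schema:

```python
def supported_sli(event: dict) -> bool:
    # Success or failure is decided from this single event alone, so it can
    # be expressed as an SLI.
    return event["duration_ms"] < 50

def unsupported_rule(cart_event: dict, all_events: list[dict]) -> bool:
    # This rule needs a second event (the frontend parent span) before it can
    # decide, which is why it cannot be expressed as a per-event SLI.
    parent = next(
        (e for e in all_events if e["id"] == cart_event["trace.parent_id"]), None
    )
    return (
        parent is not None
        and parent["service.name"] == "frontend"
        and cart_event["duration_ms"] < 50
    )
```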

Do you want an SLO across all services in your environment? 

Honeycomb does not support a single SLO that covers all services in an environment. Instead, we recommend grouping related SLOs by team, product area, or critical path.

Calculating Multi-Service SLOs 

When applied to multiple services:

  • Events from excluded services are ignored.
  • SLIs apply equally to all included services.
  • Events are not weighted by traffic.

Assume that your environment contains service_a, service_b, service_c, and service_d, and that your SLO includes service_a, service_b, and service_c:

  • service_a receives 2 events (1 failed)
  • service_b receives 3 events (1 failed)
  • service_c receives 15 events (2 failed)
  • service_d will be excluded from SLO calculations because your SLO does not include it

Your SLO will calculate:

SLI = (number of successful events) / (number of total events) = (1 + 2 + 13) / (2 + 3 + 15) = 16 / 20 = 80%
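
The same calculation, written out as a small Python sketch (the event counts mirror the example above):

```python
# Events are pooled across the included services with no per-service
# weighting; service_d is simply ignored.

events = (
    [{"service": "service_a", "ok": ok} for ok in (True, False)] +
    [{"service": "service_b", "ok": ok} for ok in (True, True, False)] +
    [{"service": "service_c", "ok": ok} for ok in [True] * 13 + [False] * 2] +
    [{"service": "service_d", "ok": False}]  # not in the SLO, so never counted
)

included = {"service_a", "service_b", "service_c"}
relevant = [e for e in events if e["service"] in included]

sli = sum(e["ok"] for e in relevant) / len(relevant)
print(f"{sli:.0%}")  # 80%
```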

Detecting Anomalies with Multi-Service SLOs 

You can use BubbleUp with multi-service SLOs in the same way as with single-service SLOs, with a few important differences:

  • Clicking through the SLO Heatmap to the Query Builder loads an Environment query with a WHERE clause that filters by service names included in your multi-service SLO.
  • The service name field used in the Environment query comes from the Service Name field defined in your dataset definitions.
  • The SLO Heatmap and Query Builder heatmap may not match exactly, depending on how the service name is defined:
    • For the heatmaps to align, the service name must match the dataset name.
    • If your dataset definitions include multiple fields for service name (for example, service.name and service_name), Honeycomb will run an environment-wide query without filtering by service name.

Limitations 

  • You cannot create a single SLO that applies to all services in your environment.
  • Multi-service SLOs do not support team activity logs.