What is an error budget?

An error budget is how much downtime a system can afford without upsetting customers, or, in other words, the margin of error permitted by a service level objective (SLO).

Key takeaways

In practice, an error budget serves as a data point for deciding when to accelerate innovation or implement freezes.
While SRE teams usually track an error budget, they don’t make decisions regarding how it should be spent.
Error budgets can be measured in relation to availability or uptime, both of which are defined by a company’s Service Level Objective (SLO).
To use uptime effectively in error budgets, you’ll need to translate SLA/SLO targets into real numbers that development teams can use.

Why do tech teams need and use error budgets?

A site reliability engineering (SRE) team comprises software engineers who improve the reliability of their systems and software in production. While SRE teams usually track an error budget, they don’t make decisions regarding how it should be spent. Instead, SRE teams work with development teams to set error budgets and policies.

The key stakeholders involved in creating the error budget are:

Product owners, including product managers, business analysts, and product leads, who speak to the development team on behalf of the customer to communicate customer needs and the user journey.
SRE and operations teams, including DevOps, ITSM, problem management, and infrastructure engineers, who use software to manage a service, solve problems, and automate operations tasks.
Engineers who work on the product.
Customers, since SLOs are promises, though not legally binding ones, that the service provider makes to them.

What is the purpose of an error budget?

There is a delicate balance between releasing new features and maintaining an acceptable level of availability for customers. An error budget tracks whether a company is meeting the reliability promises it has made for a system or service, and keeps it from pursuing innovation at the expense of that system or service’s reliability.

In practice, an error budget serves as a data point for deciding when to accelerate innovation or implement freezes. When a company exceeds its error budget, SRE teams can pause innovation to eliminate persistent causes of errors from the system.

Error budgets and SLO

Error budgets can be measured in relation to availability or uptime, both of which are defined by a company’s Service Level Objective (SLO). In other words, an error budget is 1 minus the SLO of the service. A 99.9% SLO service has a 0.1% error budget.

Error budgets and SLI

Service-level indicators (SLIs) are the measures that indicate whether an SLO is being met. An SLI ranges from 0% to 100%, where 0% means nothing works and 100% means nothing is broken. As part of operationalizing SLOs, SRE teams translate SLI percentages into days and hours for software engineers.
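As a rough illustration of an SLI, the sketch below computes an availability percentage from event counts and compares it with an SLO. The source does not say how the SLI is measured, so the good-events-over-total-events ratio, the function names, and the sample counts are all assumptions made for the example.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return an SLI as a percentage between 0 and 100.

    Assumption: the SLI is the share of events (for example, requests)
    that succeeded. Other definitions, such as time-based uptime, are
    equally valid.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return 100.0 * good_events / total_events


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    # Hypothetical counts: 995 successful requests out of 1,000.
    sli = availability_sli(good_events=995, total_events=1000)
    print(f"SLI: {sli:.1f}%")
    print("Meets a 99% SLO:", meets_slo(sli, 99.0))
```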

Error budgets and SLA

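Some downtime is inevitable, which is why service-level agreements (SLAs) should never promise 100% uptime. The SLO is typically set stricter than the SLA, so a missed SLO acts as an early warning. If availability keeps falling and drops below the SLA target, the terms of the company’s SLA kick in.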

Suppose a company has an SLA of 98% and an SLO of 99% availability. The error budget would be 1%, and 1% of a 28-day window (672 hours) is 6.72 hours of downtime. If the SLI dips below 99% over that 28-day window, the service has used up all of its error budget and is no longer meeting the SLO.
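The arithmetic in this example can be checked with a short sketch. The function names and the 2.5 hours of observed downtime are hypothetical and chosen only for illustration; the 99% SLO and the 28-day window come from the example above.

```python
def allowed_downtime_hours(slo_percent: float, window_days: int) -> float:
    """Hours of downtime the SLO permits over the window."""
    # Error budget as a fraction: 1 minus the SLO.
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24


def budget_remaining_hours(
    slo_percent: float, window_days: int, observed_downtime_hours: float
) -> float:
    """Error budget left in the window; a negative value means it is overspent."""
    return allowed_downtime_hours(slo_percent, window_days) - observed_downtime_hours


if __name__ == "__main__":
    allowed = allowed_downtime_hours(99.0, 28)
    print(f"Allowed downtime: {allowed:.2f} hours")  # 6.72 hours

    # Hypothetical: 2.5 hours of downtime observed so far in the window.
    remaining = budget_remaining_hours(99.0, 28, observed_downtime_hours=2.5)
    print(f"Remaining budget: {remaining:.2f} hours")  # 4.22 hours
```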

If availability is above the number promised by the SLA/SLO, teams can release new features and take risks. But if it’s below the target, releases halt until availability is back on target.

What happens if you’ve spent or are close to spending your error budget?

When an error budget is close to being spent, SRE teams work with the development team to implement alerts and policies to minimize the impact failures and outages have on customers. This alerting policy is what makes error budgets and SLOs actionable.

If a team burns through its entire error budget, then contingency policies can come into effect to prevent further customer impact. For example, the team might go into code red and freeze all new releases until the number of errors is adequately reduced.

If there are simply too many errors, then the SRE team may have to do a system rollback to give developers enough time to deal with the errors gradually and release the changes over time.
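One way to make such a policy concrete is a small decision function, sketched below. The three outcomes mirror the stages described above (normal releases, alerting as the budget runs low, and a freeze once it is spent), but the function name and the 75% warning threshold are assumptions for illustration, not a recommended standard.

```python
def release_policy(
    budget_hours: float, spent_hours: float, warn_at: float = 0.75
) -> str:
    """Map error budget consumption to a hypothetical policy action.

    warn_at is an assumed threshold: the fraction of the budget that,
    once spent, should trigger alerts before the budget runs out.
    """
    if budget_hours <= 0:
        raise ValueError("budget_hours must be positive")
    spent_fraction = spent_hours / budget_hours
    if spent_fraction >= 1.0:
        return "freeze"   # budget exhausted: stop new releases
    if spent_fraction >= warn_at:
        return "alert"    # budget nearly spent: notify teams
    return "release"      # budget healthy: ship as usual


if __name__ == "__main__":
    budget = 6.72  # hours allowed by a 99% SLO over 28 days
    for spent in (1.0, 5.5, 7.0):
        print(f"{spent:.1f}h spent -> {release_policy(budget, spent)}")
```

In practice, the thresholds and the actions attached to them would be agreed between the SRE and development teams rather than hard-coded.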

How to use an error budget in your organization

Most DevSecOps teams monitor the uptime of applications and systems on a monthly basis. If uptime is above the SLA/SLO number, then engineering teams can take greater risks. This means more feature releases, more experiments, etc. If uptime is below the SLA/SLO number, then teams need to consider this and slow down the release schedule until uptime is back on track.

To use uptime effectively in error budgets, you’ll need to translate SLA/SLO targets into real numbers that development teams can use.
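A minimal sketch of that translation follows, assuming a 30-day month as the monitoring window. The function name and the list of sample targets are illustrative and not taken from any particular tool.

```python
from datetime import timedelta


def downtime_allowance(target_percent: float, window: timedelta) -> timedelta:
    """Convert an uptime target into the downtime it permits over a window."""
    if not 0.0 <= target_percent <= 100.0:
        raise ValueError("target_percent must be between 0 and 100")
    # The share of the window that may be spent on downtime is 1 minus the target.
    return window * ((100.0 - target_percent) / 100.0)


if __name__ == "__main__":
    month = timedelta(days=30)  # assumed monthly monitoring window
    for target in (99.0, 99.9, 99.99):
        allowance = downtime_allowance(target, month)
        minutes = allowance.total_seconds() / 60
        print(f"{target}% uptime -> about {minutes:.1f} minutes of downtime per 30 days")
```

Expressing the allowance in minutes or hours gives development teams a figure they can compare directly against incident durations.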

How Sumo Logic can help

Businesses are focused on achieving their goals, which is why they value robust observability platforms, like Sumo Logic, to help them measure their objectives and ensure they’re on track to meeting their KPIs, deadlines, and long-term strategies.

Try Sumo Logic’s free trial today to see how we can help you reach your goals and maintain quality assurance.
