Error Budget - Definition & Overview

In this article
What is an Error Budget?
Who sets and uses error budgets?
What is the purpose of an error budget?
Error budgets and SLO
Error budgets and SLI
Error budgets and SLA
What happens if you’ve spent or are close to spending your error budget?
How Sumo Logic can help

Key Takeaways

  • Your SLO will define your error budget.
  • When a company exceeds its error budget, SRE teams can pause innovation to eliminate persistent causes of errors from the system.
  • If a company exceeds its error budget, customers are likely to complain and be unhappy with the service.

What is an Error Budget?

An error budget is how much downtime a system can afford without upsetting customers, or, in other words, the margin of error permitted by a service level objective (SLO).

Who sets and uses error budgets?

A site reliability engineering (SRE) team comprises software engineers who improve the reliability of their systems and software in production. While SRE teams usually track an error budget, they don’t decide on their own how it should be spent. Instead, they work with development teams to set error budgets and policies.

The key stakeholders involved in creating the error budget are:

  • Product owners, including product managers, business analysts, and product leads, who speak to the development team on behalf of the customer to communicate customer needs and the user journey.

  • SRE and operations teams, including DevOps, ITSM, problem management, and infrastructure engineers, who use software to manage a service, solve problems, and automate operations tasks.

  • Engineers who work on the product.

  • Customers, since the SLOs are non-legally binding promises that the service provider makes to them.

What is the purpose of an error budget?

There is a delicate balance between releasing new features and maintaining an acceptable level of availability for customers. An error budget is used to track whether a company is meeting its reliability promises for a system or service, and it keeps the company from pursuing too much innovation at the expense of that system or service’s reliability.

In practice, an error budget serves as a data point for deciding when to accelerate innovation or implement freezes. When a company exceeds its error budget, SRE teams can pause innovation to eliminate persistent causes of errors from the system.

Error budgets and SLO

Error budgets can be measured in relation to availability or uptime, both of which are defined by a company’s SLO. In other words, an error budget is 1 minus the SLO of the service. A 99.9% SLO service has a 0.1% error budget.
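The arithmetic above can be sketched in a few lines of Python. The function name error_budget_percent is illustrative only and is not part of any tool named in this article.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Return the error budget as a percentage: 100 minus the SLO."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("SLO must be between 0 and 100 percent")
    return 100.0 - slo_percent


if __name__ == "__main__":
    # A 99.9% SLO leaves an error budget of about 0.1%
    # (formatted to one decimal place to hide floating-point noise).
    print(f"{error_budget_percent(99.9):.1f}%")
```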

Error budgets and SLI

As part of operationalizing SLOs, SRE teams translate SLI percentages into days and hours for software engineers. Service Level Indicators (SLIs) are the measures that indicate whether an SLO is met. The SLI ranges from 0% to 100%, where 0% means nothing works, and 100% means nothing is broken.
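One common way to arrive at such a percentage is to divide the number of successful events by the total number of events. The sketch below assumes that request-based formulation, which this article does not itself specify; availability_sli, meets_slo, and their parameters are hypothetical names.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage between 0 and 100.

    Assumption: the SLI is the share of events that succeeded.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return 100.0 * good_events / total_events


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_percent
```

With these definitions, 0 good events out of any total yields 0% (nothing works), and an SLI equal to the total yields 100% (nothing is broken).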

Error budgets and SLA

Some downtime is inevitable, which is why service level agreements (SLAs) should never promise 100% uptime. When an SLO is missed badly enough that the service also falls below the level promised in the SLA, the terms of the company’s SLA kick in.

Suppose a company has an SLA of 98% and an SLO of 99% availability. The error budget would be 1%, and 1% of a 28-day window is 6.72 hours of downtime. If the SLI dips below 99% over that 28-day window, then the service has used up all of its error budget and is no longer meeting the SLO.
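The conversion from a percentage to hours in this example can be checked with a short Python sketch. The function names are illustrative, and the 2 hours of observed downtime in the usage lines is a made-up input, not a figure from this article.

```python
def allowed_downtime_hours(slo_percent: float, window_days: int) -> float:
    """Hours of downtime the error budget permits over the window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24


def budget_remaining_hours(
    slo_percent: float, window_days: int, downtime_hours: float
) -> float:
    """Hours of budget left; a negative value means the budget is overspent."""
    return allowed_downtime_hours(slo_percent, window_days) - downtime_hours


if __name__ == "__main__":
    # 99% SLO over a 28-day window: 1% of 672 hours.
    print(f"{allowed_downtime_hours(99.0, 28):.2f} hours allowed")
    # Hypothetical: 2 hours of downtime observed so far in the window.
    print(f"{budget_remaining_hours(99.0, 28, 2.0):.2f} hours remaining")
```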

If availability is above the SLO target, teams can release new features and take risks. But if it falls below the target, releases halt until availability is back on track.

What happens if you’ve spent or are close to spending your error budget?

When an error budget is close to being spent, SRE teams work with the development team to implement alerts and policies to minimize the impact failures and outages have on customers. This alerting policy is what makes error budgets and SLOs actionable.

If a team burns through its entire error budget, then contingency policies can come into effect to prevent further customer impact. For example, the team might go into code red and freeze all new releases until the number of errors is adequately reduced.

If there are simply too many errors, then the SRE team may have to do a system rollback to give developers enough time to deal with the errors gradually and release the changes over time.
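The escalation described in this section can be sketched as a simple decision function. This is a minimal illustration under stated assumptions: the 80% alert threshold and the action names are hypothetical, each team would set its own policy, and the rollback decision is left out because it depends on judgment rather than a single number.

```python
def budget_policy(consumed_fraction: float, alert_threshold: float = 0.8) -> str:
    """Map error budget consumption to an action.

    consumed_fraction: share of the error budget already spent
        (1.0 means fully spent, above 1.0 means overspent).
    alert_threshold: hypothetical point at which the budget counts
        as "close to being spent".
    """
    if consumed_fraction < 0:
        raise ValueError("consumed_fraction cannot be negative")
    if consumed_fraction >= 1.0:
        return "freeze"   # budget spent: halt new releases
    if consumed_fraction >= alert_threshold:
        return "alert"    # budget nearly spent: notify SRE and development teams
    return "release"      # budget available: releases can continue


if __name__ == "__main__":
    for spent in (0.5, 0.85, 1.0):
        print(spent, budget_policy(spent))
```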

How Sumo Logic can help

Businesses are focused on achieving their goals, which is why they value robust observability platforms, like Sumo Logic, to help them measure their objectives and ensure they’re on track to meeting their KPIs, deadlines, and long-term strategies.

Try Sumo Logic’s free trial today to see how we can help you reach your goals and maintain quality assurance.
