Glossary

Error budget



    An error budget is how much downtime a system can afford without upsetting customers, or, in other words, the margin of error permitted by a service level objective (SLO).

    The key stakeholders involved in creating the error budget are:

    • Product owners, including product managers, business analysts, and product leads, who speak on behalf of the customer to the development team and communicate customer needs and the user journey.
    • SRE and operations teams, including DevOps, ITSM, problem management, and infrastructure engineers, who use software to manage a service, solve problems, and automate operations tasks.
    • Engineers who work on the product.
    • Customers, since the SLOs are non-legally binding promises that the service provider makes to them.

    There is a delicate balance between releasing new features and maintaining an acceptable level of availability for customers. An error budget tracks whether a company is meeting its promises for a system or service and prevents it from pursuing too much innovation at the expense of that system or service’s reliability.

    In practice, an error budget serves as a data point for deciding when to accelerate innovation or implement freezes. When a company exceeds its error budget, SRE teams can pause innovation to eliminate persistent causes of errors from the system.

    Error budgets and SLO

    Error budgets and SLI

    As part of operationalizing SLOs, SRE teams translate SLI percentages into days and hours for software engineers. Service-level indicators (SLIs) are the measures that indicate whether an SLO is being met. An SLI ranges from 0% to 100%, where 0% means nothing works and 100% means nothing is broken.
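    As a rough illustration of how an SLI percentage might be derived and compared against an SLO, the sketch below assumes a request-based availability SLI (good events divided by total events). The function names `availability_sli` and `meets_slo` are hypothetical, and the choice to treat a period with no traffic as 100% is an assumption, not something the text above prescribes.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return an availability SLI as a percentage between 0 and 100.

    Assumes a request-based SLI: the share of events that succeeded.
    """
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("expected 0 <= good_events <= total_events")
    if total_events == 0:
        # Assumption: with no traffic, nothing was observed to be broken.
        return 100.0
    return 100.0 * good_events / total_events


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    sli = availability_sli(good_events=995, total_events=1000)
    print(f"SLI: {sli}%  meets a 99% SLO: {meets_slo(sli, 99.0)}")
```

    Other SLI definitions (for example, time-based uptime) would need a different numerator and denominator, but the comparison against the SLO would look the same.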

    Error budgets and SLA

    Some downtime is inevitable, which is why service-level agreements (SLAs) should never promise 100% uptime. When an SLO is not met, the terms of a company’s SLA kick in.

    Suppose a company has an SLA of 98% and an SLO of 99% availability. The error budget would be 1%, and 1% of a 28-day window (672 hours) is 6.72 hours of downtime. If the SLI dips below 99% during that 28-day window, the service has used up its entire error budget and is no longer meeting the SLO.
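    The arithmetic in this example can be sketched in a few lines of Python. The function name `error_budget` is hypothetical, and the sketch assumes the budget is simply the fraction of the window left over by the availability SLO.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window_days: int) -> timedelta:
    """Return the downtime allowed by an availability SLO over a window."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    # The budget is whatever share of the window the SLO does not cover.
    budget_fraction = (100 - slo_percent) / 100
    return timedelta(days=window_days) * budget_fraction


if __name__ == "__main__":
    budget = error_budget(slo_percent=99.0, window_days=28)
    hours = budget.total_seconds() / 3600
    print(f"Allowed downtime: {hours:.2f} hours")
```

    For the 99% SLO and 28-day window above, this yields the 6.72 hours quoted in the example. A stricter target shrinks the budget quickly, which is one reason the target is worth agreeing on with all stakeholders.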

    If availability is above the number promised by the SLA/SLO, an SRE team can release new features and take risks. But if it’s below the target, releases halt until the target numbers are back on track.

    What happens if you’ve spent or are close to spending your error budget?

    When an error budget is close to being spent, SRE teams work with the development team to implement alerts and policies to minimize the impact failures and outages have on customers. This alerting policy is what makes error budgets and SLOs actionable.

    If a team burns through its entire error budget, contingency policies can come into effect to prevent further customer impact. For example, the team might go into code red and freeze all new releases until the number of errors is adequately reduced.
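    One way such a policy could be expressed in code is shown below. This is only a sketch: the `Action` names, the `budget_action` function, and the 80% alert threshold are all hypothetical choices made for illustration, and a real policy would be agreed on by the stakeholders listed earlier.

```python
from enum import Enum


class Action(Enum):
    RELEASE = "release normally"
    ALERT = "alert the team and slow down"
    FREEZE = "freeze new releases"


def budget_action(
    downtime_hours: float,
    budget_hours: float,
    alert_threshold: float = 0.8,  # assumed: alert once 80% of the budget is spent
) -> Action:
    """Map error budget consumption to a policy action."""
    if budget_hours <= 0:
        raise ValueError("budget_hours must be positive")
    if downtime_hours < 0:
        raise ValueError("downtime_hours must not be negative")
    consumed = downtime_hours / budget_hours
    if consumed >= 1.0:
        # Budget fully spent: contingency policy applies.
        return Action.FREEZE
    if consumed >= alert_threshold:
        # Budget nearly spent: warn before customers are affected further.
        return Action.ALERT
    return Action.RELEASE


if __name__ == "__main__":
    for downtime in (1.0, 6.0, 7.0):
        print(downtime, budget_action(downtime, budget_hours=6.72).value)
```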

    If there are simply too many errors, then the SRE team may have to do a system rollback to give developers enough time to deal with the errors gradually and release the changes over time.

    To use uptime effectively in error budgets, you’ll need to translate SLA/SLO targets into real numbers that development teams can use.


    FAQs

    To effectively implement error budget policies within a development team, clearly define service level objectives (SLOs) and service level indicators (SLIs) that align with the team’s goals. From there, consider the following best practices:

    • Establish a structured process for tracking, monitoring, and reporting on error budgets, error rates, and reliability improvements.

    • Encourage cross-functional collaboration between the development team, the site reliability engineering (SRE) team, and product owners to balance new feature development against system reliability.

    • Regularly review error budget consumption and remaining error budget to make informed decisions and address any SLO violations promptly.

    • Continuously evaluate and adjust error budget policies to meet reliability goals, customer experience standards and availability targets.

    Error budgets should ideally be reviewed and recalibrated regularly, typically aligned with the frequency of service level objective (SLO) reviews. This ensures that error budgets remain relevant to performance metrics and organizational goals. Depending on the specific needs of the system and the criticality of the services being provided, error budgets may be reevaluated monthly, quarterly or annually to ensure they accurately reflect the acceptable level of errors that can occur without compromising reliability.