At Nobl9’s annual SLOconf—the first conference dedicated to helping SREs quantify the reliability of their applications through service level objectives (SLOs)—Sumo Logic shared our contribution of slogen to the OpenSLO community, as well as our commitment to OpenSLO as an emerging standard for expressing SLOs as Code.
slogen is an open source, SLO-as-code CLI tool based on the OpenSLO specification. slogen interprets SLOs and alert strategies defined by the OpenSLO specification and automatically creates SLO dashboards and monitors in Sumo Logic. The tool also contains several extensions to OpenSLO based on feedback from customers and our in-house engineering teams.
SLOs help teams define the reliability of their systems and services so they can make smart decisions about how to build and run applications. Service level indicators (SLIs) are a component of SLOs: they measure how a service is performing and are typically expressed as percentages.
“The emerging methodologies behind SLOs bring a level of simplification to the very daunting challenge of creating a great digital experience for the users of our systems,” said Christian Beedgen, CTO, Sumo Logic. “To me, OpenSLO acts as a substrate to encode many of the basics of this methodology for everybody to reuse. At Sumo, we are trying to do exactly that with the development of slogen.”
At Sumo Logic, we are actively using slogen to monitor production services and would like to share an example of how our engineering team uses this tool alongside our observability platform to define and monitor service reliability and, frankly, keep on-call pages to an absolute minimum.
Eating our own dogfood: Sumo’s SLOs
Having an observability product ourselves, we draw a hard line in the sand when it comes to dogfooding our platform and tools. Implementing, monitoring, and alerting on SLOs is no exception, and slogen—combined with Sumo’s analytics engine—plays a critical part in how we manage our SLOs as code.
One of the many critical workloads under scrutiny via SLIs/SLOs is our data pipeline, powering anomaly detection of metrics for our Root Cause Explorer. Root Cause Explorer is Sumo Logic’s automated root cause analysis capability. This service accelerates troubleshooting by detecting, contextualizing, and correlating anomalies, or events of interest, at the service, orchestrator, and infrastructure layers of a modern app.
Each bubble in the screenshot below represents an anomaly in an entity and its associated metric (e.g., CPU utilization). The y-axis position represents the percent drift of the metric from its expected value after factoring in periodicity. High-drift events of interest are more serious than lower-drift ones.
Example event of interest in Root Cause Explorer
As you can imagine, we want to surface these events of interest to customers as fast as possible since they represent anomalous conditions in the app stack. Drift calculations are an important prerequisite for creating events of interest. While there are several component SLOs in place tied to measuring the reliability of the overall data pipeline responsible for events of interest (e.g., feature engineering, noise reduction, application of hand-crafted rules to prevent false positives, etc.), one important metric is the latency introduced in calculating drift itself. As a result, we define an SLI for drift calculation latency and set an objective that 80% of drift calculation jobs should complete in under 4400 milliseconds.
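To make the objective above concrete, here is a minimal sketch of how a latency SLI of this kind could be computed and checked against the 80% target. The function names and the sample latencies are hypothetical and purely illustrative; in practice slogen and Sumo Logic derive the SLI from log queries rather than from an in-memory list.

```python
from typing import Sequence

LATENCY_THRESHOLD_MS = 4400  # threshold taken from the objective above
TARGET_RATIO = 0.80          # 80% of jobs should finish under the threshold


def latency_sli(latencies_ms: Sequence[float],
                threshold_ms: float = LATENCY_THRESHOLD_MS) -> float:
    """Return the fraction of jobs that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one job to compute an SLI")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_objective(sli: float, target: float = TARGET_RATIO) -> bool:
    """True when the measured SLI is at or above the target ratio."""
    return sli >= target


# Hypothetical sample: 10 drift calculation jobs, 8 of them under 4400 ms.
sample_latencies_ms = [1200, 3100, 4399, 4400, 2500, 900, 5100, 3800, 4100, 2000]
sli = latency_sli(sample_latencies_ms)
print(f"SLI = {sli:.0%}, objective met: {meets_objective(sli)}")
```

Note that a job taking exactly 4400 ms counts as a miss here, because the objective is phrased as "under 4400 milliseconds".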
Enter OpenSLO and slogen! Digging into the examples directory, you’ll find our drift-calculation.yaml file used to define our objectives and alert strategy for this measure.
Notice anything interesting? We’ve extended the OpenSLO spec! Line 15 includes the ability to specify a log query to compute an SLI, while lines 31 and 34 give you the ability to create multi-window, multi-burn-rate alerts. The team at Sumo Logic is actively working with the OpenSLO community to work these into the standard specification, but you can use them now with slogen.
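For readers new to multi-window, multi-burn-rate alerting, the idea can be sketched in a few lines. This is the conventional formulation, in which an alert fires only when both a long and a short window are burning budget faster than a threshold. It is not a description of how slogen's generated monitors are implemented, and the threshold and error ratios below are made-up values.

```python
def burn_rate(error_ratio: float, target: float) -> float:
    """How fast the error budget is being consumed; 1.0 means exactly on budget."""
    budget = 1.0 - target
    if budget <= 0:
        raise ValueError("target must be below 1.0")
    return error_ratio / budget


def should_alert(long_error_ratio: float, short_error_ratio: float,
                 target: float, threshold: float) -> bool:
    """Fire only if both windows exceed the burn-rate threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears quickly after recovery.
    """
    return (burn_rate(long_error_ratio, target) >= threshold
            and burn_rate(short_error_ratio, target) >= threshold)


# Hypothetical values: an 80% target and a burn-rate threshold of 2.
TARGET = 0.80
THRESHOLD = 2.0

# Sustained problem: both windows are burning budget fast.
print(should_alert(0.5, 0.6, TARGET, THRESHOLD))
# Already recovering: the long window is still high, the short one is not.
print(should_alert(0.5, 0.1, TARGET, THRESHOLD))
```

"Multi-burn-rate" then simply means defining several such window pairs, each with its own threshold and severity.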
With this SLO now defined as code, creating the related content in Sumo becomes devastatingly simple:
slogen path/to/slogen/samples/logs/drift-calculation.yaml --apply
Let’s take a quick look at some of the key content created in Sumo Logic by slogen. First up are the scheduled views:
After running Terraform, we see several scheduled views running log searches and storing the results. This pre-aggregated data in the scheduled views is made available to the monitors and dashboards to support high-performance dashboarding and real-time alerting. Next up are the monitors used to search, parse, compute, and alert on SLOs whenever the short or long window averages breach the alert threshold:
And finally, we have the dashboard specific to this particular SLO:
These visualizations help our team quickly intuit the overall reliability of this service. Below is an overview of some of the panels on this dashboard.
Availability: Daily, weekly, and monthly availability measurements against the SLO target.
SLO Breakdown: A breakdown of availability and error budget by dimensions important to Sumo Logic to quickly surface and prioritize reliability issues by geographical region and customer tier.
Hourly Burn Rate: The burn rate helps to surface specific times of the day when most failures happen.
Burn Rate Trend: A trend of today’s burn rate compared to the last seven days.
Budget Forecast: A forecast of the remaining error budget.
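As a rough illustration of the arithmetic behind the error budget and forecast panels, the sketch below computes the remaining budget from good and total event counts and projects it to the end of the compliance window. The linear extrapolation is an assumption chosen for simplicity and is not necessarily how the dashboard computes its forecast. All names and numbers are hypothetical.

```python
def budget_remaining(good: int, total: int, target: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = (1.0 - target) * total
    if allowed_bad <= 0:
        raise ValueError("target must be below 1.0")
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def forecast_budget(remaining_now: float, days_elapsed: float,
                    days_in_window: float) -> float:
    """Naive linear projection of the remaining budget at the end of the window."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    spent_so_far = 1.0 - remaining_now
    projected_spent = spent_so_far / days_elapsed * days_in_window
    return 1.0 - projected_spent


# Hypothetical numbers: 1000 jobs so far, 900 good, against an 80% target.
remaining = budget_remaining(good=900, total=1000, target=0.80)
# Ten days into a 30-day window, spending at the same pace.
projected = forecast_budget(remaining, days_elapsed=10, days_in_window=30)
print(f"remaining now: {remaining:.0%}, projected at window end: {projected:.0%}")
```

A negative projection means the budget would run out before the window ends if the current pace continues, which is the signal a forecast panel is meant to surface early.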
And that’s how we quantify reliability using SLOs for this service! Feel free to try it out yourself on Sumo with a free account, fork the project to customize, add your platform as a new target, swap out Terraform with Pulumi, propose more extensions to the OpenSLO specification, and hack away! Our training team has also created a video on how to use slogen and create SLOs as code.
We’re thrilled to be a part of the growing OpenSLO community and can’t wait to see where this project goes!