Complete visibility for DevSecOps
Reduce downtime and move from reactive to proactive monitoring.
A quarter has passed since we launched our Reliability Management capabilities that help developers focus on defining, monitoring and managing Service Level Objectives (SLOs) to drive great digital experiences. Reducing alert fatigue and balancing innovation with reliability are common outcomes that customers expect from Reliability Management.
If you are new to SLOs, these insights from our customers capture common practices among peer developers. Where possible, we showcase best practices that you can adopt on your own journey. If you have practiced SLOs for a while, some of these insights might surprise you!
75% of organizations have five or fewer SLOs in place, indicating that they are in the early stages of SLO adoption. 12.5% of customers have more than 50 SLOs. These customers have automated the SLO management process through APIs.
Customers in APAC, particularly Australia and India, have 17 SLOs on average, while NAM customers have created 12 on average. In contrast, and surprisingly, EU customers lag behind with just two SLOs per organization on average.
Anecdotally, many Sumo Logic customers in Australia and India are born-in-the-cloud, modern app companies, while other regions feature a mix of modern and modernizing applications. At first glance, this suggests that SLO adoption is most prevalent in modern app contexts, although more data and contextual analysis will be required to confirm this hypothesis.
Over 90% of SLOs are based on the latency, availability and error indicators. We have always recommended latency and errors (or a combined error-free latency) as best practice signals for SLOs, as they are good indicators of great user experiences. It is great to see that reflected in the data.
Somewhat surprising is the relative importance of availability as an SLO signal. Simply put, an unavailable app has effectively infinite latency, so a latency SLO would capture availability as well. We suspect customers want redundancy in their SLO strategy.
This is similar to redundancy for monitors: some Sumo Logic customers, including Sumo Logic itself, run more monitors than strictly necessary to guard against the potential unreliability of any particular monitor. Customers use the "other" category for non-observability indicators; for example, one customer uses SLOs to assess the reliability of their security operations.
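The earlier point that a latency SLO also captures availability can be illustrated with a minimal sketch. It assumes, purely for illustration, that a request that never received a response is recorded with an infinite latency; the function name, threshold and sample values are hypothetical and do not reflect how Sumo Logic computes its indicators.

```python
import math


def latency_sli(latencies_ms, threshold_ms):
    """Return the fraction of requests answered within threshold_ms.

    Assumption: a request that never got a response is recorded as
    math.inf, so it can never count as a "good" request.
    """
    if not latencies_ms:
        return None  # no traffic, so the indicator is undefined
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


# Hypothetical sample: three served requests and two that hit an outage.
sample = [120.0, 180.0, 950.0, math.inf, math.inf]
sli = latency_sli(sample, threshold_ms=500.0)
print(f"latency SLI: {sli:.2f}")
```

In this sample the two unanswered requests drag the latency indicator down just as the slow request does, which is why a separate availability SLO is largely redundant with a latency SLO.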
Error SLOs feature the most aggressive targets, with a median of 99.95%. Latency and other signals are less aggressive, with a median target of 99%. This seems reasonable, as users are likely to tolerate a slow response more readily than a failed one.
73% of SLOs are defined on logs, which underscores the importance of logs for top-level application observability.
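To get a feel for how different these two median targets are, the following sketch converts a target into an error budget, that is, the amount of "bad" allowed before the SLO is breached. The helper name and the choice of a 30-day period and one million requests are illustrative assumptions, not figures from the customer data.

```python
def error_budget(total_units, target_percent):
    """Return how many of total_units may be bad while still meeting the target."""
    return total_units * (100 - target_percent) / 100


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes

for target in (99.95, 99.0):
    minutes = error_budget(MINUTES_IN_30_DAYS, target)
    requests = error_budget(1_000_000, target)
    print(f"{target}% target: about {minutes:.1f} bad minutes per 30 days, "
          f"or about {requests:.0f} bad requests per million")
```

Under these assumptions, a 99.95% target leaves roughly 21.6 minutes of budget over 30 days, while a 99% target leaves about 7.2 hours, a twentyfold difference.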
62% of SLOs use request-based evaluation. Request-based SLOs are easier to understand and set up than window-based SLOs, so we are surprised that customers do not show the overwhelming preference for them that we expected.
It is possible that customers find that window-based SLOs lessen the impact of particularly bad days on their reliability, despite being more complex to configure.
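The difference between the two evaluation types can be sketched as follows. This is a generic illustration of the usual definitions rather than Sumo Logic's implementation: the function names, the per-window threshold of 99% and the sample traffic are all assumptions.

```python
def request_based_sli(windows):
    """Good requests divided by total requests across the whole period."""
    good = sum(g for g, _ in windows)
    total = sum(t for _, t in windows)
    return good / total if total else None


def window_based_sli(windows, window_threshold):
    """Fraction of windows whose own success ratio meets window_threshold.

    Windows without traffic are skipped (an assumption of this sketch).
    """
    with_traffic = [(g, t) for g, t in windows if t > 0]
    if not with_traffic:
        return None
    good_windows = sum(1 for g, t in with_traffic if g / t >= window_threshold)
    return good_windows / len(with_traffic)


# Hypothetical period: nine healthy windows and one window where a traffic
# spike coincided with an outage (1,000 good out of 6,000 requests).
windows = [(1000, 1000)] * 9 + [(1000, 6000)]

print(f"request-based: {request_based_sli(windows):.3f}")
print(f"window-based:  {window_based_sli(windows, window_threshold=0.99):.3f}")
```

In this made-up period, the bad window costs the window-based SLO exactly one window out of ten no matter how many requests failed inside it, whereas the request-based SLO is penalized for every one of the 5,000 failed requests.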
75% of SLOs use rolling compliance periods, compared to 25% that use calendar compliance periods. We speculate that calendar compliance is used by customers that offer Service Level Agreements (SLAs). Anecdotally, these are less common for modern apps.
Some customers have also mentioned aligning compliance periods with sprint boundaries used by their development teams. Such a practice, when adopted more broadly, would result in greater prevalence of calendar compliance periods.
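As a rough illustration of how the two kinds of compliance period differ, the sketch below computes the period boundaries for a given moment. The function names are hypothetical, and the calendar week is assumed to start on Monday at midnight; a real product may define its boundaries and time zones differently.

```python
from datetime import datetime, timedelta


def rolling_period(now, days):
    """A rolling period always ends at 'now' and looks back a fixed number of days."""
    return now - timedelta(days=days), now


def calendar_week_period(now):
    """A calendar period has fixed boundaries; here, Monday 00:00 to the next Monday 00:00."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = midnight - timedelta(days=now.weekday())  # weekday() is 0 on Monday
    return start, start + timedelta(days=7)


now = datetime(2024, 5, 15, 14, 30)  # a Wednesday, chosen arbitrarily
print("rolling 7 days:", rolling_period(now, 7))
print("calendar week: ", calendar_week_period(now))
```

A rolling period slides forward continuously, so a bad day stays in view for the full duration. A calendar period resets at each boundary, which is what makes it a natural fit for SLA reporting or for sprint-aligned reviews.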
The most common durations for rolling compliance are seven and 30 days, followed by one day. This may be the result of anchoring, as our user interface offers day, week and month as drop-down choices.
The most common duration for calendar compliance is one week, with 98% of SLOs using this value. We suspect this is the result of anchoring, as our user interface offers calendar week as a drop-down choice.
Customers set up monitors for only about 6% of SLOs. In other words, customers most often consult dashboards to assess SLO performance and don’t opt to get alerted on it. We’re surprised by this, as we assumed that teams would prefer to be informed automatically when reaching various SLO thresholds. The preference for tracking SLOs directly in dashboards implies that teams are actively monitoring them on an ongoing basis.
Many customers use SLOs as a planning tool to balance innovation with reliability — and planning activities are best served via dashboards. We will continue to watch this indicator for shifts in customer behavior.
When customers do set up monitors for SLOs, these monitors trigger only two times in 30 days. In other words, SLO alerts trigger rarely, which suggests that customers are already doing a great job with reliability. Perhaps this is also a sign that those using this capability in Sumo Logic were already more established in their reliability journey.
In addition, SLO-based monitoring has the potential to streamline alerts significantly. We will continue to watch this indicator as a sign of enhanced progress in reliability management across our customer base.
Learn more about how SLOs, and more generally reliability management, can improve your decision making. Get started today using this microlesson.