Reduce downtime and move from reactive to proactive monitoring.
March 14, 2024
Noisy monitors can lead to alert fatigue, which frustrates engineers and hinders innovation. With our patent-pending anomaly detection capabilities built on the power of AI, you can eliminate 60-90% of alerts. A unique differentiator, Sumo Logic’s alerts can also trigger one or more playbooks to drive auto-diagnosis or remediation and accelerate time to recovery for application incidents.
Faster issue remediation means engineers can focus more time on development and releasing software. The combination of next-generation anomaly detection and automation is part of our AI-driven alerting capabilities available to all Sumo Logic customers that will change how you troubleshoot.
The noisiest 5% of monitors used by Sumo Logic trigger seven times per day, as shown in the graph below. Anecdotally, half of these noisy alerts trigger outside regular work hours. These stats suggest that the volume of alerts from modern applications can overwhelm on-call teams.
A substantial portion of these alerts is irrelevant or insignificant, contributing to alert fatigue among operators. The image below shows a monitor for a Sumo Logic customer, which, over three days, generated two false positives (i.e., false alarms) and one false negative (i.e., an alert was not triggered when it should have been). False positives are a distraction that pulls engineers away from their focused work, while false negatives hide genuine problems that developers need to act on.
For AI-driven alerting, we were convinced we could leverage real-time AI and ML to drive up our accuracy and keep developers focused on the work they do best.
Reducing false positives helps developers with alert fatigue, but when incidents do occur, you want to resolve them quickly. AI-driven alerting also features playbooks for automating incident diagnosis and, if necessary, recovery.
First-generation anomaly detection, such as Sumo Logic outlier, learns dynamic baselines from recent data points and can avoid the problem of finding optimum static thresholds. However, these techniques can still generate false positives because they:
AI-driven alerting addresses challenges with first-generation anomaly detection through the following strategies:
Model-driven anomaly detection: AI-driven alerts use 60 days of historical data (when available) to train and test an ML model so that hourly, daily and weekly (especially weekday/weekend) seasonality is factored into baselines. An anomaly is a data point that is unusual compared to the baseline or expected value.
AutoML: AI-driven alerts embed an AutoML framework in which the analytics tune themselves based on model performance on training datasets. Simply put, AutoML supports a “set it and forget it” experience with minimal user intervention.
Model contextual and dynamic thresholds: AI-driven alerts have a sensitivity setting (low sensitivity for signals that are expected to be noisy and high sensitivity for critical indicators). Additionally, the user can configure the incident detector based on context. For example, in the Cluster detector, the user can specify how many data points in a detection window of, say, five minutes need to be unusual before an alert triggers.
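To make the seasonality, sensitivity and cluster ideas above concrete, here is a minimal Python sketch. It is not Sumo Logic's model: it assumes a simple hour-of-week mean and standard deviation as the "baseline", and the `BAND_WIDTH` mapping, the function names and the synthetic traffic data are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev

# Hypothetical mapping from a sensitivity setting to the width of the
# expected band, in standard deviations (higher sensitivity = tighter band).
BAND_WIDTH = {"low": 4.0, "medium": 3.0, "high": 2.0}

def hour_of_week(ts):
    """Map a timestamp to one of 168 weekly slots (Monday 00:00 is slot 0)."""
    return ts.weekday() * 24 + ts.hour

def fit_baseline(history):
    """history: list of (datetime, value). Returns {slot: (mean, std)}."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[hour_of_week(ts)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}

def is_unusual(baseline, ts, value, sensitivity="medium"):
    """True if value falls outside the expected band for its weekly slot."""
    expected, spread = baseline[hour_of_week(ts)]
    return abs(value - expected) > BAND_WIDTH[sensitivity] * max(spread, 1e-9)

def cluster_alert(flags, min_unusual):
    """Trigger only if enough points in one detection window are unusual."""
    return sum(flags) >= min_unusual

# Synthetic history: eight weeks of hourly data, busy weekdays, quiet weekends.
start = datetime(2024, 1, 1)  # a Monday
history = []
for i in range(8 * 7 * 24):
    ts = start + timedelta(hours=i)
    level = 20.0 if ts.weekday() >= 5 else 100.0
    history.append((ts, level + ((i * 7) % 5) - 2))  # small deterministic jitter

baseline = fit_baseline(history)
saturday_noon = datetime(2024, 3, 2, 12)
monday_noon = datetime(2024, 3, 4, 12)

quiet_weekend = is_unusual(baseline, saturday_noon, 21.0)  # normal for a Saturday
quiet_weekday = is_unusual(baseline, monday_noon, 21.0)    # far below a Monday baseline

# Five one-minute points in a detection window; require at least three unusual.
window_flags = [True, False, True, True, False]
fire = cluster_alert(window_flags, min_unusual=3)
```

In this toy setup the same low value is treated as normal on a Saturday and as unusual on a Monday, which is the weekday/weekend effect that a baseline built only from recent data points can miss. The cluster check then keeps a single stray point from triggering an alert on its own.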
One of our preview customers for AI-driven alerting is a B2C modern application company that had many first-generation Sumo Logic outliers that were noisy primarily because they missed the weekday/weekend periodicity of their signals.
AI-driven alerting successfully modeled the periodicity in the data, as indicated by the blue line in the chart below; the red lines are the upper and lower bounds predicted by the ML model. With AI-driven alerting, false alarms were mitigated while alerts still triggered on genuine issues.
When incidents are detected correctly via anomaly detection or otherwise, you want to resolve them quickly to minimize customer impact and lost revenue. Recovery time for production incidents is about 30 minutes, which is driven largely by the ad hoc nature of reading through text playbooks, contacting subject matter experts, collecting diagnostics, forming hypotheses and taking action. What if diagnosis and/or recovery time could be reduced to five minutes through automation?
Sumo Logic Automation Service is now integrated with monitors. Any logs or metrics monitor can be associated with one or more playbooks authored by subject matter experts in the Automation Service. When the monitor triggers, the playbooks execute, cutting minutes or even hours from the response.
Here is an example of an auto-diagnosis playbook, in response to a site-down alert, where the customer is running six log searches and one metrics search in parallel, collating the results and alerting an on-call user with a summary of the incident. In some cases, the root cause might be part of the summary; in other cases, the playbook helps eliminate known root causes so that the on-call engineer can begin an ad hoc investigation. Either way, this auto-diagnosis playbook reduced the recovery time.
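The fan-out-and-collate pattern in that playbook can be sketched in a few lines of Python. This is only an illustration of the pattern, not the Automation Service itself: `run_search`, the search names and the canned hit counts are hypothetical stand-ins, and a real playbook action would call an actual search backend and notify the on-call user.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up results for illustration: which hypothetical searches find something.
CANNED_HITS = {"logs-3": 12, "metrics-1": 1}

def run_search(name):
    """Hypothetical stand-in for one log or metrics search."""
    return {"name": name, "hits": CANNED_HITS.get(name, 0)}

# Six log searches and one metrics search, as in the example above.
SEARCH_NAMES = [f"logs-{n}" for n in range(1, 7)] + ["metrics-1"]

def diagnose(search_names):
    """Run all searches in parallel and collate them into one summary."""
    with ThreadPoolExecutor(max_workers=len(search_names)) as pool:
        results = list(pool.map(run_search, search_names))  # keeps input order
    return {
        "searches_run": len(results),
        "findings": [r["name"] for r in results if r["hits"] > 0],
        "ruled_out": [r["name"] for r in results if r["hits"] == 0],
    }

summary = diagnose(SEARCH_NAMES)
```

The `findings` list corresponds to the case where a likely root cause makes it into the summary, while `ruled_out` corresponds to known causes the on-call engineer no longer needs to check by hand.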
While these examples are related to application incidents, AI-driven alerting is also relevant for security alerts, by cutting noise and automating incident response through playbooks. Many Cloud SIEM customers use playbooks already; with this release, playbooks can be attached to any logs-based security monitor. With Flex Licensing, our aim is to cover 100% of your logs and use cases.
AI-driven alerting reduces alert noise and accelerates incident diagnosis and recovery time, changing how you troubleshoot and secure your applications and infrastructure. Learn more about AI and log analytics, or start your free trial and test it for yourself.