2022 Gartner® Magic Quadrant™ SIEM
Get the reportMore
What Happened?On Friday October 21st, Dyn, a major DNS provider, started having trouble due to a DOS attack. Many companies including PagerDuty, Reddit, Twitter, and others suffered significant downtime. Sumo Logic had a short blip of failures, but stayed up, allowing our customers to continue to seamlessly use our service for monitoring and troubleshooting within their organizations.
How did Sumo Logic bear the outage?
Several months ago, we suffered a DNS outage and had a postmortem that focused on being more resilient to such incidents. We decided to create a primary-secondary setup for DNS. After reading quite a bit about how this should work in theory, we implemented a solution with two providers: Neustar and Dyn. This setup saved us during today’s outage. I hope you can learn from our setup and make your service more resilient as well.
How is a primary-secondary DNS setup supposed to work?
How to set up a primary-secondary DNS?
Our nameserver setup at the registrar:
What happened during the outage?
We got paged at 8:53 AM for DNS problem hitting service.sumologic.com. This was from our internal as well as external monitors. The oncalls ran a “dig” against all four of our nameservers and discovered that Dyn was down hard.
We knew that we had a primary/secondary DNS setup, but neither provider had experienced any outages since we set it up. We also knew that it would take DNS Resolvers some time to decide to use Neustar nameservers as opposed to Dyn ones. Our alarms went off, so, we posted a status page telling our customers that we are experiencing an incident with our DNS and to let us know if they see a problem.
Less than an hour later, our alarms stopped going off (although Dyn was still down). No Sumo Logic customers reached out to Support to let us know that they had issues.
Here is a graph of the traffic decreases for one of the Sumo Logic domains during the Dyn Outage:
This setup worked for Sumo Logic. We do not have control over DNS providers, but we can prevent their problems from affecting our customers. You can easily do the same.
Reduce downtime and move from reactive to proactive monitoring.
Build, run, and secure modern applications and cloud infrastructures.Start free trial