What Happened?

On Friday, October 21st, Dyn, a major DNS provider, started having trouble due to a DDoS attack. Many companies, including PagerDuty, Reddit, and Twitter, suffered significant downtime. Sumo Logic had a short blip of failures but stayed up, allowing our customers to continue using our service seamlessly for monitoring and troubleshooting within their organizations.

How did Sumo Logic bear the outage?

Several months ago, we suffered a DNS outage and held a postmortem that focused on being more resilient to such incidents. We decided to create a primary-secondary setup for DNS. After reading quite a bit about how this should work in theory, we implemented a solution with two providers: Neustar and Dyn. This setup saved us during today’s outage. I hope you can learn from our setup and make your service more resilient as well.

How is a primary-secondary DNS setup supposed to work?

You maintain the DNS zone on the primary only. Any update to that zone gets automatically replicated to the secondary via two methods: a push notification from the primary and a periodic pull from the secondary.

The two providers stay in sync and you do not have to worry about maintenance of the zone.

Your registrar is configured with nameservers from both providers. Order does NOT matter.

DNS resolvers do not know which nameservers are primary and which are secondary. They just choose among all the configured nameservers.

Most DNS resolvers choose which nameserver to use based on the latency of prior responses. The rest choose at random. If you have 4 nameservers with 1 from one provider and 3 from another, the simpler resolvers will split traffic 1/4 to 3/4, whereas the ones that track latency will still hit the faster provider more often.

When there is a problem contacting a nameserver, DNS resolvers will pick another nameserver from the list until one works.

How to set up a primary-secondary DNS?
Sign up with two different companies that provide high-speed DNS services and offer a primary/secondary setup. My recommendations are NS1, Dyn, Neustar (UltraDNS), and Akamai. Currently, Amazon’s Route 53 does not provide zone transfer ability and therefore cannot support a primary/secondary setup. (You would have to change records at both providers and keep them in sync yourself.) Slower providers will not take on as much traffic as fast ones, so you have to be aware of how fast the providers are for your customers.

Configure one to be primary. This is the provider you use when you make changes to your DNS.

Follow the primary provider’s and secondary provider’s instructions to set up the secondary provider. This usually involves whitelisting the secondary’s IPs at the primary, adding notifications at the primary, and telling the secondary which IPs to use to get the transfer from the primary.

Ensure that the secondary is syncing your zones with the primary. (Check on their console and try doing a dig @nameserver domain for the secondary’s nameservers.)

Configure your registrar with both the primary’s and the secondary’s nameservers. We found out that the order does not matter at all.

Our nameserver setup at the registrar:

ns1.p29.dynect.net
ns2.p29.dynect.net
udns1.ultradns.net
udns2.ultradns.net

What happened during the outage?

We got paged at 8:53 AM for a DNS problem hitting service.sumologic.com. The pages came from our internal as well as our external monitors. The on-calls ran a “dig” against all four of our nameservers and discovered that Dyn was down hard.

We knew that we had a primary/secondary DNS setup, but neither provider had experienced any outages since we set it up, so the failover had never been exercised in production. We also knew that it would take DNS resolvers some time to decide to use the Neustar nameservers instead of the Dyn ones. Because our alarms had gone off, we posted a status page telling our customers that we were experiencing an incident with our DNS and asking them to let us know if they saw a problem.
Less than an hour later, our alarms stopped going off (although Dyn was still down). No Sumo Logic customers reached out to Support to let us know that they had issues.

[Graph: traffic decrease for one of the Sumo Logic domains during the Dyn outage]

[Graph: Neustar (UltraDNS) pulling in more traffic during the outage]

In conclusion

This setup worked for Sumo Logic. We do not have control over DNS providers, but we can prevent their problems from affecting our customers. You can easily do the same.
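The sync check from the setup steps (running dig against each provider's nameservers) can be automated. Here is a minimal Python sketch, not what we run in production. It assumes the dig command is installed and that both providers serve the zone's SOA serial unchanged. The function names and the example.com zone are made up for illustration.

```python
import subprocess


def parse_soa_serial(dig_short_output: str) -> int:
    """Extract the serial from `dig +short <zone> SOA` output.

    A short SOA answer has seven fields:
    mname rname serial refresh retry expire minimum
    """
    for line in dig_short_output.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[2].isdigit():
            return int(fields[2])
    raise ValueError("no SOA record found in dig output")


def query_soa_serial(zone: str, nameserver: str, timeout: float = 5.0) -> int:
    """Ask one specific nameserver for the zone's SOA serial (needs network)."""
    result = subprocess.run(
        ["dig", "+short", f"@{nameserver}", zone, "SOA"],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return parse_soa_serial(result.stdout)


def lagging_nameservers(serials: dict[str, int], reference: str) -> list[str]:
    """Return nameservers whose serial differs from the reference (primary) one."""
    expected = serials[reference]
    return sorted(ns for ns, serial in serials.items() if serial != expected)


# Hypothetical usage, with example.com standing in for your zone:
#   names = ["ns1.p29.dynect.net", "udns1.ultradns.net"]
#   serials = {ns: query_soa_serial("example.com", ns) for ns in names}
#   print(lagging_nameservers(serials, reference=names[0]))
```

A secondary can lag briefly right after a change, so a check like this is more useful if it only alarms when the mismatch persists across several runs.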
Companies that move fast put pressure on developers and QA to continually innovate and push software out. This leaves the people with the pager, quite often the same developers, dealing with a continuous flow of production problems. On-call pain is the level of interrupts (pager notifications), plus the level of work that the on-call is expected to perform “keeping the system up” during their shift. How can we reduce this pain without slowing down development or issuing decrees like “there shall be no errors in our logs”?

Assuming there is no time to overhaul monitoring systems or make major architecture changes, here is a step-by-step approach to reducing on-call pain.

Measure On-Call Pain

As always, start by measuring where you are now and setting a goal for where you want to be.

Figure out how often your on-call gets paged or interrupted over a long period of time, such as a week or a month. Track this number.

If your on-call is responsible for non-interrupt-driven tasks such as trouble tickets, automation, deployments, or anything else, approximate how much time they spend on those activities.

Set a realistic goal for how often it is acceptable for the on-call to get interrupted and how much of their time they should spend on non-interrupt-driven tasks. We all want to drive the interrupt-driven work to zero, but if your system breaks several times per week, it is not realistic for the on-call to be that quiet.

Continuously track this pain metric. Although it may not impact your customers or your product, it impacts the sanity of your employees.

Reduce Noise

The first step to reducing on-call pain is to systematically reduce the alert noise. The easiest way to do this is to ask the on-call to keep track of the noise (alarms that they did not have to fix).

Remove alarms where no action is required.

Adjust thresholds for alarms that were too sensitive.

Put de-duplication logic in place.
The same alarm on multiple hosts should log to the same trouble ticket and not keep paging the on-call.

If you have monitoring software that does flapping detection, put that in place. Otherwise, adjust thresholds in a way that minimizes flapping.

Stop Abusing Humans

Any time you have playbooks or procedures for troubleshooting common problems, ask yourself if you are engaging in human abuse. Most playbooks consist of instructions that require very little actual human intelligence. So why use a human to do them?

Go through your playbooks and write scripts for everything you can. Reduce the playbook procedure to “for problem x, run script x.”

Automate running those scripts. You can start by writing crons that check for a condition and run the script, and go all the way to a complex auto-remediation system.

Get The Metrics Right

Metrics can reduce on-call pain if used correctly. If you know and trust your metrics, you can create an internal Service Level Agreement that is reliable. A breach of that SLA pages the on-call. If you have the right type of metrics and are able to display and navigate them in a meaningful way, then the on-call can quickly focus on the problem without getting inundated with tens of alarms from various systems.

Create internal SLAs that alarm before their impact is felt by the customer. Ensure that the on-calls can drill down from the alarming SLA to the problem at hand.

Similar to deduping, preventing all related alarms from paging (while still notifying of their failure) relieves pager pain. The holy grail here is a system that shows alarm dependencies; a set of good dashboards can achieve much the same thing.

Decide On Severity

If an on-call is constantly working in an interrupt-driven mode, it’s hard for him or her to assess the situation. The urgency is always the same, no matter what is going on. Non-critical interrupts increase stress as well as time to resolution.
This is where the subject of severity comes in. Define severities from highest to lowest. These might depend on the tools you have, but generally you want three severities:

Define the highest severity. That is an outage or a major customer-facing incident. In this case, the on-call gets paged and engages other stakeholders, or an SLA breach pages all the stakeholders at the same time (immediate escalation). This one does not reduce any on-call pain, but it should exist.

Define the second severity. This is a critical event: an internal SLA fires an alarm or a major system malfunction happens. It is best practice to define this as an alarm stating that customers are impacted, or are going to be impacted within N hours if this does not get fixed.

Define the third severity. The third severity is everything else.

The on-call gets paged for the first two severities (they are interrupt-driven), but the third severity goes into a queue for the on-call to prioritize and work through when they have time. It is not interrupt-driven.

Create a procedure for the non-interrupt-driven work of the third severity.

Move alarms that do not meet the bar for the second severity into the third severity (they should not page the on-call).

Ensure that the third-severity alarms still get handled by the on-call and are handed off appropriately between shifts.

Make Your Software Resilient

I know that I began with statements like “assuming you have no time,” but now the on-calls have more time. The on-calls should spend that time following up on root causes and making the changes that will have a lasting impact on the stability of the software itself.

Go through all the automation that you have created through this process and fix the pieces of the architecture that have the most band-aids on them.

Look at your SLAs and determine areas of improvement. Are there spikes during deployments or single-machine failures?

Ensure that your software scales up and down automatically.
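To make the three severities and the de-duplication rule from earlier concrete, here is a rough Python sketch of how an alarm router might apply them. Everything in it is hypothetical (the Severity, Alarm, and OnCallRouter names, and the choice to de-duplicate by alarm name alone). A real paging tool will have its own model.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    OUTAGE = 1    # outage or major customer-facing incident
    CRITICAL = 2  # customers impacted now, or within N hours if not fixed
    QUEUE = 3     # everything else; not interrupt-driven


@dataclass(frozen=True)
class Alarm:
    name: str
    host: str
    severity: Severity


@dataclass
class OnCallRouter:
    pages: list = field(default_factory=list)    # (severity, alarm name) pages sent
    tickets: dict = field(default_factory=dict)  # alarm name -> set of affected hosts
    queue: list = field(default_factory=list)    # third-severity alarms awaiting triage

    def handle(self, alarm: Alarm) -> str:
        # The same alarm on multiple hosts logs to one ticket and does not re-page.
        first_seen = alarm.name not in self.tickets
        self.tickets.setdefault(alarm.name, set()).add(alarm.host)
        if not first_seen:
            return "deduplicated"
        # Only the first two severities interrupt the on-call.
        if alarm.severity in (Severity.OUTAGE, Severity.CRITICAL):
            self.pages.append((alarm.severity, alarm.name))
            return "paged"
        self.queue.append(alarm.name)
        return "queued"


router = OnCallRouter()
print(router.handle(Alarm("disk_full", "host1", Severity.CRITICAL)))    # paged
print(router.handle(Alarm("disk_full", "host2", Severity.CRITICAL)))    # deduplicated
print(router.handle(Alarm("cert_expires_soon", "host1", Severity.QUEUE)))  # queued
```

The sketch never closes tickets and ignores escalation if a known alarm later arrives at a higher severity. Both would need handling before anything like this went near a real pager.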
I have not covered follow-the-sun on-calls, where an on-call shift only happens during working hours and gets handed off to another region of the world. I have also not covered the decentralized model of having each development team carry primary pagers only for their piece of the world. I have left these topics out because they share the pain among more people rather than reduce it. I believe that a company can rationally make a decision to share the pain only once the on-call pain has been reduced as much as possible. So, I will leave the discussion of sharing the pain for another blog post.