Post-mortem of today’s DNS outage
For those who might not have been following or affected, chartbeat just suffered, and is recovering from, a major DNS failure that affected our users’ dashboards. I wanted to give some insight into what happened and explain how we will do things differently in the future.
Yesterday evening, one of the nameservers at our DNS provider started reporting 0.0.0.0 as the IP address for static.chartbeat.com. As you can tell, this isn’t a real IP address and we were stumped as to why it was happening since we had not made any changes that might affect it.
After being immediately alerted by Nagios, we identified the offending nameserver and reached out to our DNS provider to find out what the hell was happening. At the same time, we removed the entry for that nameserver from our system, taking it out of circulation.
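For readers curious what spotting the offending nameserver can look like, here is a minimal sketch in Python. It is not the Nagios check we actually run. It assumes you have already asked each authoritative nameserver for the record and collected the answers into a dictionary; the function name, the nameserver hostnames and the IP addresses are all hypothetical placeholders.

```python
import ipaddress


def find_bad_nameservers(answers_by_nameserver):
    """Return {nameserver: [bad answers]} for servers handing out unusable addresses.

    An answer counts as bad if it does not parse as an IP address at all,
    or if it is the unspecified address (0.0.0.0 for IPv4, :: for IPv6).
    """
    bad = {}
    for nameserver, answers in answers_by_nameserver.items():
        invalid = []
        for answer in answers:
            try:
                address = ipaddress.ip_address(answer)
            except ValueError:
                # Not an IP address at all.
                invalid.append(answer)
                continue
            if address.is_unspecified:
                invalid.append(answer)
        if invalid:
            bad[nameserver] = invalid
    return bad


# Hypothetical answers, as if collected by querying each nameserver directly.
sample_answers = {
    "ns1.example.net": ["203.0.113.10"],
    "ns2.example.net": ["0.0.0.0"],
}
bad_nameservers = find_bad_nameservers(sample_answers)
print(bad_nameservers)  # {'ns2.example.net': ['0.0.0.0']}
```

Checking each nameserver separately matters here: a lookup through an ordinary resolver only tells you that something is wrong some of the time, not which server is responsible.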
We monitored the effects of the changes and everything seemed to go back to normal until early this morning, when our DNS provider began to pull the same trick on a larger scale across multiple nameservers. For some reason, the cache lifetime (TTL) on some of our DNS records was being set at 12 hours instead of two hours, meaning any change we made could take up to 12 hours to fully propagate across the web. The wall still bears indentations from my head at this point.
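To make the TTL arithmetic concrete, here is a small illustrative sketch. It assumes the worst case, where a resolver cached the old record just before we changed it, so it can keep serving the stale answer for one full TTL. The function name and the timestamp are placeholders, not the actual time of our change.

```python
from datetime import datetime, timedelta


def latest_stale_time(change_time, ttl_seconds):
    """Latest moment a resolver that cached just before the change may still serve the old record."""
    return change_time + timedelta(seconds=ttl_seconds)


# Placeholder timestamp, chosen only to make the arithmetic easy to follow.
change_time = datetime(2000, 1, 1, 9, 0)

with_intended_ttl = latest_stale_time(change_time, 2 * 3600)   # two-hour TTL
with_rogue_ttl = latest_stale_time(change_time, 12 * 3600)     # twelve-hour TTL

print(with_intended_ttl)                   # 2000-01-01 11:00:00
print(with_rogue_ttl)                      # 2000-01-01 21:00:00
print(with_rogue_ttl - with_intended_ttl)  # 10:00:00
```

In other words, the rogue TTL added up to ten hours to the worst-case wait for any fix to reach everyone.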
It quickly became apparent that our current DNS provider wasn’t going to be able to fix the situation in the timeframe we needed, so we reached out to Dynect, the DNS provider behind Twitter and bit.ly. Dynect was great and we were able to move our entire infrastructure over to their services before the morning was out. The changes would take a while to propagate because of the rogue TTL setting at our old DNS provider, but at least we knew that when the changes rolled out we’d be on a much more bulletproof DNS system and everyone’s traffic would be back to normal.
And that brings us to now. Dynect and Akamai were both awesome and super responsive throughout, and the bit.ly guys were a great source of advice and support. We were also blown away by the response from our users, many of whom tweeted or emailed incredibly kind messages. Some of them were captured in Erin Griffith’s Adweek piece today.
What did we learn?
Aside from the immediate lessons around which DNS provider to use, I’d say we were probably too optimistic at first about how easily this would be resolved. Once we acted to fix the first bad nameserver, we implicitly assumed things would get better, not worse, and missed a valuable window to prepare for more extreme options. We should have reached out to Dynect much earlier and had an alternative prepared just in case the situation recurred, rather than simply reacting when everything went crazy a few hours later. We should have had (and will be implementing) a protocol to explore several scenarios and what we need to do to mitigate them, rather than simply assuming any crisis is going to follow the path we implicitly think it will.
In the end, it doesn’t matter whether it’s an external service or an internal bug that fails: the responsibility for providing you with the service you deserve is ours, and we let you down. We’re incredibly sorry that our users were affected by these issues, we’re humbled by the response and we’re grateful for your support.
Tony Haile, General Manager