Archive for April, 2011

Post-mortem of today’s DNS outage

April 29th, 2011 by Tony

For those who might not have been following or affected, chartbeat just suffered, and is recovering from, a major DNS failure that affected our users’ dashboards. I wanted to give some insight into what happened and explain how we will do things differently in the future.

Yesterday evening, one of the nameservers at our DNS provider started reporting 0.0.0.0 as the IP address for static.chartbeat.com. As you can tell, this isn’t a real IP address, and we were stumped as to why it was happening since we had not made any changes that might affect it.

The static.chartbeat.com subdomain holds all of our static assets, including the images, css, and javascript for our dashboards and the javascript we use to report visitor statistics to our servers. Because of this DNS error, static.chartbeat.com was unreachable for many people. This didn’t have any effect on people visiting our customers’ sites, but it did mean the visitors who were hitting the bad nameserver weren’t being reported. As a result, dashboards showed a dip in traffic.

After being immediately alerted by Nagios, we identified the offending nameserver and reached out to our DNS provider to find out what the hell was happening. At the same time, we removed the entry for that nameserver from our system, taking it out of circulation.

We monitored the effects of the changes and everything seemed to go back to normal until early this morning, when our DNS provider began to pull the same trick on a larger scale across multiple nameservers. For some reason, the time to live (TTL) on some cached records was being set to 12 hours instead of two hours, meaning any change we made could take up to 12 hours to fully propagate across the web. The wall still bears indentations from my head at this point.

It quickly became apparent that our current DNS provider wasn’t going to be able to fix the situation in the timeframe we needed, so we reached out to Dynect, the DNS provider behind Twitter and bit.ly.
Dynect was great and we were able to move our entire infrastructure over to their services before the morning was out. The changes would take a while to propagate because of the rogue TTL setting at our old DNS provider, but at least we knew that when the changes rolled out we’d be on a much more bulletproof DNS system and everyone’s traffic would be back to normal. And that brings us to now.

Dynect and Akamai were both awesome and super responsive throughout, and the bit.ly guys were a great source of advice and support. We were also blown away by the response from our users, many of whom tweeted or emailed incredibly kind messages. Some of them were captured in Erin Griffith’s Adweek piece today.

What did we learn? Aside from the immediate lessons around which DNS provider to use, I’d say we were probably too optimistic at first about how easily this would be resolved. Once we acted to fix the first bad nameserver, we implicitly assumed things would get better, not worse, and missed a valuable window to prepare for more extreme options. We should have reached out to Dynect much earlier and had an alternative prepared just in case the situation recurred, rather than simply reacting when everything went crazy a few hours later. We should have had (and will be implementing) a protocol to explore several scenarios and what we need to do to mitigate them, rather than simply assuming any crisis is going to follow the path we implicitly think it will.

In the end, it doesn’t matter whether it’s an external service or an internal bug that fails: the responsibility for providing you with the service you deserve is ours, and we let you down. We’re incredibly sorry that our users were affected by these issues, we’re humbled by the response, and we’re grateful for your support.

Tony Haile, General Manager
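
For readers who want to see the two technical details in this post-mortem in concrete terms (a nameserver answering 0.0.0.0, and a 12-hour TTL delaying a fix), here is a minimal Python sketch. It is an illustration only, not chartbeat's actual Nagios check. The nameserver names, the "good" address, and the timestamp are hypothetical, and the answers are assumed to have been collected beforehand, for example by asking each nameserver directly with a tool such as dig.

```python
import ipaddress
from datetime import datetime, timedelta


def is_plausible_answer(answer):
    """Return False for answers that cannot be a real server address.

    0.0.0.0 is the IPv4 "unspecified" address, so it is never a valid
    destination for a hostname such as static.chartbeat.com.
    """
    try:
        ip = ipaddress.ip_address(answer)
    except ValueError:
        return False  # not an IP address at all
    return not (ip.is_unspecified or ip.is_loopback)


def bad_nameservers(answers_by_nameserver):
    """List the nameservers whose answer fails the plausibility check."""
    return sorted(
        ns for ns, answer in answers_by_nameserver.items()
        if not is_plausible_answer(answer)
    )


def latest_cache_expiry(change_time, ttl_seconds):
    """Latest moment a resolver could still serve the old, cached record.

    A resolver that cached the record just before the change keeps it
    for the full TTL, so that is the worst case for propagation.
    """
    return change_time + timedelta(seconds=ttl_seconds)


# Hypothetical answers from two nameservers for the same hostname.
answers = {
    "ns1.example.net": "203.0.113.10",  # hypothetical good address
    "ns2.example.net": "0.0.0.0",       # the kind of answer seen in the outage
}
print(bad_nameservers(answers))

# Hypothetical change time; compare a two-hour TTL with a 12-hour TTL.
changed_at = datetime(2011, 4, 29, 6, 0)
print(latest_cache_expiry(changed_at, 2 * 3600))
print(latest_cache_expiry(changed_at, 12 * 3600))
```

The design choice here is to check each nameserver's answer separately instead of trusting a single lookup, since in an incident like this one only some nameservers return the bad address.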

Recovery Update

April 29th, 2011 by Isaac

Update: 1:17 PM
We have completed our DNS migration to Dynect (http://dyn.com/enterprise-dns/dynect-platform). They are the industrial-strength DNS service powering Twitter and bit.ly, and have been incredibly responsive. We’d also like to thank Akamai (http://www.akamai.com/), who were extremely helpful throughout. Once again, we’re sorry for the problems earlier today and we’re doing everything we can to make sure we’re using the most reliable services available moving forward. We’ll be posting a detailed explanation of the DNS issues later this afternoon. In the meantime, please reach out to us at support@chartbeat.com with any questions or concerns.
------
Update: 12:10 PM
We’ve completed our migration to the new DNS provider. Traffic levels in the dashboard will begin to get better soon but, because of caching, it will take a while for a complete recovery. We’re sincerely sorry for this problem and we’re here to answer any questions you may have about your site and chartbeat. Please send us an email at support@chartbeat.com anytime.
------
Update: 11:13 AM
We’re in the process of moving to a new DNS provider. As the correct IP address of the nameserver begins to propagate, you will see traffic numbers on your dashboard recover. We will continue to update as we see improvements on our end.
------
Update: 6:39 PM
We’ve completed the DNS migration and the changes are propagating across the internet. Numbers should return to normal soon. A full post-mortem of the situation is posted here. Thank you for your patience.

Service Interruption Update

April 29th, 2011 by Isaac

Yesterday evening, we began to see domain name resolution issues with our DNS provider. As a result, for a subset of chartbeat customers the static.chartbeat.com subdomain is resolving to an invalid IP address, and the chartbeat ping javascript cannot be loaded.

This doesn’t have any effect on the users visiting your site, but it means that some dashboards will only show a sample of visitors for a little while.

We’re working to migrate our DNS records to a new service provider. We’ll send another update shortly.

Thank you for your continued patience. We’re really sorry and we’re working as hard as we can to get everyone back to normal.