System Down
Postmortem
ART19 suffered a complete outage for 73 minutes today. This was the result of mistakes on our part, and we extend our sincerest apologies. We strive for zero downtime; unfortunately, today we failed. What follows is an account of what went wrong and the corrective measures we are taking.
About two weeks ago, a routine operational review identified a need to vertically scale the cluster of servers driving a core database service: it was using about 70% of its available network bandwidth. The upgrade would require a few minutes of downtime. Because it was not critical, and to make maximum use of that downtime, we delayed the upgrade so we could simultaneously roll out a software release to the same cluster that offered several new features.
On Monday, September 18th, we rolled out performance enhancements to ART19 that dramatically improved the speed with which we return RSS feeds to podcast clients. At the same time, we rolled out the new database server software version for testing in our staging environment, where we discovered an issue that we are still working with the vendor to resolve.
We monitored the systems throughout last week to determine whether we needed to address the hardware limitation sooner than we could deploy the new software version; we found no reason to do so. This morning, however, we were alerted to an unusually large increase in traffic that was causing a small number of our users to experience poor performance. We identified this database cluster as the root cause of the performance issues, and our September 18th performance enhancements as the culprit: they had alleviated a bottleneck that previously restricted traffic to this part of our system.
Because the database cluster was now operating near its maximum network throughput, driven by the increased traffic and the larger objects it was serving, we escalated a ticket with our vendor to help us get new servers in place and continued to monitor the situation closely. We also scaled out our front-end web servers to help reduce the number of slow requests.
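For readers curious what scaling out a front-end tier can look like in practice, here is a minimal sketch. It assumes, purely for illustration, that the web servers sit behind an AWS Auto Scaling group; the post does not say which provider or tooling ART19 uses, so the group name, region, and capacity numbers below are hypothetical.

```python
# Hypothetical illustration only: assumes an AWS Auto Scaling group
# for the web tier. Group name, region, and capacity are made up.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Raise the desired number of web servers so that more in-flight
# (slow) requests can be held without exhausting the fleet.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="art19-web",   # hypothetical group name
    DesiredCapacity=24,                 # hypothetical target size
    HonorCooldown=False,                # act immediately during an incident
)
```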
Around 2:00 PM PDT (UTC-7), a perfect storm occurred: we began receiving a record rate of requests per minute, which was finally too much for the network connection available to this database cluster. The cluster's network card became completely saturated delivering these large cached objects. While the card was saturated, requests were merely slow, and because we had scaled out our web server fleet, more requests could be slow without causing anything worse than degraded performance. This could not be sustained for long, however: soon many requests could not be satisfied at all, and ART19 went completely down for many users.
After the service went down, we scaled out our web server fleet further in an attempt to serve more requests successfully. We also stopped all network traffic into our web server tier so that the now-massive fleet of front-end web servers could settle and become healthy. Once it had, at approximately 2:40 PM PDT (UTC-7), we resumed all network traffic and were able to serve clients again, but only for a few minutes: the underlying network bottleneck remained. At that point we decided that we could not mitigate the outage without replacing the database cluster, so we joined the vendor's support engineering team on a conference call and began that process.
We worked with the vendor’s support engineering team to safely migrate all of our data to new hosts with significantly more network bandwidth available. Once we finished migrating the most unstable database, at about 3:45 PM PDT (UTC-7), ART19 was operational again. Due to the massive size of our databases, it wasn’t until about 5:30 PM PDT (UTC-7) that we were able to complete the hardware upgrade.
We learned a number of things about our database cluster software, and about the failure modes of our own systems, as a result of this outage. We will be implementing several architectural improvements to make it impossible to saturate the NIC of any instance, and to ensure that sufficient alerting exists so that we can take action before a performance issue escalates into an outage. However, we also know that even the most well-architected systems with the most effective alerting still sometimes experience outages, so we are now investigating functionality that would allow our system to deliver fallback audio files to listeners even if it is completely down.
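As an illustration of the kind of alerting described above, the sketch below watches a network interface's transmit throughput and flags when utilization approaches the link's capacity. It is a minimal example only: the interface name, link speed, check interval, and threshold are assumptions, and it is not a description of ART19's actual monitoring.

```python
# Minimal sketch of a NIC-utilization check (Linux /proc/net/dev).
# Interface name, link speed, interval, and threshold are assumptions.
import time

INTERFACE = "eth0"            # hypothetical interface name
LINK_SPEED_BPS = 10e9         # hypothetical 10 Gbps link
ALERT_THRESHOLD = 0.80        # warn well before saturation
CHECK_INTERVAL_S = 60

def bytes_sent(interface):
    """Return the transmit byte counter for an interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            name, _, rest = line.partition(":")
            if name.strip() == interface:
                return int(rest.split()[8])  # field 9 is TX bytes
    raise ValueError(f"interface {interface!r} not found")

while True:
    start = bytes_sent(INTERFACE)
    time.sleep(CHECK_INTERVAL_S)
    end = bytes_sent(INTERFACE)
    utilization = (end - start) * 8 / CHECK_INTERVAL_S / LINK_SPEED_BPS
    if utilization >= ALERT_THRESHOLD:
        # In a real system this would page on-call rather than print.
        print(f"ALERT: {INTERFACE} at {utilization:.0%} of link capacity")
```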
3:46 PM PDT (UTC-7)
All systems are available. We are still investigating the root cause of failure and will post a full postmortem after we’ve had a chance to assess.
2:33 PM PDT (UTC-7)
ART19 is currently unavailable, and we are investigating. We hope to be back ASAP.