Outage - Postmortem
Around 12:30 PM PST on January 27, 2017, a customer added the ART19 web player to a very high-traffic website. This resulted in a surge in traffic, which we were immediately alerted to. Our systems were scaled up to handle the surge; however, end user performance was still unacceptable after the new systems came online. We looked at the health of all systems at ART19 and saw that the database server had no free memory. Looking at its history, we found that the server had been operating with an insufficient amount of free memory for some time. Prior to the traffic surge, the database server was performing as expected. After the surge, however, the database was unable to handle the traffic and was regularly returning queries ten times slower than our typical 99th percentile query speed. We also started to see an increase in the number of refused requests, and we made the decision to perform an emergency scale up operation on the database server around 1:10 PM PST.
We utilize the Relational Database Service (RDS) by Amazon Web Services and run our database with a Multi-AZ configuration. A scale up operation is a typical maintenance task that results in little to no downtime. We have performed many scale up operations in the past, and all of them resulted in ART19 being unavailable for 30 seconds or less (typically 3-5 seconds, fast enough that our load balancing layer reroutes your request and end users see zero downtime). This scale up operation was expected to be typical, like all other scale operations in our history.
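For context, a routine scale up of this kind is a single change request against RDS. The following is a minimal sketch using boto3, the AWS SDK for Python; the instance identifier, region, and target class are placeholders, not our actual values.

```python
# Minimal sketch of a single-change RDS scale up (identifier, region, and
# class are placeholders).
import boto3

rds = boto3.client("rds", region_name="us-west-2")  # assumed region

# With Multi-AZ, RDS applies the change to the standby first and then fails
# over, so the endpoint is typically only unavailable for a few seconds.
rds.modify_db_instance(
    DBInstanceIdentifier="example-db",   # hypothetical identifier
    DBInstanceClass="db.m4.2xlarge",     # single change: a larger instance class
    ApplyImmediately=True,               # emergency change; do not wait for the maintenance window
)
```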
During this scale up operation, we changed from the m4 to the r3 instance class, increased our provisioned input/output operations per second (IOPS) in preparation for increased traffic in the coming months, and chose to upgrade our database version. This combination of changes caused RDS to take our database fully offline, which is not typical. Typically, RDS makes your requested changes on the slave instance, promotes it to master, and demotes the old master; it is this switch over in roles that results in downtime, and it normally takes only a few seconds.

After the version upgrade completed at 1:13 PM PST, the database became available for queries again. However, at this point the database server was still on an instance that was too small to handle the increased traffic, and since the server had been restarted for the version upgrade, no query results were cached in memory. The result was that every end user request took at least 30 seconds, and most took hundreds of seconds; most browsers give up well before we were able to return a result. In addition, the influx of slow queries caused transactions to deadlock, which meant that RDS would not continue to the next step (upgrading to a different instance type) until we killed all queries and stopped traffic from reaching the database server. We initially did this by disabling our application’s ability to communicate with the database server, but this caused our application servers to become unhealthy, which caused other issues.
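As an illustration of what “killing all queries” involves, the sketch below assumes a PostgreSQL engine; the connection string and the 30 second cutoff are placeholders, and other engines have equivalent commands.

```python
# Sketch: terminate long-running queries so a blocked RDS operation can
# proceed. Assumes PostgreSQL; connection string and threshold are placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://admin@example-db.rds.amazonaws.com/app")  # hypothetical
conn.autocommit = True
with conn.cursor() as cur:
    # Terminate every other backend whose query has been running for more
    # than 30 seconds.
    cur.execute(
        """
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE state = 'active'
          AND pid <> pg_backend_pid()
          AND now() - query_start > interval '30 seconds'
        """
    )
```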
RDS completed the hardware upgrade around 3:00 PM PST. After testing, we opened the system back up, and CPU usage immediately went to 100%, even though we did not see any issues with any other resources on the system and even though it performed as expected when we tested it. According to the RDS documentation, storage performance upgrades can result in reduced performance while in progress, so we expected some amount of poor performance for the first few minutes of operation. After about half an hour, performance hadn’t improved, and we noticed that performance remained constant independent of concurrency, so we increased our stack size significantly with the expectation that we would be able to serve more clients, even if we were serving them slowly.
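Checks like the ones above largely come down to reading the instance’s CloudWatch metrics. A sketch of that kind of check with boto3, using a placeholder instance identifier and region:

```python
# Sketch: pull recent CPU and freeable-memory datapoints for an RDS instance
# from CloudWatch (identifier, region, and time window are placeholders).
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch", region_name="us-west-2")  # assumed region
now = datetime.now(timezone.utc)

for metric in ("CPUUtilization", "FreeableMemory"):
    stats = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "example-db"}],  # hypothetical
        StartTime=now - timedelta(minutes=30),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point["Average"])
```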
RDS completed the storage upgrade around 4:00 PM PST, but our increased capacity had not significantly increased the number of requests per second we were able to serve, because our health checks were marking instances unhealthy if they took more than one second to respond. This meant that our web application was only available on less than 1% of our running application containers. We made some changes to our health check infrastructure to be more tolerant of application servers that were responding slowly. This allowed more concurrency (20% of our stack was reporting healthy), and we were able to serve a few thousand requests per minute, but all of them were very slow and we were clearly still experiencing problems with our database.
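As an illustration of the health check change described above, the sketch below assumes the checks are performed by an AWS Application Load Balancer target group; the ARN and thresholds are placeholders.

```python
# Sketch: loosen a health check so slow-but-working app servers are not
# marked unhealthy. Assumes an ALB target group; values are placeholders.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-west-2")  # assumed region

elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/app/abc123",  # placeholder
    HealthCheckTimeoutSeconds=10,   # tolerate responses slower than one second
    HealthCheckIntervalSeconds=30,
    UnhealthyThresholdCount=5,      # require several consecutive failures before marking unhealthy
)
```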
Around 5:00 PM PST, we performed a Multi-AZ fail over to rule out a hardware issue on the database server. It took about 5 minutes to fail over, and the new server exhibited the same performance problem. We restarted all of our web application containers and scaled down our job processors to rule out any poorly behaving application servers. It took about 20 minutes for the infrastructure to settle, and the problem still persisted. Around 5:30 PM PST, we initiated a full (non-fail over) reboot of the database server. We also blocked all traffic at our load balancer, to see under which conditions the database server locked up. It took about 10 minutes for the instance to come back online, and it was then performing as expected. We slowly turned on various points of presence (POPs) on our load balancer, and as soon as we were doing 100 requests per minute, the database server exhibited the same performance problems as before. This was despite the instance type we had moved to having more than four times the compute and memory of the previous instance type.
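Both of the reboot variants above are single RDS API calls. A sketch with boto3, again with a placeholder instance identifier:

```python
# Sketch of the two reboot variants (identifier and region are placeholders).
import boto3

rds = boto3.client("rds", region_name="us-west-2")  # assumed region

# Multi-AZ fail over: reboot with failover promotes the standby in the other
# Availability Zone, which rules out the current primary's hardware.
rds.reboot_db_instance(DBInstanceIdentifier="example-db", ForceFailover=True)

# Full (non-fail over) reboot of the same instance.
rds.reboot_db_instance(DBInstanceIdentifier="example-db", ForceFailover=False)
```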
Around 6:15 PM PST, we cut traffic off again and measured how long it took for the in-flight requests to finish, so that we had some data points to work from. The fastest completed in 45 seconds, the slowest in 1800 seconds (30 minutes). These are requests that typically complete in 75 milliseconds. Around 6:45 PM PST, we made the decision to switch back to the m4 instance class. This change was performed on the slave first and failed over as expected, so the database endpoint was only unavailable for a few seconds. The entire instance class change took about 20 minutes.

Around 7:10 PM PST, we started to progressively let traffic back in at our load balancer. The database instance performed exactly as expected as we ramped traffic back up, and we had more than half of our POPs re-enabled by 7:46 PM PST. By 8:10 PM PST, we had all of our points of presence re-enabled and were carefully monitoring the situation. The database instance again performed exactly as expected, and we considered the system to be back online by 8:14 PM PST. Our support team then began to contact our customers directly, and our customers’ listeners via social media, to let them know that service had been restored.
We continued to monitor the entire system for several hours, and around 8:30 PM PST we rolled out a change to our web servers to allow them to queue more requests, as log analysis indicated that users were frequently being turned away because our web servers were excessively busy. Our web servers are not normally busy enough for queuing to make up any significant portion of a request’s run time; however, in the event of a database issue like the one we initially experienced, this will let us tolerate a higher quantity of slow requests for a longer period of time without refusing them.
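As an illustration of the kind of queueing change involved (we have not described our web server stack here), a gunicorn-style Python application server exposes this as a listen backlog setting in its configuration file; the values below are placeholders.

```python
# gunicorn.conf.py -- illustrative sketch only; the actual web server stack
# and values may differ. A larger listen backlog lets more requests wait in
# the queue instead of being refused while all workers are busy.
workers = 8       # hypothetical worker count
backlog = 4096    # default is 2048; raise it to queue more pending requests
timeout = 120     # let slow requests finish rather than killing the worker
```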
As a result of this outage, we identified several corrective actions we are taking to prevent a similar incident from happening in the future. We sincerely apologize to our partners, customers, and their listeners for this outage. We hope that this detailed postmortem will help you understand the cause of the outage and what we are doing to ensure it doesn’t happen again.
Corrective Actions
- We have added alarms to our database server so that we are alerted when the amount of free memory gets low, allowing us to schedule hardware changes during our weekly maintenance window. Had this alarm been in place prior to this outage, we would likely still have caused an outage during our maintenance window, since the database server’s inability to handle traffic after scaling up was due to it being an r3 instance class. (A sketch of such an alarm appears after this list.)
- If we need to stop database traffic in the future, we will block traffic at our load balancing layer instead of at our database layer, so that our stack becomes healthy more quickly. It took too long for our stack to internally become healthy after disabling and re-enabling traffic from the application to the database server.
- We will not use any database instance types that have ephemeral storage, as that is the major difference (other than the amount of RAM) between an m4 and an r3 instance class.
- We will institute a policy where only one database change can be performed at a time, even in emergencies. This should ensure that RDS always applies configuration changes via a fail over.
- We will be rolling out a change to our web player so that, where possible, a page load alone does not cause the player to reach out to ART19. Users and customers should not notice this change, since we are only making it for certain player types that do not need to be preloaded. It will also benefit listeners on slow connections, such as mobile networks, as their browsers will use less bandwidth to load our customers’ pages.
- It was already on our roadmap to decouple the listener-facing endpoints from the producer-facing endpoints, but we have prioritized this higher. If this were in place, ART19 would have retained partial availability during this outage.
- We will configure our CDN to have a more helpful error page, directing visitors to this status website in the event that the CDN cannot reach ART19.
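As a sketch of the free memory alarm mentioned in the first corrective action, such an alarm could be created with boto3 and CloudWatch as follows; the instance identifier, threshold, and notification topic are placeholders.

```python
# Sketch: alarm when the RDS instance's freeable memory drops too low.
# Identifier, threshold, and SNS topic are placeholders.
import boto3

cw = boto3.client("cloudwatch", region_name="us-west-2")  # assumed region

cw.put_metric_alarm(
    AlarmName="example-db-low-freeable-memory",  # hypothetical name
    Namespace="AWS/RDS",
    MetricName="FreeableMemory",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "example-db"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=2 * 1024 ** 3,                     # alert below roughly 2 GiB free
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-west-2:123456789012:ops-alerts"],  # placeholder topic
)
```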
Status Updates
2017-01-27 8:14 PM PST (UTC-8)
We are back online. We will continue to monitor the situation closely.
2017-01-27 7:46 PM PST (UTC-8)
We will be working on a postmortem, but we appear to have resolved the failure. We are currently progressively enabling points of presence (POPs) on our CDN, but we are seeing no issues with the POPs we have turned on. If you are nearest to a POP that we have enabled, you will be able to access ART19. If you are not, you will continue to experience connectivity issues. We have enabled nearly half of the available POPs, and have had no issues thus far. We anticipate returning to full availability soon.
We will update this page when we are fully available.
Thank you for your continued patience.
2017-01-27 3:31 PM PST (UTC-8)
We are currently experiencing slower than normal queries, which is leaving us with insufficient capacity to handle the number of users accessing the application. Many requests are completing successfully, but most users are having trouble accessing the application. We are currently adding a large amount of capacity and expect these problems to taper off shortly.
2017-01-27 1:48 PM PST (UTC-8)
The emergency maintenance is taking longer than anticipated to complete. As a result, ART19 is intermittently available. We will continue to update this page as we have more information.
2017-01-27 1:13 PM PST (UTC-8)
ART19 is currently unavailable due to emergency unplanned maintenance. We expect service to return to normal in the next few minutes.