July 9, 2020 Incident Postmortem
If you attempted to use Shortcut on Thursday, July 9th from 6:39am to 9:40am UTC (2:39am to 5:40am Eastern Time), you probably noticed that we were experiencing a major outage. You may have also noticed that our Status Page claimed that everything was just fine, which was obviously not the case. As with almost all major outages, the severity of this incident was not caused by any one problem, but instead was due to a chain of events that sent things off the rails. Given the scale of the outage, we want to share our postmortem publicly to ensure that our customers understand what happened and what we’re doing to prevent incidents like this from happening again in the future.
- 5:02 AM UTC - Attempt to deploy fails
- 6:39 AM UTC - Engineer on-call is paged and responds
- 7:13 AM UTC - Service becomes unavailable
- 7:15 AM UTC - Root cause identified
- 7:16 AM UTC - Rollback fails; Begin solutioning
- 7:54 AM UTC - Solution identified; Begin rolling out fix
- 8:48 AM UTC - Deploy completes but fails to restore service; Begin investigation
- 9:30 AM UTC - Second solution identified; Begin rolling out fix
- 9:40 AM UTC - Incident fully resolved
What went wrong?
Fundamentally, Shortcut became unavailable to our users because of insufficient capacity of our API service.
Some background: To conserve resources we go through scaling events every day. We autoscale up before our busy period gets going (as Europe is waking up) and scale down as traffic slows (around the time San Franciscans have too many burritos in their hands to be able to type). Every day when we scale up at the predetermined time, we start lots of fresh machine instances. After each instance starts, our deployment system pushes the most recent successful application revision to that instance.
Ahead of our July 9th autoscaling event, our platform team pulled in a benign-looking update for awscli. This update changed the file path for the binary from /usr/bin/aws to /usr/local/bin/aws and introduced an incompatibility with our deploy scripts that caused our API Server’s deployments to fail on any newly launched instance.
Beginning at 6:11 AM UTC, we observed deployment failures for our API server as we were unable to start new machines. As traffic from Europe increased, the API service was still running on only the overnight fleet of servers. As a result, all users were effectively unable to access the application until we restored normal capacity around 9:30 AM UTC.
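To illustrate the failure mode: a script that hard-codes the location of a binary breaks when a package update moves it, while one that resolves the binary at run time does not. The Python sketch below is only an illustration under stated assumptions, not the deploy script that was in use. It assumes a POSIX host, and `resolve_binary` and `FALLBACK_PATHS` are hypothetical names; the two paths are the ones mentioned above.

```python
import os
import shutil
from typing import Iterable, Optional

# Hypothetical fallback locations; these are the two paths named in the text.
FALLBACK_PATHS = ("/usr/local/bin/aws", "/usr/bin/aws")


def resolve_binary(name: str,
                   fallbacks: Iterable[str] = FALLBACK_PATHS,
                   path: Optional[str] = None) -> str:
    """Return an executable path for `name`, or raise if none is found."""
    fallbacks = list(fallbacks)
    # Prefer a PATH lookup so a relocated binary is still found.
    found = shutil.which(name, path=path)
    if found:
        return found
    # Otherwise try known install locations in order.
    for candidate in fallbacks:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Fail loudly, naming what was searched, instead of failing mid-deploy.
    raise FileNotFoundError(f"{name} not found on PATH or in {fallbacks}")
```

Calling `resolve_binary("aws")` once at the start of a deploy and failing fast on `FileNotFoundError` would surface a moved or missing binary before any instance is put into service.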
How did this happen?
Specifically, how'd we end up with an AMI that failed to start?
It’s Shortcut policy to roll out any and all security updates within a set, short period of time. While we deploy changes to our application many times per day, AMI updates only go out once per day as we rotate instances and scale up. Our platform team normally tests AMI changes in our staging environment for several days before rolling out to production, but in this case they merged a change to our Terraform configuration for both environments simultaneously. As a result, we failed to detect that our AMIs were in a bad state until we started doing our daily instance rotation.
Why did it take so long to recover?
We encountered a number of issues that increased the time required to restore service.
Insufficient severity for deploy failure alarms
We first observed the impact of this change at 5:02 AM UTC on July 9th. A code deploy failure occurred at this time for our webhook receiver application. A Slack notification was sent and a message appeared in our AWS Console, but we didn’t see it, as nearly everyone who might have seen it was asleep and the severity wasn’t set high enough to page the Engineer on-call.
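The gap described here is a routing decision: the event reached chat but not a pager. A minimal Python sketch of severity-based routing follows. Everything in it is hypothetical (the `Severity` levels, the `EVENT_SEVERITY` mapping, the channel names) and it does not reflect how our alerting is actually configured; it only shows how raising one event's severity changes who gets notified.

```python
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    INFO = 1      # chat notification only
    WARNING = 2   # chat notification only
    CRITICAL = 3  # also pages the on-call engineer


# Hypothetical threshold: anything at or above this level pages a human.
PAGE_THRESHOLD = Severity.CRITICAL

# Hypothetical mapping from event type to severity. A failed deploy is
# placed at the paging level rather than chat-only.
EVENT_SEVERITY = {
    "deploy_succeeded": Severity.INFO,
    "deploy_failed": Severity.CRITICAL,
}


def channels_for(event_type: str) -> List[str]:
    """Return the notification channels an event should be sent to."""
    # Unknown events default to WARNING so they are at least visible in chat.
    severity = EVENT_SEVERITY.get(event_type, Severity.WARNING)
    channels = ["chat"]
    if severity >= PAGE_THRESHOLD:
        channels.append("pager")
    return channels
```

With this mapping, `channels_for("deploy_failed")` includes the pager, whereas a chat-only severity would have left the failure unseen until someone was awake to read it.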
Failed on-call escalation
At 6:39 AM UTC, we received an alarm that indicated that there was an issue with our inbound webhook handlers. This is not as serious a problem as the API Servers being unavailable, but we consider it a severe issue. PagerDuty notified the Engineer on-call, who acknowledged the incident and began investigating.
The Engineer on-call identified the root cause moments after the service became unavailable at 7:13 AM UTC. By then, many parts of our system had become unavailable, including our service for building AMIs. The Engineer on-call then attempted to page our Platform team for assistance to ensure a smooth rollback, but was unable to reach an engineer from that team due to a misconfiguration of the notification system.
Inconsistent Terraform state
After the issue was identified at 7:15 AM UTC, the Engineer on-call was joined by another engineer and together they developed a fix and attempted to roll it out. This required a change to our infrastructure configuration, which is managed via Terraform. Our process allows Terraform changes to be applied outside of version control when necessary. This enables us to quickly make updates, but makes it more difficult to know what is currently live. Usually the latter is not important as changes get merged very quickly. In this case, the Engineer on-call was unable to initiate a simple revert, and instead needed to rebase on top of all PRs labeled "terraform-applied" to ensure their change had a minimal surface area. This caused confusion and delayed the initial rollback until 7:54 AM UTC.
Conservative scale up process
After the initial rollback was initiated, we continued to monitor system status. While we saw new servers starting up and being available to serve traffic, we continued to observe many 5xx responses and timeouts. By 8:48 AM UTC when the rollback completed, we knew something else was wrong.
More context: Our current scale up process only adds new instances into service when all running instances are passing their health check. This is done to ensure that when we start new instances, we don’t add them into rotation until they’re ready. When starting several instances at once, they will report as healthy at different times and may get added into rotation at different times. This is problematic when we are severely under capacity, as the new instances also become overwhelmed, fail their health check, and prevent other healthy instances from being added into service to meet capacity demands. We have a procedure for updating the scale up process to continue adding healthy instances into service even when others are failing their health check; however, this was not documented as part of the runbook.
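The blocking behavior described above can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions, not our real scale up logic: `admit_instances` and its parameters are hypothetical, and each instance's health is reduced to a single boolean.

```python
from typing import List


def admit_instances(in_service_healthy: List[bool],
                    pending_healthy: List[bool],
                    require_all_healthy: bool) -> int:
    """Return how many pending instances may be added into rotation.

    in_service_healthy: health-check result for each instance already serving.
    pending_healthy: health-check result for each newly started instance.
    require_all_healthy: True models the conservative process described
        above; False models continuing to add healthy instances even when
        others are failing their health check.
    """
    ready = sum(1 for ok in pending_healthy if ok)
    if require_all_healthy and not all(in_service_healthy):
        # A single overloaded instance failing its check blocks every newcomer.
        return 0
    return ready
```

For example, with one of two serving instances failing its check and two of three new instances ready, the conservative setting admits none (`admit_instances([True, False], [True, True, False], True)` returns 0), while the relaxed setting admits both ready instances (the same call with `False` returns 2).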
The Engineer on-call suspected this was the issue by observing machines starting, several coming into service, and reporting as unhealthy on our load balancer status page. The Engineer re-implemented this process and deployed that change at 9:30 AM UTC. Service quickly recovered thereafter.
Why wasn’t our status page up-to-date?
At 7:17 AM UTC we received an alarm indicating our Production API Server response times were beyond tolerance. The Engineer on-call should have taken a moment to update the Status Page and Twitter, but they were knee-deep in investigation and there wasn't anyone else at that time to manage public communication. They paused to provide an update around 8:19 AM UTC, but only after starting rollout of the fix. We very rarely have outages on this scale (the last time was around two years ago), so we haven’t regularly tested our processes around end-user communication. Regardless, our response here was not acceptable.
What steps are we taking to help stop this from happening in the future?
- We're adding automated notifications for our Status Page to ensure our customers get notification of incidents in a timely manner. We’re also updating our general incident checklist to include items around public communication.
- We're going to begin sending alarms for failed deployments and update our runbooks to account for them so they're less likely to set off this sort of chain.
- We’ll start regularly testing our escalation tree and automatically paging additional people when there’s a major incident. The Engineer on-call may not respond to a page, may need help with communication, or may not be able to resolve the incident themselves. We need to ensure the necessary resources are available when things go awry.
- We’re adding instructions to our runbooks to bypass the health check when under capacity and we’re going to investigate alternatives to our current instance scale up process.
- We've updated our process to ensure that AMIs are deployed to staging first, and tested there for at least 24 hours prior to being released to production.
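The last item above amounts to a promotion gate. As a rough Python sketch of such a gate (an illustration only, not our actual tooling): `can_promote_to_production`, its parameters, and the `staging_healthy` flag are hypothetical, while the 24-hour minimum comes from the action item itself.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The 24-hour minimum comes from the action item above.
MIN_STAGING_SOAK = timedelta(hours=24)


def can_promote_to_production(staging_deployed_at: Optional[datetime],
                              staging_healthy: bool,
                              now: datetime) -> bool:
    """Return True only if the AMI has soaked in staging long enough."""
    if staging_deployed_at is None:
        # Never deployed to staging: promotion is not allowed.
        return False
    if not staging_healthy:
        # Staging is currently failing: do not promote, whatever the age.
        return False
    return now - staging_deployed_at >= MIN_STAGING_SOAK


# Usage: an AMI deployed to staging 25 hours ago and healthy may be promoted.
_now = datetime(2020, 7, 10, 12, 0, tzinfo=timezone.utc)
assert can_promote_to_production(_now - timedelta(hours=25), True, _now)
```

A change that targets staging and production at the same time would fail this check, because the production rollout would have no staging deployment time to measure against.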
Ideally, it'll be two years (or more) before we face another problem on this scale. But whether it's two years or two months or two millennia, what we've learned from this incident will absolutely make us much more prepared to mitigate, communicate, and fix these sorts of problems. This has been a valuable experience for our team, albeit one we very much wish we could have avoided.
That said, we know that this was not a valuable experience for any of you who were trying to get work done during the outage. I sincerely apologize that this happened and cannot stress enough just how seriously we take these kinds of issues. We hate letting any of our customers down and are truly sorry if you were impacted by this incident.
I also want to express my personal appreciation to the team who handled this incident: our Engineer on-call, who was paged in the middle of the night and worked through issue after issue until the incident was resolved; the Engineers who happened to look at their phones, noticed an incident was ongoing, and jumped on to help; the Support and Success team who fielded customer questions; and the folks who participated in a constructive postmortem and have committed to addressing the action items to prevent this. I feel lucky to be able to work with such a collaborative team.