More reliable webhooks with queues
(Note: This is the third article in a series about how we scaled Shortcut's backend. Here are the links to previous posts in the series: 1. How we didn't rewrite Shortcut; 2. Monolith, meet mono-repo.)
When it came to splitting services out of our backend monolith, one of the first areas we wanted to tackle was incoming webhooks. Shortcut makes extensive use of webhooks for integration with other services, most notably GitHub.
The original GitHub integration in Shortcut had webhooks arriving at the same cluster of machines as regular API traffic, a.k.a. the monolith. As discussed in the first article in this series, the monolith was convenient for Shortcut developers, especially in the early startup days. But as Shortcut grew and traffic increased, the webhook requests started to cause problems.
At first glance, a webhook seems much like any other API. It is, after all, just another HTTP request. But on closer examination, webhooks have a few important differences that set them apart. Shortcut is primarily a user-facing application: “Normal” API requests are driven by user interaction, whereas webhook requests are driven by events in other systems. Even though it’s all HTTP on the wire, normal APIs and webhooks have different behavior and expectations:
Normal API Request
- Usually one action at a time, limited by how fast a person can type or click
- Response is expected to include the results of the user's actions
- Read-only requests are orders of magnitude more frequent than writes
- If the application is unavailable, users will typically come back later and try again

Webhook Request
- No limit on request rate; may arrive in sudden bursts
- No response body expected; just acknowledge with `HTTP 200 OK`
- If the server is unavailable, the sender will retry a fixed number of times or not at all, depending on implementation
The unpredictable size and rate of webhook requests sometimes caused problems when they hit the same servers as “normal” API requests. A single user action in GitHub — merging a pull request, say — might reference hundreds or even thousands of commits. Parsing that webhook consumes time and resources on the server that received it, leaving less for other customers. A flood of webhook events could briefly overwhelm a majority of the backend cluster, degrading performance for everyone.
Then there’s the problem of lost webhooks. Most webhook senders are fire-and-forget: if the request does not succeed, it is never retried. The backend API server is a complex beast that could fail for any number of reasons, from failing hardware in a datacenter all the way up to a bug in our code. If one of those failures happens while handling a webhook request, then that’s it — we’ve lost that webhook event for good. If the backend API is down (hey, it happens), then we lose all the webhook events sent during that period.
Queue first, ask questions later
As a general rule, simpler systems are more reliable than complex ones. But here we have a great counter-example where adding more moving parts can, counterintuitively, make the overall system more reliable.
If we think about the constraints placed on us by incoming webhooks, we see they are mostly about availability: We need to be able to receive a webhook at all times, with minimal interruptions. But importantly, the response to a webhook does not need to include the results of processing that webhook. We can acknowledge the webhook request and then process it later.
So our webhook processing architecture can trade responsiveness for better availability and reduced data loss. Fortunately, there’s a well-known technology that already has these properties: queues! We decided to introduce a queue between the process that receives the webhook and the process that handles it. In our AWS-based infrastructure, it looks like this:
The “Webhook Receiver” process runs an HTTP server to receive webhook requests. It does almost nothing with the request, just writes it to S3, puts a message on an SQS queue, and responds with HTTP 200 OK to the sender.
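The receiver's job can be sketched in a few lines. This is an illustrative sketch, not Shortcut's actual code: the function names, bucket layout, and UUID key scheme are assumptions, and the `s3`/`sqs` arguments stand in for boto3-style clients.

```python
import json
import uuid

def build_pointer_message(bucket, key, event_source):
    # The SQS message carries only a pointer to the S3 object, keeping it
    # well under SQS's 256 KB message size limit (see the footnote below).
    return json.dumps({"bucket": bucket, "key": key, "source": event_source})

def receive_webhook(s3, sqs, queue_url, bucket, payload, event_source):
    # One S3 object per webhook; a UUID key avoids collisions.
    key = "webhooks/{}/{}.json".format(event_source, uuid.uuid4())
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody=build_pointer_message(bucket, key, event_source))
    return 200  # acknowledge immediately; processing happens elsewhere, later
```

Note that no parsing of the payload happens here at all, which is what keeps the receiver simple enough to stay highly available.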
Another process on a different machine, “Webhook Writer,” pulls messages from the queue and processes them. Most of this code was lifted from the old, monolithic API Server. Essentially, we created a new process that emulated just enough of the monolith to support the webhook-processing code. Ultimately it was this shim that made the whole project feasible, enabling a major architectural change without the cost of rewriting everything.
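The writer side might look something like the following sketch (again, names are hypothetical and the clients are assumed to be boto3-style). The key detail is the order of operations: the message is deleted only after processing succeeds, so a crash or exception leaves it on the queue for SQS to redeliver once the visibility timeout expires.

```python
import json

def process_one_batch(sqs, s3, queue_url, handle_event):
    # Long polling: wait up to 20s for a message instead of busy-looping.
    resp = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        pointer = json.loads(msg["Body"])
        obj = s3.get_object(Bucket=pointer["bucket"], Key=pointer["key"])
        # May raise; if it does, the message stays on the queue and is
        # redelivered after the visibility timeout -- retry is automatic.
        handle_event(obj["Body"].read())
        # Delete only after successful processing.
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
```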
Why is this design better? The introduction of Webhook Receiver, SQS, and Webhook Writer adds a set of guarantees that were not possible in the old architecture:
- We can always receive webhook events as long as Webhook Receiver, SQS, and S3 are available.
- A message in the queue will remain there until it is processed successfully.
- If the Webhook Writer fails to process a message, it will be retried.
- If a message is not processed after several attempts, it will be moved to a Dead Letter Queue (DLQ).
These guarantees move the burden of availability to components that are intrinsically more reliable. The Webhook Receiver is so simple that we almost never think about it: It just works. SQS and S3, of course, are backed by Amazon. The retry/DLQ behavior is configured in SQS, so the Webhook Writer doesn’t need any code to implement retries.
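The retry/DLQ behavior lives entirely in the queue's redrive policy. A sketch of that configuration, with a placeholder ARN and retry count:

```python
import json

# Placeholder values -- the real ARN and retry count depend on your setup.
redrive_policy = {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:webhooks-dlq",
    "maxReceiveCount": "5",  # failed receives before SQS moves it to the DLQ
}

def dlq_attributes(policy):
    # Shaped for sqs.set_queue_attributes(QueueUrl=..., Attributes=...)
    return {"RedrivePolicy": json.dumps(policy)}
```

Because SQS enforces this policy itself, the Webhook Writer contains no retry logic at all: it either deletes a message after success or lets SQS redeliver it.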
Our trust in the Receiver stack allows us to take more risks with Webhook Writer. We can deploy and test new versions with low risk: If Webhook Writer goes down, messages just pile up in the queue until we fix the bug or roll back to a stable release. Having the messages available in the Dead Letter Queue and S3 is a huge help when debugging. This was valuable when working on a major addition to our webhook integrations, as my colleague Toby Crawley described in How we use clojure.spec, Datomic and webhooks to integrate with version control systems.
Best of all, collecting webhook traffic at the “edge” of our system means we can shut down everything else for maintenance and still not lose any incoming data. We have taken advantage of this to perform critical database maintenance — more about that later in this series.
As with any architecture choice, of course, there are still downsides. Our infrastructure is more complicated, with more things to deploy and more AWS permissions to sort out. Debugging misconfigured webhook integrations is slightly harder, because Webhook Receiver does not provide any feedback to the sender about whether the request was successfully processed.
Overall, though, the added complexity to our systems made our lives as developers easier. This architecture change both made our webhook handling more robust and reduced the load on other backend servers. Removing webhook integrations from the API server monolith allowed us to iterate more rapidly with less risk. Problems with webhook processing have less impact on other features. Finally, this architecture can be adapted to other kinds of asynchronous event processing. Adding support for new version control services (GitLab and Bitbucket) was just the first step. We plan to build more integrations with this architecture in the future, details to come.
Footnotes

- We have to store webhook requests in S3 because they may be larger than the message size limit imposed by SQS. Having the messages stored in S3 also makes it easier to debug failed webhook requests.
- We also considered AWS Lambda for the Webhook Receiver. Although Lambda is an excellent fit for simple tasks like this, at the time we had no operational experience with Lambda so we decided to limit the number of new technologies we were exploring concurrently. We may choose to replace Webhook Receiver with a Lambda function in the future, but so far there has been no compelling need to do so.
- We arrived at the names “Webhook Receiver” and “Webhook Writer” after some thought and discussion. They were among the first new backend services we deployed apart from the API monolith, so we knew we were setting precedents for future services. We wanted to avoid generic terms like “handler” or “processor” since those could be applied to almost any backend service. We tried to think about what makes these services unique. They both deal specifically with webhooks, so that part was easy. The most important property of the first process is that it is always available to receive webhook requests, hence “Webhook Receiver.” For the second, we zeroed in on the fact that it writes the results of processing webhooks into our primary database. Writing to the database comes with its own set of constraints, so “Webhook Writer” captures how it fits into our architecture.
- GitHub also provides a pull-based Events API from which events can be retrieved after the initial webhook notification. Using it requires a deeper integration with the GitHub API than we had at the time. We will likely pursue that in the future, following the pattern Toby described for GitLab and Bitbucket.