How we use clojure.spec, Datomic and webhooks to integrate with version control systems

Earlier this year, Shortcut launched new integrations for both Bitbucket and GitLab. These integrations work similarly to our existing GitHub integration, allowing you to attach commits and branches with Stories within Shortcut, and transition those Stories between workflow states based on state changes of your merge/pull requests (see our Help Center for details). These integrations are powered by webhook events sent to us from Bitbucket/GitHub/GitLab on your behalf.

In this post, we’re going to take a look at the design of the subsystem that powers these integrations, and which will also serve as the basis for the next version of our GitHub integration (currently in development).

We'll cover some of the requirements that we considered when designing the subsystem that receives and processes the webhook events mentioned before, an overview of what we built, and some of the lessons that we learned in the process.

Non-functional Requirements

When designing this subsystem, we had a few non-functional requirements that we wanted to meet:

Availability: The subsystem needs to be generally available. Since webhook events from external services drive it, it needs to be reachable to capture those events when they occur.
Extensibility: The subsystem needs to be built in a way where it is easy to extend with new features, especially by developers that didn't work on the initial implementation.
Idempotency: The subsystem needs to support processing the same events more than once, since the providers don't guarantee at-most-once delivery of the webhook events.
Maintainability: The subsystem needs to be straightforward to maintain (i.e., fixing faults), especially by developers that didn't work on the initial implementation.
Recoverability: The subsystem needs to allow for recovering from errors, re-processing events that trigger an internal fault once that fault is corrected.
Reliability: The subsystem needs to handle transient failures when processing events, re-processing the events as necessary.
Transparency: The operations performed (or not performed) by the subsystem need to be visible. Since direct user interactions don’t drive this, we need a way to inspect the subsystem's behavior, especially when users report potentially incorrect behavior.

System Design

With our initial GitHub integration implementation, we handled all of the webhooks during a request cycle within our API server instance. There were a couple of notable issues with this approach:

With multiple API server instances and no guarantee of at-most-once delivery, we would see race conditions where two API servers were processing the same event at roughly the same time, leading to errors when the second server tried to transact the same changes to the database.
If a transient fault occurred when handling an event, we would have to rely on the provider (in this case, GitHub) to resend the event so we could try to process it again. If the fault were a regression that we caused, GitHub would give up trying to send the event before we could deploy a fix, and we would lose the event completely.

We wanted to address these issues as part of the design of this new subsystem, so decided to decouple the receipt of the event from the processing of the event, with a mechanism to bring retries sufficiently under our control.

At a service level, this new design has this structure:

Receiver Service

This service is relatively simple - it receives the webhook payload from the provider, writes it to an Amazon S3 bucket, then emits a message to an Amazon Simple Queue Service queue with some metadata about the payload, including the path to the S3 object that contains the payload itself.

We chose to use a queue here to gain a few benefits:

It allows us to have higher availability. The Receiver is simple, and rarely changes so needs to be deployed infrequently. All of the logic is in the Handler, which does get redeployed frequently, but doing so doesn't interrupt event processing, since the Receiver continues to add events to the queue.
Having a queue gives us a simple retry mechanism, both in the case of a fault or transient error. We only remove a message from the queue once we have successfully processed it, and if we try to reprocess the event N times, we move the message to a dead letter queue. We can then address the fault (or wait for the transient error to resolve), then move the message from the dead letter queue back to the primary queue for reprocessing.

There is a trade-off for using a queue here, however: SQS provides at-least-once delivery (we will see every message), but does not guarantee at-most-once delivery (we may see the same message multiple times). To handle this, the subsystem is designed to process messages idempotently, so there is no harm in processing the same message multiple times.

We store the payload in S3 instead of including it in the SQS message because maximum allowed SQS message size is 256 KB, and we can't guarantee that all payloads will be less than that size, especially since we don't control the generation of those payloads (there are other benefits to using S3 for the payloads, as we'll see in a bit).

One drawback of this pattern is that, since we are no longer processing the event as part of the delivery request, we cannot signal the status of processing that event back to the provider that sent it. This means that you can’t use the provider’s UI to determine if the events are actually processing, since they will all have been enqueued successfully. In practice, this hasn’t been an issue since a processing failure would usually be a fault triggered by our implementation that we would detect and fix quickly.

Handler Service

The Handler service is where all the work happens. It is single threaded, and there is only ever one instance of the service running. This ensures that the same message is not processed twice, concurrently, removing the need for any coordination or locking mechanisms, simplifying the subsystem.

The Handler is divided into three layers:

Each layer takes data as input and emits data to the next layer (except for the Entity Processing layer, which writes to the database). The data inputs for each layer have a specification (via clojure.spec) that ensures the data is of the expected shape.

Translation Layer

This layer receives the webhook payloads from the providers (with one implementation per provider), and converts the event to an internal, common format, using the provider's API to resolve any missing data. For example, Bitbucket and GitLab don't send all of the commits in a push webhook if the number of commits is over a certain threshold - in those cases, the translator uses the API to get the full list of commits.

We use a common format as the output of the translation layer because the Bitbucket and GitLab webhook events have different shapes, and represent some concepts in very different ways. Having a common format lets us have one implementation for the rules that define what actions we will take based on events, enabling us to implement new features once, with all three providers getting the feature at the same time. To help achieve this, we encapsulate all of the provider-specific code in the translation layer, and the layer's sole responsibility is converting the event into the common format.

Once the translator has all of the data it needs, it emits an event that is in the common format that is passed on to the next layer for further processing.

Event Processing Layer

This layer implements the rules for processing events, converting the events it receives into a set of declarations about the state the database should be in once we are done processing the event. It includes a hint if the data should be transacted to the database always (an upsert) or only if the entity already exists (an update). It doesn't apply these declarations against the database; it just passes them on to the next layer for further processing.

There is a single implementation of this layer that is used for events from all of the providers, since the translators convert the webhook events to a common format.

Entity Processing Layer

This layer takes the declarations and turns them into a Datomic transaction, then applies that transaction. It takes advantage of a few handy Datomic features:

Transactions are generated as Clojure data structures without communicating with the database. This allows us to build them up using standard language functions.
Datomic will ignore attributes in a transaction if the attribute already has the value you are trying to transact. This simplifies the logic quite a bit, since the code that generates the transaction data doesn't need to check the value of attributes before trying to set them in the transaction.
Once a transaction has been applied to Datomic, we have an entity in the database that represents the transaction and can be queried like any other entity. This allows us to quickly identify all the changes that occurred from a given webhook event, since we have a one-to-one correspondence between a webhook event and a transaction (if the event actually resulted in a transaction being generated).

When this layer has finished processing an event, we're done with it, and we delete the message from SQS.

Lessons Learned

We learned a few things while building this subsystem:

clojure.spec

This system is the first significant project on which we have used clojure.spec, and it proved useful in a few ways:

Specifying the inputs to each layer in one place made it easier to visualize the shape of the data as it flowed through the subsystem, and proved valuable when describing the behavior to other engineers.
Instrumenting the inputs and outputs of our layer boundary functions (via Orchestra), during testing, enabled us to quickly catch cases where we deviated from the spec, allowing us to see the issue at that boundary instead of as a random error within the implementation.
Validating the layer inputs against our specs in production allows us to quickly diagnose and isolate faults to the layer where it was triggered, and helps us catch when our assumptions about the shape of the webhook event are incorrect.
Having specs allowed for more straightforward tests, since we didn't need to assert on the shape of the data.

Overall, having specs has increased our confidence in making changes, especially our specs for the webhook payloads, since it is difficult to capture all the permutations of behavior of the provider's systems in tests.

However, we found a few trade-offs when using clojure.spec:

clojure.spec requires registering a spec for each key in every path that you care about in a data structure, which can be verbose, especially for deeply-nested data structures. We felt this pain the most when specifying the incoming webhook payloads from the providers. We didn't specify the full payloads, we only specified the values that we actually needed, some of which were nested 7-8 levels deep. To work around this, we used data-spec (part of the spec-tools project) to define the payload specs as a mirror of the shape of the actual data.
clojure.spec's error output is concise, and it's not always immediately apparent where the validation failure is within the data, especially when validating a large data structure. To help with this, we used expound to generate friendlier error messages at the REPL and in tests.

Logging

We use structured logging throughout our backend subsystems, and log both branches of crucial decision points within the subsystem. We tie all of the log messages for an event together with a key that is generated when we first receive the webhook payload so that we can see all the messages for a given payload with one query. That allows us to easily see how we processed a given payload, which is valuable to our support team when understanding behavior when a customer has a question about it.

Storing payloads in S3

Storing the payloads in S3 made debugging faults much more straightforward - when we logged an error or observed unexpected behavior, we could inspect payload S3 object that triggered the event to identify the cause. We could then turn the payload into an input to test to ensure the fault was corrected.

Wrapping Up

Much of the work here was an experiment with techniques and technologies that were new to our overall system. Overall, we consider this a successful experiment and have already applied some of the lessons learned to other subsystems within Shortcut. We also now have a subsystem that is easier to understand (both operationally and at the code level), easier to debug, and easier to maintain. If you have any thoughts or questions on this system, we'd love to hear them in the comments, or on Twitter!

‍