Improving the reliability of WebHooks
At FreshBooks we dogfood our own API and WebHooks, which is how we noticed a curiosity in our implementation. Repeater, our WebHooks engine, is designed to retry a WebHook when the endpoint URI is temporarily inaccessible, i.e., when it receives a non-200-series response code. Unfortunately, Repeater didn’t repeat.
In its original implementation, Repeater maintained an in-memory priority queue of all WebHooks that needed to be retried, and a pool of threads that worked on items in the queue. This design supported prioritizing first-time WebHooks over retries, as well as exponential back-off between subsequent retries; however, it suffered from two major drawbacks. First, the priority queue was held in memory, so all pending retries were lost whenever Repeater was restarted, most likely during an upgrade. Second, it relied on multi-threaded code, which is hard to debug and maintain.
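As a point of reference, the exponential back-off the original implementation supported can be sketched in a few lines. The base delay and cap below are assumptions for illustration, not Repeater's actual values:

```python
def backoff_delay(attempt, base=30, cap=3600):
    """Delay (in seconds) before the given retry attempt.

    Doubles on each subsequent retry (30s, 60s, 120s, ...) and is
    capped so a long-dead endpoint is not retried less than hourly.
    The base and cap here are illustrative assumptions.
    """
    return min(base * (2 ** attempt), cap)
```

With this scheme each failed WebHook backs off further on every retry, which is the property that was traded away in the new fixed-delay design described below.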
Given that Repeater was not performing one of its fundamental operations reliably, and that RabbitMQ is better suited for durable queues, we decided to replace the internal priority queue with a simple queue on RabbitMQ. We also replaced the pool of worker threads with a newer version of Sparkplug, which supports multiple processes out of the box. In the new implementation, Repeater attempts to post a WebHook to a URI; if the attempt fails, the attempt and all related information are inserted into a second queue. A slightly modified version of Repeater then reads the message off the queue, sleeps for ‘N’ seconds, and attempts to post the WebHook again. If the attempt fails again, the message is reinserted into the same queue; this process is repeated ‘K’ times, after which the attempt is discarded.
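The retry loop above can be sketched roughly as follows. This is a minimal model, not Repeater's actual code: a `deque` stands in for the durable RabbitMQ queue, `post` is an injected callable that returns True on a 200-series response, and the values of ‘N’ and ‘K’ are assumptions:

```python
import time
from collections import deque

MAX_RETRIES = 5   # 'K' in the text; the actual value is an assumption
RETRY_DELAY = 0   # 'N' seconds; zero here so the sketch runs instantly

def process_retries(queue, post):
    """Drain the retry queue, re-posting each failed WebHook.

    `queue` stands in for the durable RabbitMQ retry queue; `post`
    is a callable (url, payload) -> bool, True on success.
    """
    delivered, discarded = [], []
    while queue:
        msg = queue.popleft()
        time.sleep(RETRY_DELAY)              # fixed delay before retrying
        if post(msg["url"], msg["payload"]):
            delivered.append(msg)
        elif msg["attempts"] + 1 < MAX_RETRIES:
            msg["attempts"] += 1
            queue.append(msg)                # reinsert into the same queue
        else:
            discarded.append(msg)            # give up after 'K' attempts
    return delivered, discarded
```

For example, a message whose endpoint never recovers (`post` always returns False) cycles through the queue until its attempt count reaches the limit and is then discarded, while a recovered endpoint sees the message delivered on the next pass.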
The new implementation was deployed to production several months ago. It is more reliable and easier to maintain and test. While Repeater no longer supports exponential back-off, we gained the ability to introspect the retry queue in production, and, more importantly, we no longer lose the retry queue on an upgrade or restart.