A Good Problem to Have: Scaling Recurring Billing on FreshBooks
October 18, 2010
When an application has been growing exponentially for a few years, you inevitably run into scalability issues. Recently, we took our slow, legacy recurring billing solution, rewrote it to take advantage of concurrent programming techniques, and increased throughput three and a half times, without disrupting our existing customers.
With your FreshBooks account, you can set up recurring billing to periodically charge your customers' credit cards. It's a powerful feature that was supported by a simple cron job. It would run through every FreshBooks recurring profile and check whether the invoice date matched today's date, which, as you can imagine, took some time. When it was written, it was the right solution.
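To make the original approach concrete, here is a rough sketch of what that scan-everything cron looked like. The names and the dict-based profile shape are illustrative, not our actual code; the real job also charged the card and recorded the result.

```python
import datetime

def bill_due_profiles(profiles, today=None):
    """Scan every recurring profile and return the ones due for billing today.

    `profiles` is assumed to be a list of dicts with a `next_invoice_date`
    key. The cost of this approach is the full scan: every profile is
    visited on every run, whether or not anything is due.
    """
    today = today or datetime.date.today()
    return [p for p in profiles if p["next_invoice_date"] == today]
```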
As our customer base grew, the cron job took longer and longer to run, eventually taking over seven hours to process all the invoices for the first of the quarter. That might not sound like a horrible problem, but during those seven hours, users were in the dark about their billing status. We are committed to providing a fantastic experience for our users and we knew we had to make it better.
A Nerd’s Dream
There are a million ways to solve this problem; for a scalability nerd like me, it was heaven! A good solution needed to be extensible and scalable, leverage our infrastructure, help pay off our code debt and, obviously, run faster.
We needed a solution that generated invoices faster and positioned us to handle future growth. Before tearing apart our billing infrastructure, it was important that we understood the problem first. Fortunately, the cron job was set up to record payment gateway response times and memory usage along with other metrics. A quick look at the data made it immediately obvious that the bottleneck was time spent waiting for payment gateways to process credit card transactions. Therefore, running these transactions in parallel would give the best throughput.
Writing concurrent code is difficult, but there was a conceptual breakthrough when we identified atomic actions, or tasks, in our recurring billing cron. We quickly broke the recurring billing cron into three different tasks: late payment notifications, invoice generation and retrying failed payments. From there, the next obvious step was to create classes that implemented the same executable interface and captured the essence of the work being done. Once we had classes that did one thing well and were unit testable, we gained the confidence to move on to the more complicated, "concurrent" architectural issues, or so we thought.
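The shape of that refactoring can be sketched as three small classes behind one executable interface. The class and method names here are illustrative, not FreshBooks' actual code:

```python
from abc import ABC, abstractmethod

class Task(ABC):
    """One atomic unit of recurring-billing work."""
    @abstractmethod
    def execute(self):
        ...

class LatePaymentNotification(Task):
    def __init__(self, invoice_id):
        self.invoice_id = invoice_id
    def execute(self):
        return f"reminder sent for invoice {self.invoice_id}"

class InvoiceGeneration(Task):
    def __init__(self, profile_id):
        self.profile_id = profile_id
    def execute(self):
        return f"invoice generated for profile {self.profile_id}"

class FailedPaymentRetry(Task):
    def __init__(self, payment_id):
        self.payment_id = payment_id
    def execute(self):
        return f"payment {self.payment_id} retried"
```

Because each class does exactly one thing through `execute()`, each one can be unit tested in isolation and scheduled independently.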
The Internal Web Service
As useful as the classes were, they still relied on our existing PHP code base, and while we are moving away from PHP, much of our invoicing code still exists in the PHP application. We decided to expose the tasks through an internal web service, letting Apache execute the PHP code in parallel and giving us a migration path towards Evolve, the Python application which is replacing our legacy PHP code.
Once we broke down the function into “Tasks,” the idea of workers harvesting from a queue felt natural; each worker would take a task off of the queue, process it and grab another one until the queue was empty. Luckily, we already had RabbitMQ in our infrastructure, and our extensive experience with Python led us to use Sparkplug. Sparkplug is a Python application that abstracts all the connection logic away from the developer and allows them to focus on handling the messages.
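The worker-and-queue pattern above can be sketched in plain Python. In the real system, RabbitMQ holds the queue and Sparkplug manages the connections and message delivery; this standalone sketch just shows the "take a task, process it, grab another until empty" loop:

```python
import queue

def worker(task_queue, handle):
    """Drain tasks from the queue, processing one at a time until it is empty.

    `handle` stands in for whatever actually performs the task; in our
    system that is an HTTP POST to the internal web service.
    """
    results = []
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return results
        results.append(handle(task))
```

Running several of these workers at once is what turns the serial cron into a parallel system: each worker blocks on its own payment-gateway round trip while the others keep going.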
Pieces Fall into Place
The final step was to create a cron job known as the dispatcher to fill the queue with all the tasks related to recurring billing. To further improve throughput and performance, the dispatcher only adds jobs to the queue for FreshBooks accounts that need to send out late payment reminders, create invoices or retry failed payments for today.
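A minimal sketch of that dispatcher logic, with made-up field names: the key idea is that only accounts with work due today produce messages, instead of every account being scanned downstream.

```python
import datetime

def dispatch(accounts, publish, today=None):
    """Enqueue one message per task that is due today.

    `publish` stands in for publishing a message to RabbitMQ. Accounts are
    assumed to be dicts with optional `reminder_dates`, `invoice_dates`
    and `retry_dates` lists; only due work is enqueued.
    """
    today = today or datetime.date.today()
    count = 0
    for account in accounts:
        for kind in ("reminder", "invoice", "retry"):
            if today in account.get(kind + "_dates", ()):
                publish({"account_id": account["id"], "task": kind})
                count += 1
    return count
```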
The final process looks like the graphic above. A Sparkplug application known as Tolar spawns workers. Each worker then takes a message off of the queue, transforms it and posts back to our internal web service. The internal web service then executes the task, creating an invoice or sending a late payment reminder to a customer's client. As work is dispatched to the queue, Tolar workers immediately start harvesting tasks, posting them to Apache which executes them in parallel.
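The "transform" step a Tolar worker performs is essentially turning a queue message into an HTTP request for the internal web service. A rough sketch, with a made-up URL scheme and payload format (not our actual wire format):

```python
import json

def to_http_request(message):
    """Turn a queue message into (method, path, body) for the web service.

    The path routes to the task type; the body carries the account to act
    on. Apache then executes the underlying PHP task for each request,
    which is what lets many tasks run in parallel.
    """
    body = json.dumps({"account_id": message["account_id"]})
    path = "/internal/tasks/%s" % message["task"]
    return ("POST", path, body)
```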
Over time, as the workload increases, we can add additional Tolar workers to keep the billing time constant. Conversely, if the servers are under heavy load, we can reduce the number of Tolar workers that process jobs from the queue.
After heavy internal testing, including a load test with production data to verify the solution would give us the speed-up we desired, it was time to roll out the new solution into production.
With such an extensive change to our billing infrastructure, and so many customers relying on our recurring billing solution for their business, it was critical that we tested the new code in production before we migrated our customers to it.
We created several FreshBooks accounts in production that tried to simulate real world accounts, then we modified the dispatcher to only dispatch tasks for those dummy accounts. We manually checked each account, making sure credit cards were charged properly, late payment reminders were sent quickly and all failed payments were retried.
As our confidence grew, we added a few more FreshBooks accounts to our beta group. We focused on accounts that had administrators that closely monitored their accounts and needed to generate multiple invoices daily during our beta period. With each passing iteration we migrated more and more customer accounts to the new infrastructure, finally taking it out of beta two months ago.
The Big Test and the Results
On the 1st of any month, there is a very high volume of recurring billing. And on the 1st of a quarter, it's even higher. Because of the volume, it has typically taken seven hours to run all of FreshBooks' recurring billing. So what happened this past quarter? Drum roll please… on October 1st, it took two hours – that is three and a half times faster!