Changing The Engine While You’re Driving Down The Highway
We recently ran into an interesting and difficult problem: how do we change a large, heavily-used table without violating our “avoid downtime if possible” mantra?
The following is a slightly expurgated version of the postmortem I sent the rest of the team. Table names have been changed to protect the guilty.
In the Sinistar release’s post-release migrations list, there were two
ALTER TABLE migrations slated to make structural changes to the
customer_login tables (changing the types of some columns, specifically). When we tested these migrations on a copy of the production system’s data, we discovered that they would have caused those tables to be inaccessible for two and five minutes respectively. Since customers are core to our applications, we determined that this qualified as “downtime” in the release plan.
I prototyped an alternate approach to the
customer changes (outlined below) intended to make changes to the table without taking the app offline while they were happening. The alternate approach was considerably more complex than the original
ALTER TABLE statement, as well as being considerably slower (~15 minutes for our largest shard, as compared to two minutes); we discussed whether it was worth taking the app offline at zero-dark-hundred on a weekend instead and concluded that we’d need to be able to make online changes to expensive tables eventually and opted to go ahead with the alternate approach as a learning experience.
How Did It Work?
The alternate approach we used relied on more MySQL-side tools to make the changes. Specifically, we:
- Created a new, empty version of the
customertable (the “target”) table based on the current
customertable plus intended structural changes. Structural changes to an empty table that the app does not use are quite cheap.
- Installed triggers on
DELETEoperations on the
customertable that copied rows (using
REPLACE) into the target table, applying the intended transforms to the affected columns.
Installed a server-side procedure (“the migration procedure”) that:
- Selected all of the keys from the original
- For each key, copied the corresponding row from the
customertable to the target table using
- Selected all of the keys from the original
- Ran the migration procedure. (This part is where most of the time went.)
- Verified that the contents of the target table were in 1:1 correspondence with the contents of the original
customertable, where the only permitted differences between corresponding rows were the changes intended by the original migration.
- Renamed the original
customertable out of the way and renamed the target table to
- Verified that nothing had gone horribly wrong in our apps.
- Dropped the renamed version of the original
customertable, along with the triggers and the migration procedure.
The steps are grouped into four phases:
prepare.xml (covering steps 1-3),
execute.xml (covering step 4),
cutover.xml (covering step 6) and
cleanup.xml (covering step 8). The steps not covered by phases were performed semi-manually.
The changes to
customer_login proceeded along the same paths.
What Went Right
Well, first of all, we noticed a potential app outage hazard before it went live.
We were able to back out and rewrite the migrations without affecting existing development environments, where the original
ALTER TABLE might have already run, thanks largely to Liquibase‘s precondition support.
We ran through several versions of this plan, both on my development environment image and with QA (on our release candidate environment), which ferretted out a handful of subtle bugs in the migration procedure that would’ve damaged customer data. Our QA team’s extensive experience and automation tooling around our applications prevented any of those mistakes from going live.
Designing the hotfix in several phases permitted us to run the relatively safe phases over the weekend, with relatively little involvement from our operations team beyond monitoring.
The verification steps did not take out our applications in the process, and revealed a way to move at least some verification steps out of our applications’ scripts directory and into the
core-dbs source tree.
Finally, it worked. Despite the relatively high complexity and the large collection of moving parts, this alternate plan worked very well. Rather than taking our applications offline, everything stayed up (and didn’t slow down too badly).
What Went Wrong, And What We Learned
Finding Problems Late is A Problem
We didn’t catch the initial problem before it went out to everyone’s development environments. I’ve habitually not worried about migration timing until code freeze, which means that potentially-troublesome migrations will have already run on our bleeding-edge environment (and possibly our release candidate environment) before we notice the problem. This lead to increased complexity in the alternate migration path’s Liquibase configuration – we made extensive use of preconditions to ensure that the alternate migration would be harmless on systems where the original
ALTER TABLE had already run, which seems to have worked, but I’d rather not have had that problem in the first place as it’s another moving part that can have problems.
Concurrency is a Hard Problem
The first version of the migration procedure looked like this:
- Select all of the rows from the original
- For each row, insert that row into the target table using
The select and loop steps use a MySQL cursor to iterate over their results. While the MySQL documentation states that it’s undefined whether a cursor is over a snapshot of the query results or over the underlying tables, in practice we found that it was using a snapshot. Since the loop took considerable time to run, and since it (intentionally) did not prevent further changes to the original
customer table from occurring after the snapshot, the loop would insert, into the target table, a “stale” version of any
customer row that had been modified since the snapshot was taken (overwriting the correct version copied into the target table by the triggers on the original table).
We only noticed this on our release candidate environent. When we ran through the migration plan initially to ensure that the procedure did not block our applications’ normal usage, QA ran part of their Selenium suite, which produces a relatively high volume of test traffic. Out of ten thousand customers in the
release-candidate databases, four customers had stale rows copied. This is not a bug we would ever have identified using only manual testing (even with a verification tool), so I’m intensely pleased that we had automation in place and very glad I got QA involved in testing the hotfix as extensively as we did.
Switching from a snapshot of the rows in
customer to a snapshot of the keys in
customer and selecting out the row itself only immediately before copying it to the target table addressed the problem; even in production, our verification script identified exactly zero mismatches between the original
customer table and the target table (across ten million customers). It also made the procedure body somewhat shorter and simpler by reducing the number of fields copied from the cursor on each step of the iteration.
I’ve Got Ten Million Queries, Is This Bad?
Boy, is it slow. The migration procedure, which issues one query per affected
customer row, takes approximately ten times as long as a straight
ALTER TABLE, and produces considerably more IO load. This lead to the
customer hotfix running into Friday morning (we kicked it off Thursday afternoon). I made some changes to the
customer_login migration procedure to attempt to compensate – instead of issuing a
REPLACE query for each row, we used a prefix of
customer_login‘s primary key (
customerid) instead to issue only one
REPLACE query for each customer’s worth of rows. This seems to have helped, but not by much. Future iterations of this process might even go by larger groups of rows, depending on the average and worst-case number of rows per group.
The Best Laid Plans…
We ran through three iterations of the migration procedure for the
customer table before we settled on a working version. While I did design recovery migrations to back out the broken versions of the procedure that were intended to leave trunk and other environments unaffected by the iterative testing process, they didn’t quite work as planned. Somehow (and we still don’t know how for sure), all of the customers on our bleeding-edge environment were deleted outright instead of being migrated over. We weren’t able to reproduce the problem with the final plan, and we do keep backups of our testing databases, so we were able to recover the missing customers, but it’s another thing that we only noticed because of QA’s intervention.
We avoided this with
customer_login by not building the migrations for that table until we’d ironed out issues in the
customer migrations, and in the future we’ll avoid rolling this sort of hotfix into master (for our bleeding edge environment and development environments) until we’re confident that it’s correct.
No Space Left On Device
This is one that caught me totally by surprise. When we ran the migration procedure for
customer in production, we learned that passing 10m queries through the MySQL binary logs (for replication) uses up a lot of space. Fortunately, we keep the binlogs on a separate partition from the main data store, so this did not take MySQL out, but it could’ve if our Ops folks weren’t so sharp. This did, however, take replication out temporarily. Operations resized the partition the binlogs live on, which kept things running (slowly, but smoothly) while the
customer_login migration procedure ran.
Read Locking Is Not Your Friend
Our initial version of a verification tool, a PHP script in one of our apps’
scripts/ directory that ran against the live databases on our master database, would’ve taken our applications offline in roughly the same way as the original
ALTER TABLE statements would’ve. I only realized this when I ran the verification queries (outside of their script) against one of our replica databases’ shards and realized how long they took, shortly before we had planned to run the verification step in the hotfix plan. Fortunately, the verification used very little from our app’s environment; I ported it to a simple shell script and ran it against a replica (after replication caught up) instead. In the future, I’m going to try to get more of our verification scripts written that way where possible, since it does a good job of insulating our applications from side effects.
Working with large data sets is hard, especially in MySQL (where even reads can cause locks). Making structural changes to those data sets is even harder, especially while the application’s in flight. However, I feel like this went relatively well even with the “creative” issues discovered during testing, and we’ll be using this approach for large tables more frequently in the future.