Using triggers to un-shatter a schema
We deleted millions of customer database tables and nobody noticed. And that’s a good thing. It just took some thinking (a bit of reading), and 10 months of slow-but-steady work.
FreshBooks is a classic multi-tenant application. Customers sign up with us and live in their own little island of data (received invoices and contractors are treated like telephone calls to the next island). When the first version of FreshBooks shipped (and for many years after), every new account creation resulted in a new set of database tables being created. This worked well in the early days. Every time a web page was served by PHP, it would define the names of all the tables that that account needed to access, so you never had to use a WHERE clause to restrict data access to a single account. Early on in an HTTP request we converted an account name (e.g.
yourcompany.flashmind.info) into a numeric ID (e.g.
123), and tables were named using that id (e.g.
But times change. FreshBooks grew, and with it our data. By 2010 we were living with MILLIONS of database tables (yes, literally on the order of 1E+6). Databases are designed to deal with a few thousand tables, each with millions of rows, not a few million tables with a handful of rows each! Daily database server backups started to take 18 hours. MySQL stores MyISAM tables as 3 or more files on the filesystem, and 3 times MILLIONS of files takes a very long time for the operating system to open and copy. Per-table metadata overhead was often larger than the data described. Data warehousing was also becoming horribly slow, and we were restricted to analyzing a fraction of our data to keep the impact to a minimum.
We also ran into MySQL design decisions that performed poorly on our schema. MySQL has a that contains recently used file handles. The documentation suggests that the cache should be larger than the number of tables plus the number of concurrent accesses to those tables (each concurrent reader requires its own handle). However, with millions of tables that’s just not possible. The kernel wouldn’t let us hold that many files open, even if we had enough RAM. When we got bigger database servers a few years ago, increasing the size of the table cache from 100 000 to 200 000 file handles actually decreased performance because there’s an exclusive, global lock held during cache eviction, which is implemented as a naïve LRU. Doubling the size of the cache approximately doubled the time it took to evict the oldest cache entry! That in turn halved the rate at which we could open not-recently-used tables making your first log in of the day slower, and seriously hurt “background” jobs, affecting interactive performance drastically. At the time, we ended up turning the cache size down to 1000 to get reasonable interactive performance through all our workloads.
The “WTF”s from new hires didn’t help either. We had to take action.
The Schema Transformation
Take a simple association table that described which online payment gateways were allowed for a given invoice for FreshBooks account number 123:
CREATE TABLE `org123_allowed_gateway` ( `allowed_id` int(11) NOT NULL auto_increment COMMENT 'A synthetic PK', `gateway_id` int(11) default NULL COMMENT 'FK to gateways', `invoice_id` int(11) default NULL COMMENT 'FK to invoices', PRIMARY KEY (`allowed_id`) -- And maybe some indexes, but probably not; there usually isn't enough -- data to make them worthwhile ) ENGINE=MyISAM [DEFAULT CHARSET=latin1](https://flashmind.info/developers/blog/utf-8-is-here);
Because there’s one per account, we called these “shattered tables” (as compared to the more traditional “sharded tables”).
In order to use the database server the way it was intended, we wanted to convert it into a table that looked more like:
CREATE TABLE `invoice_allowed_gateway` ( `account_id` int(11) default NULL COMMENT 'FK to accounts', `allowed_id` int(11) NOT NULL auto_increment COMMENT 'A synthetic PK', `gateway_id` int(11) default NULL COMMENT 'FK to gateways', `invoice_id` int(11) default NULL COMMENT 'FK to invoices', PRIMARY KEY (`account_id`, `allowed_id`), KEY `allowed_id_idx_for_autoincrement` (`allowed_id`), KEY `account_invoice_idx` (`account_id`, `invoice_id`) ) ENGINE=MyISAM [DEFAULT CHARSET=latin1](https://flashmind.info/developers/blog/utf-8-is-here);
Primary key ordering
account_id was the first part of the PK, because (almost) all queries ask for data with a single
account_id value. This index ordering improves the specificity of a query including an
account_id restriction (most accounts had, say, a few hundred invoices; but we had thousands of accounts with
invoice_id = 1!). Because we’re coalescing data from many accounts into one table, there were duplicate
allowed_id values. As an additional transformation, we could have renumbered
allowed_ids, but many of our tables’ PKs were FKs, so it got messy quickly. We decided that it was not a good idea to do 2 transformations at once because of this added complexity. Consistency is a prerequisite for pretty.
MySQL and auto_increment with composite primary keys
The index on
allowed_id told MyISAM that we didn’t want to re-use
auto_increment values if rows were deleted. Without
allowed_id_idx_for_autoincrement MyISAM would use
MAX(allowed_id)+1 WHERE account_id = X, which would re-issue values for
allowed_id if we ever deleted a row with
MAX(allowed_id) GROUP BY account_id. Again, if you have any FK references (particularly in something like an activity log) this can wreak havoc. Remember that MyISAM doesn’t actually have Foreign Keys, so cascading deletes aren’t provided for you. It might be better not to delete rows, but non-repeating
auto_increment values are nice in their own right, too.
SQLAlchemy and autoincrement with composite primary keys
When using , we had to set
autoincrement=False on the
account_id column definition. If it’s not there, SQLAlchemy will assume that it is
auto_increment, leading to strange, un-obvious errors.
The Hard Way
In early 2010 we bit the bullet and started fixing our schema the hard way. We migrated a few tables this way:
- Rewrote all the SQL queries to look at the new table, being very careful to add an
account_id = Xrestriction to the
WHEREclause of the query. This was tricky, particularly when there were joins and/or sub-queries involved.
- Primed the database server’s disk cache. Most table data wasn’t in memory at any given time, and IO was the slowest part of this transformation. The first time running a data migration took about 3h for one set of shattered tables. After clearing out the shard table, a second run took only 15 minutes!
- Then we shut down the website while the migration executed! It sucked for our users, at 15 minutes a pop. At this rate we’d have had bi-weekly downtime for a couple of years before finishing.
The Easy Way
If you’ve read our previous posts on using triggers to effect seamless data migrations, you know where this is headed. If you haven’t read them yet, now is a great time. After fixing the code, database triggers allowed us to create and maintain an up-to-date copy of our data in the desired target form. For each shattered table we:
- Created the shard table.
- Created triggers on all the shattered tables, so that inserts, updates, and deletes propagated to the shard table. The triggers had to be written so that they couldn’t fail (i.e. used
REPLACE INTO), because the target data wasn’t guaranteed to exist there yet.
- Ran a backfill, basically:
REPLACE INTO invoice_allowed_gateway (account_id, blah, blah) SELECT 123, col1, col2 FROM org123_invoice_allowed_gateway;
Note that because the source table did not have an
account_id, we had to provide the literal
123 in the query.
- Every time a new account was created with the old shattered tables, an AMQP consumer would add the triggers and run the backfill for the new account. Adding code to do that in the same place as account creation would mean giving the MySQL SUPER privilege to code that doesn’t otherwise need it; it’s an unnecessary security risk. Our shard table is now eventually consistent!
Once the new code referencing the shard table had been deployed to production, we dropped the source “shattered” tables at a leisurely pace. (You don’t even need SUPER privilege to drop tables with triggers on them!)
It seems like old hat to us now, but this particular schema transformation was the first one we tried using triggers. Instead of taking years, it allowed us to completely transform our database in 10 months with no downtime, while at the same time delivering new features and bug fixes to customers.
But, the best part of our final approach is that customers didn’t notice a thing because we did our data transformation without requiring downtime. We were confident in doing it this way because we were able to verify the data transformation before committing to it. On one occasion we did notice an issue with the transformation and fixed it before the problem could affect customers.
Operationally, finally having a “normal” database schema led to many happy outcomes. Memory usage for our Python backends dropped about 25% (on the 12th; I don’t honestly remember what we did on the 16th to improve things further):
CPU usage and response time from our Python backends also dropped by about 30%. Given other changes that we’ve made over the years, I suspect that most of this is due to the smaller memory footprint and Python’s circular garbage collector. The more Python objects you have in memory, the harder the GC has to work to find cycles and orphaned weak-references. Python’s GC is an entire topic unto itself, which deserves its own future blog post.
Disk usage on the database servers dropped dramatically, particularly for tables that had very little in them per-account (which is most of them). Consider the on-disk size of a table containing a single integer:
mysql> CREATE TABLE foo (an INT PRIMARY KEY); mysql> INSERT INTO foo VALUES (1); ... $ ls -l foo* # in bytes -rw-rw---- 1 mysql mysql 8556 Jan 31 12:42 foo.frm -rw-rw---- 1 mysql mysql 7 Jan 31 12:43 foo.MYD -rw-rw---- 1 mysql mysql 2048 Jan 31 12:43 foo.MYI $ tune2fs -l /dev/mapper/VolGroup00-root | grep Block\ size Block size: 4096 $ ls -ls --block-size=4096 foo* # In minimally-addressable filesystem blocks 3 -rw-rw---- 1 mysql mysql 3 Jan 31 12:42 foo.frm 1 -rw-rw---- 1 mysql mysql 1 Jan 31 12:43 foo.MYD 1 -rw-rw---- 1 mysql mysql 1 Jan 31 12:43 foo.MYI
That’s 20kB to store 32 bits! This overhead adds up quickly both in terms of mostly-empty disk blocks and used inodes. Getting more actual user data per gigabyte of RAM is always a win.
When I started at FreshBooks, we didn’t think we could actually fix this. It’s been two years to the day since we deployed the last “de-org” code, and a large part of our current dev team doesn’t even know what an “org table” is. And that’s a good thing.