Navigation

Blog|Engineering

The State of Online Schema Migrations in MySQL

By Shlomi Noach |

How do you run non-blocking schema changes in MySQL? This is an eternal question. With a plethora of 3rd party solutions and with recent advancements in MySQL, it's difficult to track which solution is preferable for a given schema migration. In this post, we provide a high level overview of the state of MySQL online schema migrations in 2024. We limit the discussion to ALTER TABLE statements, as other DDL statements are typically fast (DROP TABLE is somewhat of an exception, but out of scope of this post).

We'll first examine the native MySQL options: INPLACE and INSTANT. For reference, see Online DDL Operations MySQL 8.0 documentation.

INPLACE, aka InnoDB Online DDL

This is MySQL's first take on non-blocking schema changes. Some types of ALTER TABLE (see above link for exhaustive list of supported changes) are eligible to run with ALGORITHM=INPLACE. An INPLACE schema change is technically non-blocking, with quite a few caveats:

  • On the server where the query is submitted, normally the primary database server, DML queries (SELECT, INSERT, UPDATE, DELETE, ...) are non-blocking and may proceed to execute. Other DDL statements will block, and that's expected.
  • The operation is resource greedy: the MySQL server will use as much CPU and disk I/O to complete the change as it can. This can and will impact performance on busy servers.
  • It requires extra disk space, up to as much disk space as the original table.
  • It is uninterruptible. The only way to abort is to kill the query aggressively. This then leads to a further massive cleanup operation, consuming more disk I/O.
  • On replica servers, the operation is NOT non-blocking. Meaning if the ALTER TABLE took 3 hours on the primary server, then from the moment it completes you can expect replication to stall while applying that same change for the next 3 hours or so, creating a massive 3 hour lag.

The replication issue is a deal breaker for most. One way around it is to run the ALTER TABLE on the primary with SQL_LOG_BIN=0 so that it does not replicate. Then, run it similarly individually on each replica. This technique works, but can be the cause for inconsistencies. Did you track all servers? What if you subsequently restore (or bootstrap) a server from backup, where the change never took place? How do you track that? Moreover, this technique will take n times longer to complete, as you need to run the change individually for each server. You may parallelize some of the work, but probably not all of it.

INPLACE conclusions

For these reasons we find that INPLACE is not a good option for non-blocking changes.

INSTANT schema changes

Instant schema changes are almost a holy grail in the world of databases, and where it works, it's the next best thing after pizza (pending any bugs). MySQL offers support for some schema changes to run with ALGORITHM=INSTANT. Originally contributed to MySQL by Tencent six years ago, INSTANT DDL only supported a single type of change: ADD COLUMN. Later on MySQL added support for more changes, like expanding an enum column, or adding and dropping VIRTUAL columns. Recently (one year ago), MySQL 8.0.29 added support for arbitrary ADD COLUMN and DROP COLUMN support.

INSTANT truly runs instantly. It does not need to copy a table, does not need extra disk space, does not hammer the CPU. There's nothing to interrupt because the operation terminates before you've blinked. It also runs instantly on the replicas.

It sounds perfect! And it mostly is, where supported, and with a bit of nuance. Consider again the documentation for supported operations. Looking closely, you can see these types of supported changes:

  • Changing a column default value.
  • Adding/removing VIRTUAL columns.
  • Modify an enum definition.
  • And more.

What's shared to these changes is that they're all metadata changes. They do not affect existing rows, do not modify the data, do not restructure the table, do not affect indexes. ADD COLUMN & DROP COLUMN are the only supported changes that actually affect table data or how the data is structured. As another caveat, you cannot DROP COLUMN using INSTANT DDL if that column participates in an index.

So you can change a column's default value from 0 to 1, but you cannot make a nullable column non-nullable. You can add and drop GENERATED VIRTUAL columns, but cannot add and drop GENERATED STORED columns. You can modify an enum definition, but you cannot modify a column's type from int to bigint.

The list of unsupported changes includes:

  • Changing a column's data type.
  • Adding a column with non-literal default value.
  • Adding indexes.
  • Modifying a PRIMARY KEY definition.
  • Adding/removing foreign keys.
  • Changing a table's character set.
  • Making partitioning changes.

Which is to say, there's a long way to go before INSTANT DDL can satisfy the common needs of schema changes. Where possible, INSTANT DDL is wonderful, and in many situations is the preferable and recommended way to go.

INSTANT risks

INSTANT first appears to be risk-free. In most situations, it is! Even if you make a mistake, it can be corrected with a counter-INSTANT operation. That's true for most changes, except when data is destroyed. At this time, the one destructive statement support by INSTANT DDL is DROP COLUMN (for "real", non VIRTUAL columns).

Dropping a column has two main risks to it:

  1. The obvious risk of losing important data, if executed prematurely or accidentally.
  2. The risk of breaking existing queries.

Losing data can obviously be a massive incident, the cause for outage and for long hours or days of recovery. But why is this an INSTANT risk in particular? It's the same damage whether INSTANT or not, right?

The answer is with the human behavior of always choosing INSTANT where possible. You're an instant away from destroying your data, and with no barriers to hold you back, nor a mechanism (short of backups and delayed replicas) to take you back to safety. We'll discuss this shortly as we introduce the concept of Revertibility (specifically in Vitess).

How about breaking existing queries? Maybe the data is truly expendable, or safely aggregated elsewhere, but perhaps a bunch of SELECT or INSERT queries still reference the column?

MySQL offers invisible columns as a means to emulate how your table might look like without a given column. However, it is limited. It only affects queries that do not explicitly use the column name, such as SELECT * FROM my_table .... or INSERT INTO my_table VALUE (...). But any SELECT the_column FROM my_table query still has full access to columns. In today's world, SELECT * and blind INSERT queries are not as common. Frameworks, tooling, and modern engineering paradigms all tend to be explicit and fully qualified. Invisible columns does not help here.

If dropping the column did cause queries to break, you will then need to either fix all the queries, or attempt to re-introduce the column. Let's now discuss revertibility.

Revertibility

What are your options for undoing a change? For switching back to the previous schema? Let's illustrate using two simple examples.

Say your change was to ALTER TABLE my_table ADD COLUMN name .... This looks harmless, and yet can cause downtime. name can be a common column name. Queries selecting name in a multi-table statement, such as SELECT name, value FROM my_table JOIN another_table USING ..., could fail due to the new ambiguity of name column.

The anti-change for ADD COLUMN is a DROP COLUMN, and since both are supported by INSTANT DDL, chances are you'll be able to recover quickly and relatively safely.

What if your change was a DROP COLUMN? Lost data aside, what is the anti-change you'd apply to restore the previous schema? Not only data was lost, but also metadata. What was the column type? Length? Was it nullable? That information cannot be inferred unless you have the previous schema. In all likelihood, you use version control to manage your schema and are thus able to extract the previous definition. It is worth pointing out, though, that crafting the anti-change of a schema migration is nontrivial.

INSTANT conclusions

Where possible, INSTANT is often the best approach for making online schema changes. However, it is too limited at this time and does not support the majority of common schema changes. It does not provide revertibility in case of data destruction. The MySQL team does not publish concrete plans for INSTANT DDL support in future versions of MySQL.

Solutions external to MySQL

A number of 3rd party tools is available today for running online schema changes for MySQL. We will focus on Vitess, the technology behind PlanetScale's non-blocking schema changes. Other 3rd party tools include gh-ost, pt-online-schema-change, recent newcomer spirit, and others.

These tools all share a similar basic design, but operate differently. The major characteristics share to all are:

  • They mimic an ALTER TABLE by creating a shadow table with the new schema and slowly copying over data.
  • They can and often will take longer time to complete as compared with a native MySQL ALTER TABLE.
  • They require extra disk space, about as much as the existing table (less if you consider fragmentation, more if you're adding bloated indexes, etc.)
  • They cause binary log bloating (essentially the entire table content goes through the binary logs).
  • They respect production workload, and will pause or throttle as needed so as to give way to production traffic (hence they're likely to run longer).
  • They operate in small batches of changes, hence are able to keep replication lag to a minimum (and throttle based on lag).
  • They are interruptible: the operation can be aborted at no immediate cost (cleanup can be done at a later stage).
  • They are capable of handling almost every single kind of schema change.
    • Most have foreign key limitations.
    • Some partitioning options are not recommended, or are plain incorrect to run using these tools.

To put it out of the way: if your table has a color enum('red','green','blue') column, and you want to add a new enum value, making it color enum('red','green','blue','orange'), you're better off using INSTANT DDL. There are a handful such cases, that are supported by INSTANT DDL as mentioned above, and where it just doesn't make sense to spend hours of migration.

However, for the (still vast) majority of changes, these are still the go-to solutions. First, of course, we've already established that neither INSTANT nor INPLACE cover all types of changes (they cover a minority of possible changes). But this also leads to an emerging behavior: maintaining two different techniques in your flow/automation creates more complexity. If you already have to use one of the 3rd party solutions, you may as well use it all the time.

Both vitess and spirit go an extra mile and can auto detect when a migration can be fulfilled using INSTANT DDL, which means you don't need to think about it or be aware of which particular version supports which changes.

vitess further supports revertibility as first class citizen, able to not only revert back to the original schema, but also to preserve the would-be lost data, while still accounting for any newly added, updated, or removed data since the change.

Note on partitioning

Partitioning is a strange beast, and implemented in MySQL by creating a "small" table per partition. As such, operations on partitions are really operations on sets of tables. Some partitioning related changes should only be served by MySQL. Such is a DROP PARTITION statement for e.g. RANGE partitioned table. Some other partitioning changes are better served by MySQL, and some are best served by online schema change tools.

3rd party conclusions

Most use cases are best served (or only well served) by 3rd party online schema change tools, and those are still the way to go for the foreseeable future.