Comparison of Online Schema Change tools
Design and implementation comparison of tools and the resulting user experience
Online Schema Change tools for MySQL share a similar basic design, but differ in implementation. Their differences imply different capabilities, different impact to production, different versatility, and different user experience. In this document, we break down several Online Schema Change tools to see how their design choices lead to different behavior.
We will compare these solutions:
Some notes before we take off:
- All reviewed solutions are free and open source
Facebook's OSCare designated (command line) tools for making schema changes.
- Vitess is a full blown sharding and infrastructure framework, where online schema changes (termed Online DDL) is one of its functions. As such, it is not an "online schema change tool" per se. In the context of Vitess's Online DDL, we assume all production traffic goes through Vitess's MySQL interface (the
VTGateproxy), and never directly to the underlying MySQL servers.
- It's noteworthy that Vitess can manage and run
pt-online-schema-changemigrations. In this comparison we look at the native Vitess Online DDL solution, based on VReplication.
- We discuss the original Facebook online schema change tool. Facebook (now Meta) have since redesigned and re-implemented the tool.
- The original online schema change tool, oak-online-alter-table, as part of the OpenArk kit, is discontinued and is not reviewed here.
pt-online-schema-changederives its design from it and the two share many properties.
- Another notable tool not covered here is Rails LHM.
For simplicity of naming, we will refer to
pt-osc, and to Facebook's OSC as
Comparison by design
This document compares the online schema change solutions by key design elements, and shows how those affect capabilities and usability. We do not compare benchmarks or specific performance metrics. We will share some production experiences, while taking care not to make blanket statements. Each tool behaves differently under different workloads.
Our breakdown compares these design principles:
- Synchronous vs asynchronous
- Trigger-based vs. triggerless
- Coarse vs. fine
- Atomic vs. punctured cut-over
- Controlled vs. arbitrary cut-over schedule
- Primary vs. shared vs. flexible key iteration
- Managed vs. unmanaged
- Auditable/controllable vs. not
- Resumable after failure vs. not
- Revertible after completion vs. not
- Declarative vs. imperative
Make sure to first read How Online Schema Change tools work.
Synchronous vs asynchronous
In a synchronous design, the ghost table is updated within the same transaction that updates the original table. If a new row is
INSERTed to the original table, then within that transaction we also
INSERT a row on the ghost table, and similarly for
DELETE. By virtue of transactions, either both statements apply, or neither does.
This design simplifies much of the logic. Once the table copy is complete, the synchronous design ensures the tables are in perfect sync, and will continue to be so if we were to keep the process going.
The only way to achieve this design is by delegating work to the database itself, so that we're able to hook into a user's transaction. It is impossible for an external tool to apply changes to the ghost table while still operating under the user's transaction. The only means to do that in current MySQL is using table triggers (see next design comparison).
In our reviewed solutions list,
pt-online-schema-change is the only one to use the synchronous approach.
In an asynchronous design, we apply changes to the ghost table asynchronously to the user's transaction. This means there is some lag between the user's
COMMIT and the time at which the reflected change is applied to the ghost table. That lag can be small or high, and is always non-zero. Even at the moment when table copy is complete, there is no certainty that the two tables will be in complete sync. Ongoing production traffic may keep pushing the lag.
The asynchronous approach implies we need to capture the changelog. We need some buffer where we can track this
INSERT or that
DELETE, which can sustain a growing lag.
The two approaches to achieving asynchronous schema changes are:
- Capture data change via triggers and use a helper table for capturing the changelog — used by
- Tail the MySQL binary logs, which are the changelog — used by
However the changelog is captured, a background process needs to iterate the incoming changes, to then apply them onto the ghost table.
Trigger-based vs. triggerless
Triggers are MySQL constructs based on stored routines, which allow us to respond to
DELETE statements, either before or after they have been applied to the table. A trigger is invoked per row changed (created, updated or deleted) on a specific table. The invocation takes place in the same transaction space as the query which operated on the table.
Triggers make it easy to keep things in the database level. Applying an
UPDATE to the ghost table does not leave the server's space. This means we can let MySQL handle the way data is copied. For example, if the schema change converts a column type from
BIGINT, or from
VARCHAR, we don't need to worry about the type conversion, and we let MySQL do its natural conversion.
Stored routines were introduced in MySQL
5.0 and have changed little since. A stored routine (procedure, function, event, trigger) is interpreted by the database. As opposed to other RDBMS, MySQL does not compile or pre-compile stored routines. Every time a stored routine is invoked, the code is interpreted by the database server. This means a user's transaction gets overloaded, or padded, with interpreter logic, and on high workloads this is noticeable to the app.
Another limitation of the trigger-based design is that the triggers must be active throughout the lifetime of the migration. Online schema change solutions attempt to keep migration impact to production as low as possible, and throttle based on production load. However, the triggers must keep running, since they are the means to capturing changes, or else we have data loss. They cannot be throttled.
fb-osc use triggers, but in different approaches:
pt-oscuses the synchronous approach. Therefore, an
INSERTon the original table uses the trigger to also
INSERTinto the ghost table in that same transaction. While this maintains perfect sync between the two tables, it also introduces excessive locking. While normal transactions only intend to compete over writes to the original table, with the added triggers they also have to compete over writes to the ghost table. On high workloads this can bring down application's ability to write to the table to a considerable slowdown or even to a grinding halt. Again, this is workload dependent and for some workloads this works well.
fb-oscuses the asynchronous approach. The tool creates a changelog table, an append-only buffer. Each
DELETEon the original table translates to an
INSERTon the changelog table. This reduces somewhat the locking contention since all changes map to new (and sequentially ascending) rows in the changelog table rather than existing (and random order) rows on the ghost table.
fb-osc uses the asynchronous approach, it still keeps the data in the database itself, and again enjoys MySQL's ability to reliably copy the data from the changelog table onto the ghost table.
In the triggerless approach, the changelog cannot be acquired directly from the running transactions, and must be collected elsewhere.
gh-ost use a triggerless design, and both get the changelog from MySQL's replication stream.
With binary logs enabled in MySQL, once a transaction completes its effects are written to the binary logs, and are made available to the replication stream.
- How the transaction is written in the binary logs depends on the binlog format. Both
gh-ostis more flexible in allowing a
STATEMENTformat on a primary with
ROWformat on a replica.
- Depending on MySQL's configuration, the transaction and binary log entry couple be written atomically, or, there could be lag between the
COMMITand the binlog entry flush.
The binary log is, in essence, the changelog of operations made on the MySQL server. It is the means for replication, where replicas replay those changes to achieve consistency with the primary MySQL server.
It is possible to read the binary logs directly from disk, but both
gh-ost choose to hook to the MySQL server imposing as replicas, and requesting the binary log events over the wire. This allows flexibility in where the schema change logic can run.
Reading data from the binary logs / stream means the schema change solution owns the interpretation or mutation of the data. It is no longer possible to delegate that work to MySQL. For example, texts are always written in UTF in the binary logs, regardless of the character set they originated with in the table. Conversions between types (again illustrating an
varchar change) need to be taken care of by the solution's logic. This makes a triggerless design more complex in handling of data and the numerous ways with which it can mutate.
A triggerless design is inherently asynchronous. Capturing binlog changes from MySQL's replication stream, or even from disk, is not coupled with transaction commit time. There will be some lag between the time where an
UPDATE is in the binary log stream, and time at which it is read from the stream.
Read more on why
gh-ost took a triggerless approach.
Coarse vs. fine
gh-ost all work in a coarse approach. To some extent it's a brute-force method of copying the data:
- They iterate the original table and copy chunks of rows to the ghost table.
- They respond to ongoing changes (whether sycnhronously or asynchronously) and apply them to the ghost table.
The two operations could conflict. What if we applied a row by capturing the changelog, before we copied it via table iteration? What do we do when we then try to copy that chunk of data? In a synchronous operation, what if that actually happens concurrently in competing transactions?
The coarse approach gives the changelog a higher priority over the table copy, and lets both run independently of each other. In a normal migration, there will be many conflicts between the row-copy and the changelog. They are resolved using database queries, simplified to:
INSERT IGNORE INTO ghost_table (...) SELECT ... FROM original_tablefor table copy, and
REPLACE INTO ghost_table (...) ...for changelog capture.
gh-ost only ever runs table copy or applies changelog at a time, and there is never a competition between two concurrent transactions.
Vitess takes a fine approach where it keeps track of:
GTIDwhile writing to the ghost table. It keeps a very small progress table to which it audits the
GTIDfor the latest write to the ghost table. This is done atomically within the same transaction.
- Last copied range —
Vitessevaluates, programmatically, which key range has already been copied. The key range is again persisted in a helper table, atomically with the write to the ghost table.
Vitess operates in a three step loop:
- Copy — opens a transaction with
CONSISTENT SNAPSHOTwhile atomically grabbing the
GTIDat time of the transaction. Begins streaming rows from the table.
- Catchup — scans the changelog.
- Ignore anything not related to the original table
- Ignores anything that is already covered by the
- Ignores operations on rows which have not been iterated yet.
- Fast forward — prepare the next Copy phase, providing it the next unhandled row range, and, since takes non-zero time to achieve, backfill the remaining changelog events as with the Catchup change.
With this fine grained approach, Vitess is able to avoid superfluous writes to the ghost table, and only keep to the minimum necessary data copy.
Atomic vs. punctured cut-over
The cut-over is the most critical part of the migration process. It is when the new schema takes effect, and is a source of concern:
- If we did anything wrong (dropped the wrong index? Wrong column?), that's when the application may break or degrade performance.
- There is non-zero impact time for instating the new table. Queries will either stall, break or both for a hopefully brief period.
gh-ost both utilize an atomic cut-over. In this approach, the app or user's queries on the original table are briefly blocked, but not rejected, during cut-over time. When the ghost table is renamed in place of the original table, the queries get unblocked and proceed to operate on the new table.
On high workloads this is typically noticeable to the app. There will be some delays to queries. Moreover, there will be some pileup of connections. This is because an open connection on MySQL can only run a single query at a time. A
1000 blocked queries will mean a
1000 open connections. The busier the workload is, the higher number of piled up connections will be. Eventually, the database may run out of open connection limit (either by its own configuration or by OS limits), at which time apps and users will experience rejected connections or queries.
pt-osc's cut-over is the simplest of all. Becuase of its synchronous design, it is able to cut-over with a single two-table rename statement, such as
RENAME TABLE original TO _original_old, ghost_table TO original. However, the metadata lock acquired also applies to the triggers, and under heavy workload the cut-over becomes a dangerous point of contention that can lock for long seconds or minutes.
gh-ost uses the asynchronous approach, where there is some lag between the original and ghost tables. It implements an elaborate logic to block queries on the original table while still applying last backlog changes to the ghost tables. Users have shown that under very specific query sequences, in particular monitoring-like queries that happen to access the ghost table just before/at cut-over time, there are scenarios that introduce data loss.
gh-ost also supports a two-step cut-over, as illustrated next.
fb-osc, also using an asynchronous approach, tackles the problem by creating a puncture. It runs a two-step flow where it first renames the original table away, and then, after processing the backlog, renames the ghost table in its place. There's a period in time where the table just does not exist. During that period in time, queries will fail, unexpectedly to the app/users. It is a simple and effective solution to the asycnhronous complexity, at the cost of user/app inconvenience.
Vitess utilizes the fact that traffic goes through its own
VTTablet components. Behind the scenes, it creates a puncture, much like
fb-osc. But during that brief time, it buffers any read/write queries that operate on the migrated table, up to a threshold. The user and app will experience an atomic, blocking cut-over. Queries will see increased latency and will block until cut-over is complete. Beyond some concurrent connections threshold, Vitess will reject queries, and the apps will notice an error.
Controlled vs. arbitrary cut-over schedule
In an arbitrary cut-over schedule design, the Online Schema Change solution decides when to cut-over, and it does so in the earliest possible opportunity.
fb-osc take that approach.
pt-osc will cut-over as soon as table copy is complete, and
fb-osc will do the same, allowing the extra brief time for processing the remaining backlog.
Since cut-over time is the most critical part of the migration, there is advantage to having people at their console and in their comfortable working hours when it takes place. Waking up at
2:00am on a Saturday night is less than ideal for dealing with outages and incidents.
Vitess support arbitrary schedule, but both also allow the user to decide when to cut-over. In this controlled approach, the migration process does not terminate voluntarily by the tool/solution. After table data is copied, and even when all seems to be in order, migration simply continues running. The tools keep tracking and applying the changelog to keep the ghost table in near sync (both
Vitess use the asynchronous approach). At some point, the user may declare they are ready. Assuming conditions allow, or as soon as possible afterwards, the tools will proceed to run the cut-over.
gh-ostprovides a -postpone-cut-over-flag-file command line flag, and cut-over either when the file is removed or the user interactively commands the cut-over.
Vitessmigrations can be postponed with
-postpone-completionflag, and completed via
ALTER VITESS_MIGRATION <uuid> COMPLETE.
Controlled cut-over is a means for higher confidence and peace of mind for the operational user.
Primary vs. shared vs. flexible key iteration
Original table's data is copied to the ghost table by slowly iterating all rows. That takes place according to some order, and, in particular, according to some
UNIQUE KEY order. Ideally, we'd always iterate by
PRIMARY KEY, but sometimes the
PRIMARY KEY itself is being modified by the schema change, which requires us to look for alternatives.
Similarly, when we capture, say, an
UPDATE event in the changelog, we need to uniquely identify the relevant row in the ghost table. There must be some way to ensure we're addressing the correct row, even in face of a change in
The Online Schema Change tools differ in implementation as follows:
fb-oscwill only ever iterate on
PRIMARY KEYorder. It allows original and ghost tables to have different
PRIMARY KEYs, but required the
PRIMARY KEYon the ghost table to only cover columns which exist in the original table.
pt-oscallows any type of
UNIQUE KEY. Any change of
PRIMARY KEYrequires the user to add
--no-check-alter. It allows the ghost table to have
UNIQUE KEYs on columns not present in the original table, and will in such case use unindexed columns for
DELETEoperations, which is a major performance hit on large tables.
gh-ostrequires some shared key. There must be any
PRIMARYincluded) on the original table, which covers exact set of columns, and in same order, as some
UNIQUE KEYon the ghost table. The name of the key does not matter, just the identity and order of covered columns.
Vitessrequires the original table to have some
UNIQUE KEYand the ghost table to have some
UNIQUE KEY. the two do not have to be the same. Vitess requires columns covered by the original table's key to exist in the ghost table, and columns covered by the ghost table's key to exist in the original table.
Auditable/controllable vs. not
pt-osc are non-auditable and uncontrollable. Once they start running, they're on auto-mode and will pursue the migration to completion or until it fails. It's impossible to meanwhile control the tools, change their behavior, query for information etc.
gh-ost are both controllable and auditable.
gh-ost opens a Unix socket, by which the user may communicate with the running migration. It is possible to reconfigure
gh-ost while running (by e.g. setting new throttling thresholds), manually throttling it, getting status, cutting-over and more.
Vitess lets the user communicate with the migration via SQL, with such commands as
SHOW VITESS_MIGRATION LIKE '...' (get information and status),
ALTER VITESS_MIGRATION '...' CANCEL (abort a running migration),
ALTER VITESS_MIGRATION '...' COMPLETE (cut-over a postponed migration) and more.
Vitess provide extra information about the migration, such as estimated progress percent as well as ETA (expected time till completion).
Managed vs. unmanaged
gh-ost are concerned with making an
ALTER TABLE (or emulation thereof) happen, and are not concerned about how the migration was made or scheduled, or what its consequences might be. Management of the migration flow must take place externally to the tools. The tools usually require multiple command line flags to pass information such as the server location, credentials, throttling information and thresholds, etc.
Vitess, as a framework, manages many aspects of the migration:
- Scheduling — Vitess schedules migrations to run, identifies migrations that conflict with each other, moves migration to the next state, etc.
- Discovery — Vitess automatically knows where the migration should execute.
- Credentials — Vitess can create and destroy credentials per migration (and does so for managed
pt-osc), or use its own internal credentials for the task.
- Throttling — built in Vitess, and again based on internal discovery.
- Cleanup — safely and timely removing the migration's artifacts.
- History — tracking all present and past migrations.
Resumable after failure vs. not
pt-osc are single process tools and do not keep track of their own progress outside the process' scope. If the tools break for any reason, or killed by the operating system, the migration is lost. They cannot be resumed from point of interruption. Someone will then also need to clean up the artifact tables. With
fb-osc, there is furthermore the issue of dropping the leftover triggers.
Moreover, they are bound to the MySQL primary the migration was issued on. If the primary fails, the migration is lost. Even if, for example, a replica is promoted as a new primary and the MySQL cluster resumes operation, it is impossible to resume the migration from point of breakage.
Vitess migrations survive both failure of the Vitess process as well as failure of the MySQL server. This is the result of Vitess keeping track of the migrations' progress in helper tables, transactionally committed along with corresponding range and GTID values. Upon failure of either MySQL or Vitess itself, a new Vitess/VTTablet process is able to pick up the migration from the exact place where it left off. This works even if a new MySQL replica/server is promoted as primary. Vitess resumes the flow automatically and the user does not need to take action.
Revertible after completion vs. not
Some migrations turn bad. Perhaps dropping the wrong column or the wrong index. This is why the cut-over is the most critical part of the migration. In such situations, the operator may want to revert the migration, i.e. restore the table to its original format.
Once a migration is complete,
gh-ost exit. What information they had about the migration or the table's structure is lost. Since they all have no sense of migration management, there is no functionality to assist with reverting the migration. The user must contrive a new
ALTER TABLE statement and run a new migration, or alternatively find what the original table was renamed to, and rename it back, losing some data accumulated since migration completed.
Vitess migrations are revertible. Vitess is able to restore the table onto its original schema, without losing accumulated data. Furthermore, it requires no additional information from the user. It provides a
REVERT VITESS_MIGRATION <uuid> SQL interface, which creates a new migration. That migration is expected to run quickly, as it does not need to copy table data. It only needs to hook onto the binary log stream from the point of previous migration termination and apply data from that point on.
Declarative vs. imperative
gh-ost require an
ALTER TABLE statement. This statement is pre-evaluated by the user (or an ORM) and must be valid. The tools have no notion of the purpose nor semantics of the change.
Vitess likewise supports
ALTER TABLE statements. In addition, it also supports a declarative approach. In this approach, the user issues a
CREATE TABLE to indicate what the table should look like. When the migration is scheduled to run,
Vitess compares existing schema with desired schema, computes the required
ALTER TABLE and runs the migration. Possibly the table is already in the desired format and no changes are needed. Possibly the table does not exist in the first place and needs to be created.
Vitess will change the schema as needed to get to the desired state. Declarative migrations are available via the
-declarative strategy flag.
As you can see, there are several crucial differences when it comes to the implementation of Online Schema Change tools. Hopefully this guide gave you insight into the reasoning behind the design choices of the tools we reviewed.
If you're interested in learning more about Vitess schema changes, be sure to check out the Vitess documentation.