How Online Schema Change tools work
Technical overview of the concepts behind online schema change logic
Changing a table's schema is one of the most challenging problems in relational databases, and in MySQL in particular. In today's accelerated and rapid development cycles, engineers find that they need to make schema changes sometimes on a daily basis. Direct
ALTER TABLE is a blocking operation, which renders the table completely inaccessible, even for reads.
MySQL's InnoDB engine supports Online DDL for most operations. However, it is only online on the primary. On replicas, it executes sequentially to other statements, leading to unacceptable replication lag. On the primary, it is aggressive and uncontrollable. It will compete with production traffic over resources. InnoDB also offers Instant DDL, which is available for a small subset of schema change operations. See the full breakdown on the MySQL documentation.
In today's world, the vast majority of users use one of the Online Schema Change tools. These are external, small tools or scripts, that hook onto the database and run a migration through emulation. Common external tools are gh-ost, pt-online-schema-change and Facebook's OSC. This document describes the general technical operation of these tools.
It's worth noting an alternative to using Online Schema Change tools, via rolling changes on replicas. In this technique, the user runs either straight or Online DDL on replicas, typically one or a few replicas at a time. Then, once all replicas have been updated with the new schema, the user demotes the primary and promotes one of the replicas. The demoted primary, now a replica, can then be updated, as well. This technique requires no external tooling. However, it takes substantially longer to complete the overall operation and requires running a promotion, which is typically disruptive to the applications.
The Online Schema Change tools have many differences, covered in our Comparison of Online Schema change tools doc. However, they all share a similar fundamental logic and flow.
All tools work by emulating a schema change. The general flow is:
- Create a new, empty table, in the likeness of the original table. We title this the ghost table.
ALTERthe ghost table. Since the table is empty, there is no overhead to this operation.
- Validate the structural change is compatible with tooling requirements.
- Analyze the diff.
- Begin a long running process of copying existing rows from the original tables to the ghost table. Rows are copied in small batches.
- Capture or react to ongoing changes to the original table, and continuously apply them onto the ghost table.
- Monitor general database and replication metric, and throttle so as to prioritize production traffic as needed.
- When the existing data copy is complete, the migration is generally considered as ready to cut-over, give or take some small backlog or state of the replication topology.
- Final step is the cut-over: renaming away of the original table, and renaming the ghost table in its place. Up to some locking or small table outage time, the users and apps are largely ignorant that the table has been swapped under their feet.
In the following sections, we break down these steps to understand better the requirements, limitations, and correctness of the flow.
The schema migration flow never changes or updates the original table. We only ever operate on the ghost table. We create a ghost table with an exact copy of the original table's schema, then we run the user's
ALTER TABLE command on the ghost table.
First and foremost, we are able to validate the the statement is valid. For example, if the user requests an
ALTER TABLE foo DROP COLUMN bar, executing that statement on the ghost table successfully validates that the column
bar did, in fact, exist in the first place. Similarly we can catch any lower level errors such as an oversized index, invalid
DEFAULT values, etc.
This is also a place to check whether we want to enforce a change to
On top of validating the raw MySQL change, all Online Schema Change tools come with their own restrictions. A couple notable restrictions are:
Existence of relevant keys — All tools copy data from the original table onto the ghost table by slow iteration. The iteration utilizes a
UNIQUE KEY, ideally the
PRIMARY KEY. The different tools have somewhat different constraints, but they all require some form of
PRIMARY KEYto exist on the ghost table, maybe even require a shared key between the two tables. The existence of a key on the target table is required to satisfy the following:
- When applying an ongoing change on the original table, for example an
UPDATE, how do we efficiently find the row on the target table?
- When copying a chunk of data from the original table onto the ghost table, how do we resolve conflicts? By way of example, by the time we copy the last chunk of data, we may have already captured and applied the
INSERTchange, so that the rows already exists on the ghost table. More on that logic later on, but it is the
UNIQUE KEYthat helps us resolve the conflict.
- When applying an ongoing change on the original table, for example an
FOREIGN KEYconstraints — Foreign keys do not play well with the Online Schema Change flow, and there are either severe limitations on, or rejection of, tables that have foreign key constraints.
The tools compare the two tables to determine what makes them similar and what makes them different. Most importantly, the tools need to understand which columns are shared between the two tables that should be copied from the original table to the ghost table.
The tools need to take into consideration a potential rename of a column, or whether columns are
GENERATED, or turned from
GENERATED to normal columns or vice versa.
This is the single most time consuming part of the migration. Transferring all rows from one table to another can be a long running task even if running on a completely stale server. On large tables in production, the row copy could take as long as hours, days or even weeks.
Copy is done via a statement of the form:
INSERT [IGNORE] INTO <ghost_table> (<columns...>) SELECT <columns...> FROM <original_table> WHERE <range...>
It is impractical, or inefficient, to copy all rows from the original table to the ghost table in one pass. The tools work by iterating the original table start-to-end using some chosen
UNIQUE KEY (typically the
PRIMARY KEY). They only copy
n rows at a time, where common values are around
1000 rows per chunk, though based on row data size and on workload, these can also be on the range
We reasonably know where the table starts. But where does it "end"? Consider a table with an
AUTO_INCREMENT PRIMARY KEY column. As we work to copy the table rows, the app writes more rows into the table, pushing the end farther and farther. This creates an illusion of a never ending flow. However, this is not the case. The tools evaluate the first and last rows (as ordered by the iteration key), and only copy rows in that range. Any excessive rows beyond that range will be captured by following the ongoing changes. It is imperative then to begin capturing changes before evaluating row range.
100 rows at a time will very likely yield with an image inconsistent with a single snapshot of all rows. As we copy the first
100 rows, new rows can be added, removed or updated in the next
100 rows range. As we advance in our copy phase, the data gets more and more skewed from the original imaginary snapshot from the migration starting point. That is in fact not a problem, as the skew serves us to reflect an even more updated image of the data. We are interested not in the data snapshot at the start of the migration, but at the end of it. We want the ghost table to be aligned with the original table at such point when we can swap the two without data loss. Applying the ongoing changes is of course the key to achieving that. More on the interaction of row copy and ongoing changes later on.
It's simple to consider how we may iterate a table with, say, a
AUTO_INCREMENT PRIMARY KEY. It's easy to imagine iterating rows
[201..300] and so forth. But we must then also consider real life situations:
- We might have huge punctures in the
200..5000000may be altogether missing, and iterating that range could be extremely wasteful.
- Can we miss out on unforseen gaps? If we copied row
100, does it follow that the next row is
101? Could there be some
100.5? Not if the data type is an
INT, but what if it's another type?
- Our iteration key may not necessarily be numeric. It may in fact contain multiple columns. How can you iterate a
PRIMARY KEY (file_name, submitted_at)where
We illustrate below the pseudo code used by Online Schema Change tools to iterate a table using a key with any number and type of columns. For clarity, let's assume two columns
(file_name, submitted_at); the logic remains similar, though somewhat more detailed, the more columns we iterate by.
- Grab the minimum value via:
SELECT file_name, submitted_at FROM original_table ORDER BY file_name, submitted_at ASC LIMIT 1 INTO @first_entry_file_name, @first_entry_submitted_at`
- Grab the maximum value via:
SELECT file_name, submitted_at FROM original_table ORDER BY file_name DESC, submitted_at DESC LIMIT 1 INTO @last_entry_file_name, @last_entry_submitted_at`
(@previous_iteration_end_entry_file_name, @previous_iteration_end_entry_submitted_at) := (@first_entry_file_name, @first_entry_submitted_at)and repeat.
- For any iteration: evaluate the next range:
SELECT file_name, submitted_at FROM ( SELECT file_name, submitted_at FROM original_table WHERE (file_name, submitted_at) > (@previous_iteration_end_entry_file_name, @previous_iteration_end_entry_submitted_at) AND (file_name, submitted_at) <>= (@last_entry_file_name, @last_entry_submitted_at) ORDER BY file_name, submitted_at LIMIT 100 ) ORDER BY file_name DESC, submitted_at DESC LIMIT 1 INTO @this_iteration_end_entry_file_name, @this_iteration_end_entry_submitted_at
In words, get the last
file_name, submitted_at values counting
100 rows starting immediately after end of previous iteration, but also don't exceed our "end" position.
- Follow up with copy:
INSERT INTO ghost_table (<columns...>) SELECT <columns...> FROM original_table WHERE (file_name, submitted_at) > (@previous_iteration_end_entry_file_name, @previous_iteration_end_entry_submitted_at) AND (file_name, submitted_at) <= (@this_iteration_end_entry_file_name, @this_iteration_end_entry_submitted_at) ORDER BY file_name, submitted_at
The two queries could have been executed as one, but by first evaluating the range using only the key columns, this process is actually more efficient.
(@previous_iteration_end_entry_file_name, @previous_iteration_end_entry_submitted_at) := (@this_iteration_end_entry_file_name, @this_iteration_end_entry_submitted_at)and repeat.
The above pseudo-code is just a simplification of the process.
There are several extra considerations that go into the full process, such as:
- Care needed to capture the very first row in iteration.
- Missing from the above is identifying the stop condition (empty row set found).
- MySQL actually performs poorly with tuple comparison. The condition
(a, b) > (x, y)must actually be phrased as
a > x OR a = x AND b > y. This explodes fast the more columns are at play.
- Most implementations will actually use a
INSERT IGNORErather than
INSERT, to play well with ongoing updates.
Tools will capture ongoing changes by either utilizing database triggers or by hooking into the replication stream. This way or another, the changes must be applied to the ghost table.
As with row copy, we are only interested in copying a subset of columns that are shared between the tables. Again, care needed to consider renamed or
The ongoing changes always represent the latest and most up-to-date image of the data. As such, they must override any prior data already captured on the ghost table. As an example, consider we have just copied our first iteration of
100 rows, and then observe an
UPDATE on, say, row
32. This update must override whatever value we have already copied into the ghost table for row
32. To this effect, row copy and applying changes take lower and higher priorities, respectively, over conflicts.
Some of the tools achieve this like so:
- Row copy only issues a
INSERT IGNORE INTO <ghost_table> ... SELECT ... FROM <original_table>
- Changes applied via
REPLACE INTO <ghost_table> ...
Consider a counter example to our previous scenario. Imagine we're now intercepting a change to row
1024, but row copy has not reached that row yet. We apply the change. When we finally reach and copy the range that includes row
1024, that specific row does not get overwritten. Now, what if by that time of row copy the row has been updated again? Isn't the row copy now even more up-to-date than the existing
1024 row on the ghost table? It is, in fact. But if that is the case, then there must have been a change, an
UPDATE to row
1024, that we just haven't applied yet, and are yet to discover and apply via
Different tools tackle this issue differently. A synchronous solution like
pt-online-schema-change is always up-to-date with ongoing changes, while Vitess's VReplication takes a completely different approach by marking GTID positions that have been or not have been already covered. The above should serve as a good illustration to the flow, regardless of the actual implementation.
Most tools run some form of monitoring over the affected server or of its replicas, or both. Either sequentially to the copying process or concurrently, the tools may choose to throttle based on one or more metrics.
The single most important metric, by far, is replication lag. How the tools discover the replicas, choose which replicas to consider, respond to topology changes, is a matter of implementation and of a different discussion. However, the tools that do observe replicas all try to maintain as low replication lag as possible, and will step back once that lag grows. What the threshold is, is configurable. On extremely busy and sensitive systems this may be as low as
1 second. Other environments may tolerate minutes of lag.
Not all tools can throttle completely. Specifically, trigger based tools can only throttle the row copy process, but not the ongoing changes tracking. Non-trigger based tools are able to completely halt any processing, or choose to take lower precedence over production traffic. For more information about trigger based vs non-trigger based tools, check out our doc on Comparison of online schema change tools.
Generally speaking, in a normal and successful migration flow, the migration is ready to cut-over once row-copy is complete. Different tools may still have some backlog to process. Under heavy workload, the backlog may take substantial time to process.
However, the general state of the replication topology can also be a factor. It is a good idea to only cut-over when replication lag is minimal, to avoid schema confusion between primary and replicas. Also, the current workload on the primary plays an important part, as discussed in the next section.
By far the most critical point of the migration, this is where the changes take effect and the new, changed schema, tastes first production traffic.
It's first important to realize some migrations do not go as planned. Some drop the wrong column, or change the wrong index, and such is life. The cut-over point is where the user might expect trouble. When some queries suddenly fail, or others suddenly perform poorly. Some of the tools will cut-over automatically when they deem the migration is complete, and others will allow the user to choose the timing.
Replacing the table is also a point of contention. MySQL-wise, we need to acquire a table lock, also known as a metadata lock. As internal MySQL implementation goes, one cannot acquire a metadata lock while a query still runs on the table. This means a long running query may block the cut-over for a substantial time. This is why it's good to either consult with the running workload, or let the user choose the cut-over time.
Once metadata lock is acquired, application connections start noticing something goes on. Depending on the cut-over implementation, the metadata lock can be held up to a few seconds. During that time more queries will pile up, leading to increased number of connections on the server. It's not unreasonable that the number of connections will surpass the configured database connections threshold, at which time no new connections and queries are allowed. This is again obviously noticeable to the apps. Trigger based tools suffer from an even increased metadata lock, and under heavy workloads this can cause seconds or minutes long stalls.
Depending on the implementation, the tables will either swap atomically, such that the connections are suddenly unblocked and "magically" proceed to run on the new (previously ghost) table; or, there will be a "puncture" in time, where the table ceases to exist, which leads to query errors for a brief time.
At long last, the migration is complete with the ghost table assuming the role of the original table. Production traffic proceeds to run on the new table. The old table is then normally left for a while, to be dropped later on by the database engineer.
Most tools stop their operation and their interest at cut-over. Vitess, in particular, goes beyond that point to:
- allow the migration to be reverted
- clean up the left over tables
For more information about how Vitess handles schema changes, check out the Vitess Schema Changes documentation.