The relational model has been in existence for over forty years, a rare feat in the software development world. Relational databases commonly serve as backends for applications of every size, from small apps to the largest products in the world. And while relational databases have optimized well for speed, concurrency, latency, and overall read/write performance, they have not adapted as well to metadata changes at scale. Specifically, many organizations struggle to maintain development velocity, agility, and confidence when deploying schema changes.
Thirty years ago, developers would plan a schema change months ahead. One would only deliver a handful of changes a year. Developers would work with the database administrators to approve schema changes and plan the transition into the new model. Companies would take the system down for maintenance, sometimes for hours or days, to apply those changes.
The current landscape
Today, those maintenance windows are unacceptable to most organizations. Users expect services to be highly available and operational around the clock. At the same time, today’s developers are used to accelerated deployment flows and want to continuously deploy schema changes, also known as schema migrations, sometimes multiple times per day.
But relational databases have not stepped up to meet developers’ needs. Schema changes pose an operational barrier to continuous deployment and remain alien to developers’ flows. As a result, developers treat relational databases differently from other production systems and evolve patterns that try to minimize schema migrations, or avoid them altogether by modifying their code in suboptimal ways. Schema deployments for large tables frequently remain a manual endeavor and are considered risky operations.
We believe this to be the result of negligence, and that relational databases can and should meet modern development practices for schema deployments, thus allowing for more automation, control, and velocity and, as a result, more confidence in the process.
The suggested paradigm
We believe the following core tenets to be essential to schema migrations.
Some relational databases, for some types of schema migrations, place a write lock on the migrated table, effectively rendering it inaccessible to the app. This in turn commonly manifests as an outage. An ALTER TABLE migration on a large table can take hours or even days. These blocking migrations are unacceptable to modern development flows and modern apps, and databases must offer non-blocking migrations that allow full access to the migrated table throughout the operation.
Even when available, non-blocking schema changes are typically aggressive in resource consumption and will attempt to use as much disk I/O, memory, and CPU as they can in order to run to completion. This competes with the resources needed by the apps and often leads to degraded app performance. Schema changes should be able to yield to the app’s needs.
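As a minimal sketch of what yielding to the app could look like, the following copies rows in small chunks and backs off whenever a load signal exceeds a threshold. All names here (`throttled_copy`, `current_load`, `max_load`) are hypothetical, and the load signal is an assumption: a real database might use replication lag, active connections, or another metric.

```python
import time
from typing import Callable, List, Sequence


def throttled_copy(
    rows: Sequence[int],
    write_chunk: Callable[[List[int]], None],
    current_load: Callable[[], float],
    chunk_size: int = 1000,
    max_load: float = 0.8,
    backoff_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Copy rows chunk by chunk, pausing while the app needs the resources.

    current_load is an assumed callback returning a value between 0 and 1.
    """
    copied = 0
    for start in range(0, len(rows), chunk_size):
        # Yield to production traffic: wait until load drops below the limit.
        while current_load() > max_load:
            sleep(backoff_seconds)
        chunk = list(rows[start:start + chunk_size])
        write_chunk(chunk)
        copied += len(chunk)
    return copied
```

The migration takes longer under load, which is the intended trade-off: the app's traffic keeps priority over the schema change.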
Atomic or transactional migrations are appreciated, but they require a connection to be held open for the duration of the migration, which can be hours or days. Deployment tools, or even scripts, should not be required to hold on to those connections for such long periods, and the behavior upon connection loss is normally not what the developer wants. Databases should be able to receive a schema change request and then run it asynchronously.
Migrations may conflict with each other, either due to running on the same tables or simply because of the excessive resource consumption incurred. Databases should provide a mechanism for scheduling migrations. The database should be able to determine which migrations are safe to run concurrently and which are not.
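One simple way to reason about such scheduling is a greedy planner that never runs two migrations on the same table at once and caps overall concurrency. This is only an illustrative sketch: `Migration`, `plan_batches`, and `max_concurrent` are hypothetical names, and a real scheduler would likely weigh many more signals than the table name.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Migration:
    name: str
    table: str


def plan_batches(queue: List[Migration], max_concurrent: int = 2) -> List[List[Migration]]:
    """Group queued migrations into batches that may run concurrently.

    Assumed rules: migrations in one batch touch distinct tables, and a
    batch holds at most max_concurrent migrations to bound resource use.
    Migrations on the same table keep their submission order.
    """
    batches: List[List[Migration]] = []
    for migration in queue:
        for batch in batches:
            has_room = len(batch) < max_concurrent
            no_conflict = all(other.table != migration.table for other in batch)
            if has_room and no_conflict:
                batch.append(migration)
                break
        else:
            # No existing batch can take it, so it waits for a later batch.
            batches.append([migration])
    return batches
```

Batches would run one after another, with the migrations inside a batch running side by side.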
Even if lightweight, a migration still has an impact and a footprint, most notably in disk space and disk I/O. Sometimes that impact must be stopped. It should be possible to interrupt a running migration at no immediate cost. A rollback that takes several hours, or a lengthy flushing of pages, are examples of undesired cost, incurred at the very time resources are needed the most.
The database should be able to provide an estimate of a long migration’s progress or ETA.
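A rough estimate can be derived from the rows copied so far, assuming the remaining rows copy at the same average rate. That assumption is ours, as is the function name `estimate_progress`; real migrations rarely proceed at a constant pace, so this is a sketch rather than a definitive method.

```python
from typing import Optional, Tuple


def estimate_progress(
    rows_copied: int, total_rows: int, elapsed_seconds: float
) -> Tuple[float, Optional[float]]:
    """Return (percent complete, estimated seconds remaining).

    The ETA is None when nothing has been copied yet, since there is no
    observed rate to extrapolate from.
    """
    if total_rows <= 0:
        return 100.0, 0.0  # an empty table is trivially complete
    if rows_copied <= 0:
        return 0.0, None
    fraction = min(rows_copied / total_rows, 1.0)
    eta_seconds = elapsed_seconds * (1.0 - fraction) / fraction
    return fraction * 100.0, eta_seconds
```

For example, a migration that has copied 250 of 1,000 rows in 60 seconds is 25% complete, with roughly 180 seconds remaining under this linear assumption.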
A database should be able to resume a migration interrupted due to database failure. As an example, it should be possible for an operator to reboot the database server without compromising a days-long migration. The same is true for unexpected failures. Operations teams should not postpone maintenance work due to developers’ deployments, and developers should not withhold deployments due to planned operations.
If a database offers a multi-node design, i.e., a cluster of nodes, then migrations should be agnostic to cross-node failovers and should not be bound to the specific node where they started.
Schema migrations should be treated as first-class deployments. As such, the database system should be able to undeploy a migration, thus restoring the pre-migration schema. Developers should have the confidence that if a schema deployment goes wrong, they can revert it and go back to a known good state.
Much like code deployments, schema deployments should be idempotent. The developer or the deployment system should be able to submit the same migration request twice (or more) in a row, and the database should resolve the excessive requests to ensure the migration runs once, as the developer would expect.
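One way to picture this is a registry that deduplicates submissions by a caller-supplied request identifier. This is a minimal in-memory sketch: `MigrationRegistry`, `submit`, and `request_id` are hypothetical names, and how a real database would identify duplicate requests is an assumption here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MigrationRegistry:
    """Hypothetical in-memory stand-in for a database's migration queue."""

    _by_request_id: Dict[str, int] = field(default_factory=dict)
    queued: List[Tuple[int, str]] = field(default_factory=list)

    def submit(self, request_id: str, statement: str) -> Tuple[int, bool]:
        """Queue a migration once per request_id.

        Returns (migration id, whether this call created a new migration).
        A repeated request_id resolves to the migration already queued.
        """
        existing = self._by_request_id.get(request_id)
        if existing is not None:
            return existing, False
        migration_id = len(self.queued) + 1
        self._by_request_id[request_id] = migration_id
        self.queued.append((migration_id, statement))
        return migration_id, True
```

A deployment system that retries after a timeout can resubmit with the same `request_id` and get back the original migration rather than queuing a second one.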
Databases should potentially support declarative schema deployments, where a developer submits a desired state rather than an imperative command. Declarative schema deployments are idempotent by nature.
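To illustrate why a declarative request is idempotent, the sketch below diffs a current column set against a desired one and derives the operations needed. The representation (a column name mapped to a type string) and the names `diff_columns` and `apply_ops` are assumptions made for this example; a real schema has indexes, constraints, and much more.

```python
from typing import Dict, List, Optional, Tuple

Op = Tuple[str, str, Optional[str]]  # (operation, column name, column type)


def diff_columns(current: Dict[str, str], desired: Dict[str, str]) -> List[Op]:
    """Compute the operations that turn the current columns into the desired ones."""
    ops: List[Op] = []
    for name in current:
        if name not in desired:
            ops.append(("drop", name, None))
    for name, column_type in desired.items():
        if name not in current:
            ops.append(("add", name, column_type))
        elif current[name] != column_type:
            ops.append(("modify", name, column_type))
    return ops


def apply_ops(current: Dict[str, str], ops: List[Op]) -> Dict[str, str]:
    """Apply operations to a copy of the current columns."""
    result = dict(current)
    for op, name, column_type in ops:
        if op == "drop":
            del result[name]
        else:  # "add" and "modify" both set the column's type
            result[name] = column_type
    return result
```

Once the desired state has been reached, submitting it again yields an empty list of operations, so a repeated request has nothing left to do.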
The resulting flow
With these principles in place, developers can be confident that their schema migrations will not put substantial load on production servers, that their deployment tools will not have to block for hours while running the change, and that the database will gracefully schedule their migration while other deployments are in progress. They can track the progress of the migration at any time and interrupt it if the need arises, at no additional cost.
Developers are free from operational considerations. They do not need to be concerned about planned maintenance or unplanned failovers.
They can feel confident in their deployments knowing they can redeploy their change, again and again, or revert it altogether and go back to the last known state in case of trouble.
Together, these suggest a relaxed development flow that gives developers ownership of their schema changes and the confidence to deploy with velocity.