The principles of extreme fault tolerance
By Max Englander
PlanetScale is fast and reliable. Our speed is the best in the cloud thanks to our shared-nothing architecture, which lets us use local storage instead of network-attached storage. Our fault tolerance is built on principles, processes, and architectures that are easy to understand but require painstaking work to get right.
We have talked about our speed a lot. Let's talk about why we are reliable.
Principles
Our principles are neither new nor radical. You may find them obvious. Even so, they are foundational for our fault tolerance. Every capability we add, and every optimization we make, is either bound by or born from these principles.
Isolation
- Systems are made from parts that are as physically and logically independent as possible.
- Failures in one part do not cascade into failures in an independent part.
- Parts in the critical path have as few dependencies as possible.
Redundancy
- Each part is copied multiple times, so if one part fails, its copies continue doing its work.
- Copies of each part are themselves isolated from each other.
Static stability
- When something fails, continue operating with the last known good state.
- Overprovision so a failing part's work can be absorbed by its copies.
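To make static stability concrete, here is a minimal Go sketch with illustrative names (not our actual code): a component that periodically refreshes state from a dependency, and keeps serving the last known good copy whenever that dependency fails.

```go
// Minimal sketch of static stability: if a dependency fails, keep serving
// the last known good value instead of failing. Names are illustrative.
package staticstability

import (
	"sync"
	"time"
)

// RoutingTable stands in for any state fetched from a dependency.
type RoutingTable struct {
	Entries map[string]string
}

type StaticallyStableCache struct {
	mu       sync.RWMutex
	lastGood *RoutingTable
	fetch    func() (*RoutingTable, error) // dependency call that may fail
}

// Refresh tries to update state from the dependency. On failure it keeps
// the last known good state rather than propagating the error to readers.
func (c *StaticallyStableCache) Refresh() {
	fresh, err := c.fetch()
	if err != nil {
		return // dependency is down: stay on the last known good state
	}
	c.mu.Lock()
	c.lastGood = fresh
	c.mu.Unlock()
}

// Current always returns whatever state we last knew to be good.
func (c *StaticallyStableCache) Current() *RoutingTable {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.lastGood
}

// RefreshLoop polls in the background so dependency failures never sit in
// the critical path of a read.
func (c *StaticallyStableCache) RefreshLoop(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			c.Refresh()
		case <-stop:
			return
		}
	}
}
```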
Architecture
Our architecture emerges from the principles above.
Control plane
- Provides database management functionality: database creation, billing, etc.
- Composed of parts which are redundant, spread across multiple cloud availability zones.
- Less critical than the data plane, and therefore allowed more dependencies.
- E.g. uses a PlanetScale database to store customer and database metadata.
Data plane
- Stores database data and serves customer application queries.
- Composed of a query routing layer and database clusters.
- Each of these parts is both regionally and zonally redundant and isolated.
- The most critical plane, with fewer dependencies than the control plane.
- Does not depend on the control plane.
Database clusters
- Composed of a primary instance and a minimum of two replicas.
- Each instance is composed of a VM and storage residing in the data plane.
- Instances evenly distributed across three availability zones.
- Automatic failovers from primaries to healthy replicas in response to failures.
- Customers may optionally run copies in read-only regions.
- Enterprise customers may optionally promote read-only regions to primary.
- Extremely critical. Extremely few dependencies.
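The Go sketch below models this layout with illustrative types (not PlanetScale's actual code): a cluster of one primary and at least two replicas spread across availability zones, with failover preferring a healthy replica outside the failed primary's zone.

```go
// Illustrative model of a database cluster spread across availability
// zones, and of choosing a failover candidate in a different, healthy zone.
package cluster

import "errors"

type Role int

const (
	Primary Role = iota
	Replica
)

type Instance struct {
	ID      string
	Zone    string
	Role    Role
	Healthy bool
}

type Cluster struct {
	Instances []Instance // one primary plus at least two replicas
}

// FailoverCandidate returns a healthy replica, preferring one outside the
// failed primary's availability zone.
func (c *Cluster) FailoverCandidate(failedZone string) (*Instance, error) {
	var fallback *Instance
	for i := range c.Instances {
		inst := &c.Instances[i]
		if inst.Role != Replica || !inst.Healthy {
			continue
		}
		if inst.Zone != failedZone {
			return inst, nil // healthy replica in a different zone
		}
		fallback = inst
	}
	if fallback != nil {
		return fallback, nil
	}
	return nil, errors.New("no healthy replica available")
}
```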
Processes
Within this architecture, we apply processes that reinforce our systems' overall fault tolerance.
Always be Failing Over
- Very mature ability to fail over from a failing database primary to a healthy replica.
- Exercise this ability every week on every customer database as we ship changes.
- In the event of failing hardware or a network failure (fairly common in a big system running on the cloud), we automatically and aggressively fail over.
- Query buffering minimizes or eliminates disruption during failovers.
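Here is a rough Go sketch of the query buffering idea, with hypothetical names: queries that arrive while a failover is in progress are held briefly and released once the new primary is ready, instead of being rejected.

```go
// Sketch of query buffering during a failover: queries wait while the
// failover is in progress and are released once it completes.
package buffering

import (
	"context"
	"sync"
)

type QueryBuffer struct {
	mu          sync.Mutex
	failingOver bool
	waiters     []chan struct{}
}

// StartFailover puts the buffer into buffering mode.
func (b *QueryBuffer) StartFailover() {
	b.mu.Lock()
	b.failingOver = true
	b.mu.Unlock()
}

// EndFailover releases all buffered queries.
func (b *QueryBuffer) EndFailover() {
	b.mu.Lock()
	b.failingOver = false
	for _, w := range b.waiters {
		close(w)
	}
	b.waiters = nil
	b.mu.Unlock()
}

// Wait blocks a query while a failover is in progress, or until the
// caller's context expires.
func (b *QueryBuffer) Wait(ctx context.Context) error {
	b.mu.Lock()
	if !b.failingOver {
		b.mu.Unlock()
		return nil // no failover in progress: execute immediately
	}
	w := make(chan struct{})
	b.waiters = append(b.waiters, w)
	b.mu.Unlock()

	select {
	case <-w:
		return nil // failover finished: safe to execute against the new primary
	case <-ctx.Done():
		return ctx.Err()
	}
}
```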
Synchronous replication
- MySQL semi-sync replication, Postgres synchronous commits.
- Commits are stored durably on at least one replica before the primary acknowledges them to the client.
- Enables us to treat replicas as potential primaries, and fail over to them immediately as needed.
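The toy Go model below shows the core of that guarantee (it is a sketch of the idea, not MySQL's or Postgres's implementation): the primary writes locally, fans the commit out to replicas, and only returns success after at least one replica acknowledges durable storage.

```go
// Toy model of semi-synchronous commit: the client is acknowledged only
// after the commit is durable on the primary and at least one replica.
package semisync

import (
	"context"
	"errors"
)

type replicaAck struct {
	err error
}

// commitAndWait writes locally, sends the commit to every replica, and
// returns once any one replica confirms durable storage.
func commitAndWait(ctx context.Context, writeLocal func() error, replicas []func() error) error {
	if err := writeLocal(); err != nil {
		return err
	}
	acks := make(chan replicaAck, len(replicas))
	for _, send := range replicas {
		send := send
		go func() { acks <- replicaAck{err: send()} }()
	}
	var failures int
	for {
		select {
		case a := <-acks:
			if a.err == nil {
				return nil // at least one replica has the commit: safe to ack the client
			}
			failures++
			if failures == len(replicas) {
				return errors.New("no replica acknowledged the commit")
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```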
Progressive delivery
- Data plane changes are shipped gradually to progressively more critical environments.
- Database cluster config and binary changes are shipped database by database using feature flags.
- Release channels allow us to ship changes to dev branches first, and to wait a week or more before shipping those same changes to production branches.
- Minimizes the impact of our own mistakes on our customers.
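A simplified Go sketch of this kind of gating follows; the channel names and percentage-based rollout are illustrative assumptions, not our actual rollout rules. Each database is hashed deterministically into or out of a rollout, and a change only reaches production branches once it has been promoted to that channel.

```go
// Simplified per-database feature flags with release channels.
package rollout

import "hash/fnv"

type Channel int

const (
	DevBranches        Channel = iota // receives changes first
	ProductionBranches                // receives the same changes a week or more later
)

type Flag struct {
	Name           string
	EnabledChannel Channel // most mature channel the change has reached
	RolloutPercent uint32  // gradual rollout within a channel, 0-100
}

// EnabledFor decides whether a given database should receive the change,
// using a stable hash so a database stays in or out of the rollout.
func (f Flag) EnabledFor(databaseID string, ch Channel) bool {
	if ch > f.EnabledChannel {
		return false // change has not reached this channel yet
	}
	h := fnv.New32a()
	h.Write([]byte(f.Name + ":" + databaseID))
	return h.Sum32()%100 < f.RolloutPercent
}
```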
Failure modes
Here is how adherence to the principles, architecture, and processes above enables us to tolerate a variety of failure modes.
Non-query-path failures
- Because our query path has extremely few dependencies, failures outside of the query path do not impact our customers' application queries.
- As an example, a hypothetical failure in one of our cloud providers' Docker registry services might impact our ability to create new database instances, but would not impact existing instances' ability to serve queries or store data.
- Likewise, a failure of our control plane, even a total one, would impact our customers' ability to change their database clusters' settings, but would not impact those clusters' query service.
Cloud provider failures
We run on AWS and GCP, which can and do fail in many different ways.
Instance
- If a failure impacts a primary database instance, we immediately fail over to a replica.
- If a block storage database instance has a failing VM, the elastic volume is detached from that VM and reattached to a new, healthy VM.
- If a PlanetScale Metal database instance has a failing VM, we surge a replacement instance with a new VM and local NVMe drive, and destroy the failing instance once its replacement is healthy.
- A storage failure is handled roughly the same way for block storage and Metal clusters: we spin up a replacement database instance and scale down the unhealthy instance.
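The ordering matters: we add capacity before removing it. The Go sketch below captures that surge-then-destroy flow; the Provisioner interface and its methods are hypothetical stand-ins, not a real PlanetScale or cloud API.

```go
// Hypothetical sketch of the surge-and-replace flow for a failing instance:
// bring up a healthy replacement before destroying the failing instance,
// so the cluster never drops below its redundant minimum.
package replace

import "context"

type InstanceID string

// Provisioner is an illustrative stand-in for whatever provisions and
// tears down database instances.
type Provisioner interface {
	ProvisionReplacement(ctx context.Context, failing InstanceID) (InstanceID, error)
	WaitUntilHealthy(ctx context.Context, id InstanceID) error
	Destroy(ctx context.Context, id InstanceID) error
}

// ReplaceFailingInstance surges a replacement first and only removes the
// failing instance once the replacement is healthy.
func ReplaceFailingInstance(ctx context.Context, p Provisioner, failing InstanceID) error {
	replacement, err := p.ProvisionReplacement(ctx, failing)
	if err != nil {
		return err
	}
	if err := p.WaitUntilHealthy(ctx, replacement); err != nil {
		return err // keep the failing instance until a healthy replacement exists
	}
	return p.Destroy(ctx, failing)
}
```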
Zonal failures
- As with instance-level failures, if a primary database instance resides in an availability zone that is failing, we immediately fail over to a replica in a healthy availability zone.
- Our query routing layer reacts to zonal failures by shifting traffic to instances in healthy zones.
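A minimal Go sketch of zone-aware routing, with illustrative types: once a zone is marked unhealthy, the router simply stops selecting backends in that zone.

```go
// Zone-aware routing: traffic is only sent to backends in healthy zones.
package routing

import (
	"errors"
	"math/rand"
)

type Backend struct {
	Addr string
	Zone string
}

type Router struct {
	backends       []Backend
	unhealthyZones map[string]bool
}

// MarkZoneUnhealthy shifts all future traffic away from a failing zone.
func (r *Router) MarkZoneUnhealthy(zone string) {
	if r.unhealthyZones == nil {
		r.unhealthyZones = map[string]bool{}
	}
	r.unhealthyZones[zone] = true
}

// Pick chooses a backend in a healthy zone at random.
func (r *Router) Pick() (Backend, error) {
	var healthy []Backend
	for _, b := range r.backends {
		if !r.unhealthyZones[b.Zone] {
			healthy = append(healthy, b)
		}
	}
	if len(healthy) == 0 {
		return Backend{}, errors.New("no backends in healthy zones")
	}
	return healthy[rand.Intn(len(healthy))], nil
}
```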
Regional failures
- If an entire region goes down, so do database clusters running in that region.
- However, database clusters running in other regions are unaffected.
- Enterprise customers have the ability to initiate a failover to one of their read-only regions.
PlanetScale-induced failures
- A bug in Vitess or the PlanetScale Kubernetes operator rarely impacts more than one or two customers, thanks to our extensive use of feature flags to roll out changes.
- A failure resulting from an infrastructure change, like a Kubernetes upgrade, can have a bigger impact, but very rarely does because of how rigorously we test and how gradually we roll out changes.