Report on our investigation of the 2025-10-20 incident in AWS us-east-1

By Richard Crowley

On 2025-10-20, an incident affected PlanetScale. It began with a DNS misconfiguration at one of PlanetScale’s service providers and was followed by several hours of capacity constraints and network instability. The incident occurred in two distinct phases: the first affected the PlanetScale control plane, and the second affected some database branches hosted in AWS us-east-1.

Our design focus on isolation and static stability put us in a good position to weather this incident with minimal impact. During the first phase of the incident, our control plane was impacted, but customer database branches remained fully available. During the second phase, some customer database branches in AWS us-east-1 were impacted by network partitions.

Phase 1

PlanetScale engineers were alerted to problems with our control plane at 7:13 UTC. Our continuous testing in production had started to fail. These tests cover a wide range of PlanetScale’s functionality, and at this point they were unable to create new database branches. Further investigation showed this was a near-total control plane outage. The service responsible for creating, resizing, and configuring database branches, which is hosted in AWS us-east-1, was unavailable. It depends on our internal secret-distribution service, which depends on Amazon S3, which depends on AWS STS, which was impacted by the Amazon DynamoDB outage.

Throughout this period, no database branches lost capacity or connectivity.

The PlanetScale dashboard was intermittently available during this phase of the incident. It’s hosted by a provider that, like the PlanetScale control plane, runs in AWS us-east-1. Additionally, PlanetScale customers using SSO were unable to log in if they weren’t already logged in.

Finally, during this phase we were unable to post updates to https://planetscalestatus.com, though even if we had been able to, the site itself was unavailable for at least half an hour.

PlanetScale engineers investigated thoroughly but did not take any corrective actions during this phase of the incident. Service was restored at 9:30 UTC after upstream service providers recovered.

Phase 2

About half an hour later, at 10:05 UTC, PlanetScale engineers were alerted that one of our Kubernetes operators in one of our customers’ large single-tenant installations was exhausting all of its available resources. This typically means latency has increased or control plane requests are failing and being retried. Responders quickly identified that we were unable to launch new EC2 instances in us-east-1.
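
To make that failure mode concrete, here’s a minimal sketch (not PlanetScale’s actual tooling) of what an EC2 launch failure looks like at the API level using boto3; the AMI ID and instance type are placeholders.

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def try_launch(ami_id: str, instance_type: str) -> str | None:
    """Attempt to launch one instance; return its ID, or None if the launch fails."""
    try:
        resp = ec2.run_instances(
            ImageId=ami_id,            # placeholder AMI
            InstanceType=instance_type,
            MinCount=1,
            MaxCount=1,
        )
        return resp["Instances"][0]["InstanceId"]
    except ClientError as err:
        # During the incident, launches in us-east-1 failed at this step.
        # Error codes such as InsufficientInstanceCapacity or
        # RequestLimitExceeded are the kind of signal to alert on.
        print(f"launch failed: {err.response['Error']['Code']}")
        return None
```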

Customers could attempt to create or resize database branches but, because we could not launch new EC2 instances, these requests could not be completed; they remained queued until the incident was resolved. Their existing MySQL or Postgres servers remained available while requests to launch new EC2 instances were queued.

Given that the US East Coast was about to start its Monday, the inability to launch new EC2 instances presented a risk to some of our largest customers who use diurnal autoscaling for the vtgate component of their Vitess clusters. Some were heading into their weekly traffic peak with less than half the vtgate capacity they had the week before.

PlanetScale engineers made several interventions to minimize the number of EC2 instances we needed to launch in AWS us-east-1:

  • Temporarily disallowed creating new databases in AWS us-east-1 and changed the default region for new databases to AWS us-east-2.
  • Delayed scheduling additional backups and canceled pending backups that were waiting to launch an EC2 instance. (PlanetScale’s standard backup procedure launches an additional replica, which restores the previous backup and catches up on replication before taking a new one, so that backups don’t reduce the database’s capacity or fault tolerance; see the sketch after this list.)
  • Advised PlanetScale Managed customers using vtgate autoscaling to shed whatever load they could by, for example, delaying queue processing or pausing ETL processes.
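
The backup flow described in the second item above can be outlined roughly as follows; the helper names are hypothetical stand-ins for our internal orchestration, and the point is simply that a freshly launched replica, not an existing one, does the backup work.

```python
def run_backup(branch):
    """Hypothetical outline of the standard backup flow described above."""
    # 1. Launch an extra replica so existing replicas keep serving traffic.
    #    (This is the step that was blocked while EC2 launches were failing.)
    replica = launch_backup_replica(branch)         # hypothetical helper
    # 2. Seed it from the previous backup instead of copying from the primary.
    restore_backup(replica, latest_backup(branch))  # hypothetical helpers
    # 3. Let it catch up on replication until it is current with the primary.
    wait_for_replication_caught_up(replica)
    # 4. Take the new backup from this disposable replica, then release it.
    new_backup = take_backup(replica)
    terminate(replica)
    return new_backup
```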

We also took steps to avoid terminating any running EC2 instances:

  • Paused our continuous process of draining and terminating EC2 instances more than 30 days old.
  • Stopped terminating any EC2 instances that became vacant, instead holding them for reuse (sketched below).
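
As a rough illustration of the lifecycle changes above (assumed details, not PlanetScale’s actual code), the sketch below skips the usual 30-day recycling while a pause flag is set and tags vacant instances for reuse instead of terminating them.

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
MAX_AGE = timedelta(days=30)
RECYCLING_PAUSED = True  # flipped on during the incident to preserve capacity

def recycle_old_instances():
    """Drain and terminate instances more than 30 days old, unless paused."""
    if RECYCLING_PAUSED:
        return
    now = datetime.now(timezone.utc)
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if now - instance["LaunchTime"] > MAX_AGE:
                    drain_and_terminate(instance["InstanceId"])  # hypothetical helper

def hold_vacant_instance(instance_id: str):
    """Tag a vacant instance for reuse instead of terminating it."""
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "held-for-reuse", "Value": "true"}],  # illustrative tag
    )
```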

The most important intervention, though, was to temporarily change how we schedule vtgate processes for customers with autoscaling configured. We bin-packed vtgate processes more tightly than usual, running closer to CPU capacity than is typical, in order to provide ample capacity for the US work day.
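
To make the bin-packing change concrete, here is a small sketch of the capacity arithmetic with illustrative numbers (not PlanetScale’s actual requests): lowering the CPU requested per vtgate process lets the scheduler place more of them on each node, at the cost of less headroom per process.

```python
def vtgates_per_node(node_cpus: float, cpu_request_per_vtgate: float,
                     reserved_cpus: float = 1.0) -> int:
    """How many vtgate processes fit on one node, given per-process CPU requests."""
    return int((node_cpus - reserved_cpus) // cpu_request_per_vtgate)

# Illustrative numbers only: a 16-vCPU node with 1 vCPU reserved for system
# processes fits 3 vtgates at a 4-vCPU request, but 7 at a 2-vCPU request.
print(vtgates_per_node(16, 4.0))  # 3
print(vtgates_per_node(16, 2.0))  # 7
```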

Alongside the issues launching EC2 instances, we observed partial network partitions in AWS us-east-1. Impact appears to have begun around 14:30 UTC, though we can’t tell whether fewer queries reached PlanetScale because of a network partition or because fewer queries were sent due to upstream customer impact. These partitions healed gradually between about 18:30 and 19:30 UTC. During this time, some database servers were reachable from the Internet but couldn’t communicate across availability zones for query routing, replication, or both. Some replicas could reach container registries when they started up but could not replicate from their primary MySQL or Postgres server. Some servers had trouble resolving internal DNS names, and others had trouble connecting to the internal services those DNS names resolved to.

The network partitions caused a significant percentage of some customers’ queries to fail. Not all database branches were affected, as the impact depended heavily on which availability zones were in use and whether traffic was crossing between zones. Where possible, we manually sent reparent requests to move primary databases to availability zones known to be healthier or known to be colocated with the customer’s application.
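
For context on what such a reparent looks like with Vitess tooling, the sketch below shells out to vtctldclient’s PlannedReparentShard command, which performs a graceful primary change. The server address, keyspace/shard, and tablet alias are placeholders, and the exact flag names should be checked against the vtctldclient version in use.

```python
import subprocess

def reparent_to_healthier_zone(vtctld_addr: str, keyspace_shard: str,
                               new_primary_alias: str):
    """Ask Vitess to promote a replica in a healthier availability zone."""
    subprocess.run(
        [
            "vtctldclient",
            f"--server={vtctld_addr}",             # vtctld gRPC address (placeholder)
            "PlannedReparentShard",
            keyspace_shard,                        # e.g. "commerce/0" (placeholder)
            f"--new-primary={new_primary_alias}",  # tablet alias in the target zone
        ],
        check=True,
    )
```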

Once the network partitions healed, we found a small number of processes (PlanetScale’s edge load balancer as well as vtgate) which were not able to recover on their own due to the way they experienced the network partition. We restarted these and restored service.

PlanetScale’s incident commander declared the incident was resolved at 20:32 UTC.

Reflecting on our resilience

PlanetScale weathered this incident well. Strong separation of control and data planes meant an outage in our control plane did not affect our customers’ databases. Redundancy and battle-tested automated failover allowed primary database servers to move to the majority side of network partitions. Careful zonal traffic routing kept as much traffic as possible away from the network partitions.

But every incident offers opportunities for improvement. We are taking steps to better understand, and become resilient to, the failure modes of the SaaS products we depend on, including those for CI/CD, SSO, web application hosting, and incident communication.

We are investigating more ambitious ways to reduce our runtime dependence on both internal and AWS services.

Network partitions are one of the hardest failure modes to reason about, test, and tolerate. Per AWS’s Well-Architected Framework, using three availability zones allows us to tolerate the failure of one, but only if network connectivity between the other two remains reliable. AWS us-east-1 happens to have six availability zones, and we’re looking into how PlanetScale can better use them all to become more resilient to both zonal outages and network partitions between them.
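
As a small worked example of that reasoning, assuming a simple majority-quorum model with one replica per zone (a simplification, not a description of any specific PlanetScale component): with three zones, losing one zone still leaves a majority, but a partition that also splits the remaining two zones leaves no side with one; spreading across more zones leaves more room.

```python
def has_quorum(zones_total: int, zones_reachable: int) -> bool:
    """Majority quorum, assuming one replica per availability zone."""
    return zones_reachable > zones_total // 2

# Three zones: losing one zone is fine...
print(has_quorum(3, 2))  # True
# ...but if the two surviving zones also partition from each other,
# neither side sees a majority.
print(has_quorum(3, 1))  # False
# Spread across five of us-east-1's six zones, a one-zone outage plus a
# partition isolating one more zone can still leave a majority intact.
print(has_quorum(5, 3))  # True
```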

If you want to read more about how we engineer for resilience, read PlanetScale’s Principles of Extreme Fault Tolerance.