Upgrading Query Insights to Metal

By Rafer Hazen | March 11, 2025

Since PlanetScale Metal is now generally available, I wanted to share a post describing our experience migrating the PlanetScale database that powers the Query Insights feature to Metal.

Data collection in Query Insights

First, a bit about how we collect the data for Query Insights. The basic steps are:

1. Collect query-pattern telemetry from the Vitess layer of your PlanetScale database.
2. Publish the data to Kafka.
3. Consume the data from Kafka and write it to several MySQL tables, aggregated by time.

The primary scalability concern in the Query Insights pipeline is ensuring that we can process and write data to the database quickly enough to keep up with the inbound volume. To accomplish this, we read Kafka messages in batches, coalesce data in memory to avoid unnecessary writes, and hand the writes off to a thread pool in each Kafka consumer.
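As a rough illustration of the batch, coalesce, and hand-off pattern described above, here is a minimal Python sketch. It is not the actual Insights consumer: the message fields (`pattern`, `ts`, `count`, `total_ms`), the 60-second bucket size, and the `write_row` placeholder are all assumptions made for the example, and the Kafka read itself is left out.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

BUCKET_SECONDS = 60  # assumed aggregation window; the real value is not stated


def coalesce(batch):
    """Merge messages that share (query pattern, time bucket) into one row."""
    merged = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for msg in batch:
        bucket = msg["ts"] // BUCKET_SECONDS * BUCKET_SECONDS
        agg = merged[(msg["pattern"], bucket)]
        agg["count"] += msg["count"]
        agg["total_ms"] += msg["total_ms"]
    return dict(merged)


def write_row(key, agg):
    """Hypothetical stand-in for one INSERT/UPDATE statement against MySQL."""
    return key


def process_batch(batch, pool, write=write_row):
    """Coalesce one batch, then issue one write per aggregated row on the pool."""
    rows = coalesce(batch)
    futures = [pool.submit(write, key, agg) for key, agg in rows.items()]
    # Wait for every write so the caller knows when the batch is fully written.
    return [f.result() for f in futures]


if __name__ == "__main__":
    batch = [
        {"pattern": "SELECT * FROM t WHERE id = ?", "ts": 120, "count": 2, "total_ms": 4.0},
        {"pattern": "SELECT * FROM t WHERE id = ?", "ts": 150, "count": 3, "total_ms": 6.0},
        {"pattern": "UPDATE t SET x = ? WHERE id = ?", "ts": 130, "count": 1, "total_ms": 1.5},
    ]
    with ThreadPoolExecutor(max_workers=25) as pool:
        print(process_batch(batch, pool))
```

In this sketch, coalescing turns three messages into two writes, because the two `SELECT` messages fall into the same time bucket. The larger the batch and the more repetitive the query patterns, the more writes this step saves.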

Pre-Metal: Write-heavy with provisioned IOPS

The result is that the Query Insights database is very write-heavy. As of this writing, we execute approximately 10k UPDATE/INSERT statements per second. These writes come from 32 consumer processes, each with 25 writer threads for a total max concurrency of 800 threads.

The Query Insights PlanetScale database has 8 shards and, prior to our upgrade to Metal, we'd had to provision more IOPS to the EBS volumes backing MySQL in our sharded keyspace to keep up with the telemetry volume. Since this workload had demonstrated a sensitivity to I/O latency, we figured it would be a good candidate for upgrading to Metal.
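To see why write latency matters at this concurrency, a back-of-the-envelope calculation with Little's law (average concurrency = throughput × average latency) is useful. The sketch below uses the figures quoted above and assumes each writer thread has at most one statement in flight at a time. The even split across shards is also an assumption, made only for illustration.

```python
CONSUMERS = 32
THREADS_PER_CONSUMER = 25
WRITES_PER_SECOND = 10_000  # approximate figure quoted above
SHARDS = 8

# Upper bound on statements in flight, assuming one statement per thread.
max_concurrency = CONSUMERS * THREADS_PER_CONSUMER


def latency_budget_ms(concurrency, throughput_per_s):
    """Little's law rearranged: highest average latency that still sustains
    the given throughput at the given concurrency."""
    return 1000 * concurrency / throughput_per_s


budget_ms = latency_budget_ms(max_concurrency, WRITES_PER_SECOND)

# Assumes writes are spread evenly across shards, which real traffic may not be.
writes_per_shard = WRITES_PER_SECOND / SHARDS

print(max_concurrency, budget_ms, writes_per_shard)
```

Under these assumptions, the consumers can only keep up if the average time per statement stays below roughly 80 ms. Any increase in I/O latency eats directly into that budget and shows up as Kafka consumer backlog.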

Performance improvements after migrating to Metal

To test this, we picked 1 of our 8 MySQL shards, the busiest one, to upgrade first.

The following graphs show the query latency at various percentiles. The lines show the latency for the 8 primaries of the Insights database. The purple line corresponds to our busiest shard, which was upgraded to Metal around 19:35.

[Figure: Insights Metal upgrade, shard latency at P50]

[Figure: Insights Metal upgrade, shard latency at P90]

[Figure: Insights Metal upgrade, shard latency at P95]

[Figure: Insights Metal upgrade, shard latency at P99]

Upgrading the test shard to Metal caused a substantial decrease in latency across all measured percentiles. After the upgrade, our busiest shard, which previously had the highest latencies, started executing queries faster than the other shards by a significant margin.

After letting the first upgrade soak for a few days, we upgraded the remaining shards and saw nearly identical improvement in performance.

Without making any changes to our application, architecture, or sharding configuration, we were able to realize substantial performance improvements by upgrading to PlanetScale Metal. This resulted in a lower average backlog in our Kafka consumers, and has given us additional capacity to handle increasing message volume in the future.