Scaling a database presents challenges. As you grow, you might begin having trouble handling ever-increasing throughput or data size. You might find that query latency is getting worse. You might be pushing the limits of your hardware. When this happens, a classic option is vertically scaling your database by getting better hardware, but is there a better way? And what happens when you reach the vertical limits?
This is where horizontal sharding comes in. In this article, we'll cover some common indicators that your database may be ready for horizontal sharding. We'll also look at some measures you can implement until then. Let's dig in.
There are lots of different limits that you can run into when you're scaling up. At the database level, you might be maxing out CPU, memory, disk space, or IOPS.
Running into these limits can have real consequences for your business. Database operations like schema changes will start taking longer, making it harder to ship new features. Query latency will increase, leading to sluggish responsiveness. As things get worse, hitting these limits leads to incidents and you start facing outages.
Before you begin to think about sharding, let's make sure you've first exhausted some of the other options.
After a single server is maxed out, you need to spread the load across more nodes. There are several approaches that you can use.
A tried-and-true method for scaling MySQL is using replicas for reads. In addition to setting up the replicas, this involves application changes to split reads and writes to different connection strings. Most web applications are very read-heavy, and this method allows you to continue scaling reads by adding more replicas.
After that, another strategy is adding more MySQL clusters by segmenting logical groups of tables. This means taking all of the tables used by a certain service or product area (for example, users or notifications) and separating them into a new cluster. We sometimes call this vertical sharding or vertical partitioning. The diagrams below show what it would look like to break a cluster containing
notifications tables down into vertical shards by moving
notifications to a separate cluster.
While vertical sharding is a viable option, it does come with some downsides. In addition to the application changes required for these new connection strings, there may be more complex changes to account for the fact that, without a framework like Vitess, you would be unable to perform JOINs between tables that now live on different servers.
Having broken down your databases into their smallest logical groups of tables, you may find yourself in a bit of a jam when one of those clusters starts hitting limits. This is where horizontal sharding comes in.
Horizontal sharding differs from the vertical sharding described previously. Instead of splitting up a cluster by moving whole tables elsewhere, with horizontal sharding each underlying cluster shares the same schema and has different rows distributed to it.
For a more in-depth guide about what sharding is and how it works, check out What is database sharding and how does it work on our blog.
Historically, you needed to be one of the largest webscalers in the world to require sharding, and when you hit those limits, you had to build it yourself. Examples of these include TAO at Facebook, Gizzard at Twitter, and Vitess at YouTube. Sharding was a last resort after you'd exhausted all other options and still needed to handle growth.
Today, we think about it differently. Since its creation at YouTube in 2011, Vitess has become a widely adopted open source solution that has made sharding much more accessible. Sharding is no longer a last resort, and in fact, if adopted earlier, can help you avoid other larger application changes.
So, how do you know when to shard your database? Some good indicators that it may be time to consider sharding are when you've started to max out data size, write throughput, and/or read throughput. Let's walk through each of these categories.
One of the original reasons to shard was because disks were not large enough to hold all of your data. These days, that's not the problem. For example, Backblaze has recently purchased over 40,000 16TB drives!
Data size can still be a driving factor for sharding though. One thing to consider is how large your working set is, and how much of that fits into RAM. As less of your active data fits in memory and more queries need to read from disk, query latency will increase.
Other database operations are also affected by the data size of a single MySQL cluster. The larger the database, the longer backups (and restores!) take. The same is true for other operational tasks like provisioning new replicas and making schema changes. This is the logic behind guidelines Vitess has made for shard sizing. Smaller data size per shard improves manageability.
Another reason to consider sharding is when you've maxed out the write throughput of your cluster. This can show up in a couple of ways.
When the primary is maxed on IOPS, writes will become less performant. Usually before that, however, replication lag becomes a problem. While there have been significant improvements in replication within MySQL clusters, there will always be a small amount of delay between the time the data is written to the primary and that same data is written to a replica. You may be depending on replicas being up-to-date for disaster recovery, or you may be using replicas to scale out your reads as discussed earlier. When replicas fall behind the primary, this can look like inconsistent or stale data to your users, and may also result in errors if your application expects to be able to read data that it has just written.
When you're hitting your write throughput limits, other database operations like schema changes and batch jobs will be slower as well.
While running out of read throughput capacity can be solved through read-write splitting and the addition of read replicas, that isn't without its own challenges. As mentioned in the previous section, replication lag can make this complex or lead to a poor experience for your users.
Typically, this is earlier than we often think about sharding. However, by scaling read capacity through horizontal sharding instead of by using replicas, application code does not need to account for the potential replication lag or that multiple connection strings need to be managed and utilized depending on the data set you are trying to access. Plus, sharding at this stage sets you up for future growth and you don't have to come back and shard later when write throughput or data size would otherwise become an issue.
Sharding can and should be considered as a solution not just for scaling large data sizes, but also for scaling throughput of reads and writes.
In addition to being able to handle larger workloads, sharding provides other benefits, including:
- Better cost optimization as you scale.
- Increased parallelization leading to faster backups, restores, and schema changes.
- Better SLAs by reducing the blast radius of various failure domains to a single shard.
- Predictable horizontal scalability.
If you're unsure whether or not you're ready to shard, don't hesitate to contact us. We'd love to hop on a call to discuss your workload and scaling options.