Database Sharding
By Benjamin Dicken |
What is sharding
Sharding is the process of scaling a database by spreading out the data across multiple servers, or shards. Sharding is the go-to database scaling solution for many large organizations managing data at petabyte scale. Your favorite companies like Uber, Shopify, Slack, and Cash App all use sharding with Vitess and MySQL to scale their massive databases.
In this article, you'll learn how sharding works and considerations for designing a performant sharded database cluster. Along the way, you'll be able to interact with database cluster diagrams, giving you the opportunity to really let the concepts sink in via hands-on examples.
Note
At PlanetScale, we have tools like Cluster Configuration and Workflows that make sharding much easier than most other solutions. At PlanetScale, we offer powerful tools that allow you to shard tables straight from the UI using our sharding workflows. Built on top of Vitess, this allows you to horizontally scale without adding additional logic to your application code. Sharded tables appear as a single, unified table to your application. Sign up to try it out yourself or check out our video walkthrough if you want to see how it works on PlanetScale.
Sharding basics
Most small-scale web applications will have one or more application servers that connect to a single, monolithic database server. The applications store all persistent data on this single server, and send queries to it to meet application needs. This includes user account information and whatever other data the application needs in order to operate.
Below is an interactive database cluster with this setup. There is an application server on the left and a single database server on the right. Try clicking insert row and select row to see data get added to and retrieved from the database.
Each row inserted has a user_id
, name
, and age
respectively. We'll use rows with this formatting for several examples going forward.
As a quick aside, you can configure the speed of animations in this post by selecting your desired speed below.
The architecture described above works well for low-demand systems that only need to make a few hundred or a few thousand queries per second.
However, popular software apps often have hundreds of thousands of concurrent users. In some cases, these applications need to store petabytes of information in the database and process millions of queries per second at peak hours. Such huge workloads demand spreading out our database across many servers, rather than just one. Sharding is a popular solution to this problem.
In a sharded database, we will have multiple separate database servers, each with a portion of the total data. Below you can see a simple configuration with an application server that can send queries to one of two database servers (shards). Try inserting and selecting rows again. For any of the remaining interactive visualizations, you can also click on a row in a shard to delete it or click on a shard's label to empty it completely
With this setup, the code running on the application server has to be aware of all of the shards, know which rows are stored where, and keep a connection open to each. This is not a huge problem with only two shards, but it becomes complex when there are hundreds of them. Storing this logic in the application code can quickly become messy and difficult to maintain.
A better option is to have the app servers connect to an intermediary server that we call a proxy. When an application server needs to use the database, it will send queries to a proxy. The proxy is then responsible for routing the query to the correct shard server
We have such a configuration below. Go ahead and click the insert row button to add rows to the shards. You'll see visually how the proxy server is used to manage the inserts.
Note
PlanetScale builds its sharding solution on top of Vitess. In Vitess, these proxy servers are known as Vitess Gates
, or VTGates
for short.
After filling up the shards, you can see that there are 6 rows in total, each database server storing 3. Of course, a real sharded database would store many millions or billions of rows per server, but we keep it small here for the sake of learning.
If you inserted more than six rows, you may noticed one or more of the shard servers pulse red indicating that they are being overloaded. If you inserted really fast, you may have seem the proxy server pulse red as well. Server overloading can happen when the query demand is too high or if the server is nearing its storage capacity. As the amount of data grows, we can add more shards to support it. This is a process known as resharding. Let's increase this database cluster to have three shards instead of two.
It's also important to consider that the proxy needs time to process a query as well. Consider the below sharded configuration. We have lots of database capacity, but only one proxy server. Try inserting a bunch of rows in quick succession.
You should see that the proxy flashed red, and inserts were queued up beneath it rather than all being handled at once. This is because the proxy server hit the capacity for simultaneous queries it could process, and had to queue up other inserts. This added latency would be unacceptable for a production database system. To get around this, we can add more proxy servers. Try clicking insert row multiple times again.
Since we have more proxy capacity, we have reduced the likelihood of needing to queue a query. When designing a large-scale sharded database system, it's important to understand the demands and choose the correct number of proxies and shards to handle the workload, even at peak times.
Note
Without the right tooling, the process of migrating a database from unsharded to sharded can be challenging. Some organizations spend months transitioning from an unsharded to sharded architecture. Thankfully, PlanetScale has built tools like Cluster Configuration and Workflows that make sharding a MySQL database easy, even if that database is serving live, production traffic.
Sharding strategy
Now you know the basics of what a sharded database is. The next question is: how does the data get sharded?.
One of the most important considerations in a sharded database setup is the sharding strategy. The sharding strategy is the set of rules used to determine which rows of data go to which shards. In the above examples, the shard each row was sent to was chosen at random. Typically, your sharding strategy will involve selecting a shard key: the column(s) that you use to determine where each row will reside. The sharding strategy and shard key you choose have a huge impact on how evenly the data is distributed across shards as well as query performance.
Let's consider several options and weigh their pros and cons.
Range sharding
In a Range sharding strategy, the proxy layer decides where each row should go based on pre-defined ranges of values.
For example, say we have four shards and want to use the user_id
(the number in the first column of each row) as our sharding key. In such a setup, we could say that all rows with ID 1-25 go to the first shard, 26-50 to the second shard, 51-75 to the third shard, and 76-100 to the fourth. The database cluster below is configured in such a manner. Try inserting some rows What do you notice?
The first 25 inserts all go to the first shard, leading to one hot shard (a shard that is over-worked) and three other cool shards (under-worked). If we continue inserting, the same problem arises for all the other shards. Using naive range-based sharding with IDs is generally a bad idea if our IDs are monotonically increasing, as we have here.
Let's try range sharding with the second column, name
. Here, we'll put all names a-f on the first shard, g-m on the second, n-u on the third, and v-z on the fourth. Let's try more insertions
Uh-oh, more problems! None of our users have names in the v-z range, leading to a wasted shard. Such a sharding solution only works well if our users have names that are perfectly evenly distributed across the alphabet. This is rarely true in practice. The letters that names begin with varies based on factors like name popularity and nationality.
Finally, let's try using range sharding on this third column, age
. Here, we'll put all ages 0-24 on the first shard, 25-49 on the second, 50-74 on the third, and 75-200 on the fourth. Go ahead and insert a handful of rows
Again, we run into problems here. The vast majority of our users are between 25-74 years of age. Two of our shards are hot with lots of traffic while the other two are quite cold. We could tweak these ranges to better fit our audience, but what happens as the ages of our users shift over time? There must be a better way.
Hash sharding
One of the most popular sharding strategies is hash sharding. In hash sharding, we choose a column to be our shard key and we generate a cryptographic hash of this value for each row that needs to be inserted. Each shard is responsible for storing the rows for a range of hashes, and this process is controlled by the proxy servers.
For the purpose of example, let's say we will run the shard key column through a simple algorithm that always produces a hash between 0 and 100. Hashes 0 - 25 go to the first shard, 26-50 go to the second, 51-75 to the third, and 76-100 to the fourth. The nice thing about hashes is that similar inputs can produce very different outputs. We might pass in the name "joseph"
and get hash 45
, but the name "josephine"
produces 28. Similar names, completely different hashes. This means similar values may end up on totally different servers, a good property to help the data get evenly spread out.
Below we have a server configured to use hash sharding on the name
column. Insert a few rows to see how things get distributed.
This seems to work well, but is it optimal? With hash sharding, how do we know what column to choose as our shard key?
Ideally, we want something with high cardinality. The name
column is not the ideal choice. There may be very popular names, and even with hashing we might end up with hotter servers than others.
Often a column like user_id
is a good choice because each value is unique. We also get the added benefit of hash speed. It's faster to hash a fixed-size integer as compared to a variable-width name string.
Here's an example of using hash sharding on the id column instead.
We still achieve good data distribution, but now with a high-cardinality and fixed-size value.
Other strategies
Range and hash sharding are good choices that cover many scenarios. However, there are other options. Software like Vitess also supports lookup sharding and even custom sharding functions.
In lookup sharding, the developer can set up a table that contains information needed to map incoming data to the appropriate shard. This table is referenced when queries for such tables come in.
Vitess also supports custom sharding functions, allowing developers to write their own functions that take column(s) as input, and produces custom output to let the system know which shard(s) need to be used.
Cross-shard queries
Another important consideration when designing a sharded database cluster is how many shards are required to fulfill each query. The ideal is that most individual queries can be handled by a single shard. Consider the following query to the user
table with a user_id
, name
, and age
column as described earlier:
SELECT name, age FROM user WHERE user_id = 1;
Assuming user_id
is our primary key, this query will only select one row, which should only live on one shard. That means that the proxy layer only needs to ask one shard to get the answer for the query. If you tried clicking the select row button in some of the earlier interactive examples, you should have already seen this as it was only selecting individual rows.
Let's now consider an additional steps
table in our database. This table is used for storing per-day step-count statistics tracked by a health app on our user's phones and watches. Each row has a user_id
, step_count
, and date
, respectively. We also want to shard this table.
A common use for this data may be for a user to request their step history to view it on a line chart. In SQL, the query to fetch this data would look like this:
SELECT step_count, date FROM steps WHERE user_id = 1;
This query may return many rows. How many shards will be needed for the proxy to find all these rows? It depends on how we sharded the steps
table to begin with!
Let's consider what would happen if we had done range sharding on the step_count
column (the second column). Try inserting some rows and then click select where id == 1 to get all of the entries for user with id = 1
.
As you can see, the SELECT
query may have to spread across multiple shards to make this work! This is known as a cross-shard query. Cross-shard queries are bad for system performance, and should be avoided whenever possible. When multiple shards need to fulfill a single query, this adds excessive network and CPU overhead to the database cluster.
A better solution would be to shard based on user_id
. That way, all entries for a given user will live on the same shard, which will allow us to avoid this cross-shard query. Give it a go below:
This will lead to much better performance for this query.
Updates
Yet another thing to consider when deciding on a column to be your shard key is update frequency. Any time the value in a column of a shard key is changed, it may need to be moved from one shard to another in order to maintain the integrity of our sharding strategy.
Consider again the choice between using step_count
or user_id
as a shard key. A step_count
is a column that may be volatile. Throughout a day, the step count will change as the user continues to walk. We may even want to give a user the ability to manually update the step_count
of previous days. Each time a change occurs, the database cluster has to re-evaluate which shard the row belongs on. This means a single row may end up moving around to different shards over time.
Compare that to sharding on user_id
. Once we give a user a unique identifier, it is rarely changed. Therefore, once we determine the shard for a row, it will stay on that shard. The only time we will need to move it is in situations where we need to grow or shrink our number of shards.
Always take time to consider the volatility
of a column before selecting it as a shard key.
Latency
Adding a proxy layer does come with a downside: added latency. By introducing the proxy, there is an additional network hop for requests coming in to our database. Consider a request from the app to the database and back without a proxy.
Now, compare that to the time it takes to make this same round-trip but with a proxy in the middle.
Clearly, it takes longer!
However, this problem can be minimized with proper consideration for server location. If the proxy and shards all live in the same data center, the added latency can be brought down to 1ms or less. For the vast majority of applications, adding 1ms is worth the scalability achieved with the sharded architecture. For example, Slack runs massive sharded database cluster with Vitess, and reports an average query latency of only 2ms.
Data durability
A replica is a database server that is connected to the primary database server and replicates all data from it. Whether you have gigabytes or petabytes of data, running replicas is always a good idea.
For one, they increase the durability of your data. If the main server goes down, you still have copies of your data on the replicas. Adding replicas decreases the possibility of losing your database due to hardware failure.
Below is an example of a sharded database cluster that is using replication. Each shard has a single primary and then two additional replicas per-shard for data durability.
Replicas are also useful for keeping your system highly available. If your primary server goes down and you have no replicas, your application could experience hours or even days of outage while the server is fixed or replaced and then brought back online. If you have replicas configured, traffic can be switched over to one of them, getting your app back in the action immediately.
PlanetScale runs on the open-source sharding solution Vitess. Vitess allows you to build sharded database clusters with customizable number of replicas per-shard. It also has an orchestration component, which can automatically detect server failures and quickly replace downed primaries. This keeps your data safe while maintaining high-availability.
Fast Backups
For large-scale databases, backup time can easily get out of hand.
Say you have an unsharded 4 terabyte database running on a single server. Now, say our network dictates that a backup can run at 100 MB/s per-server. In this case, backing up this database will take (4TB / 100MBps * 60s * 60m) = ~11
hours. That's a long time!
Hit the run backup button below to see a small-scale example of this.
Alternatively, what if this 4TB was spread out across 4 shards storing 1TB each? Each individual server can capture a backup simultaneously at 100MBps, allowing the cluster to back up the data at 400MBps, taking only ~2.7
hours. Hit the run backup button again.
Notice how much faster the backup completes. This is all thanks to spreading our data out across multiple shards
For more details about how sharding help switch backup performance, check out this blog post. Sharding your database comes with a number of benefits, not all of which are covered here. Check out our other blog post for more details.
Conclusion
Sharding is an excellent solution for scaling a database system. However, getting a sharded database to perform well requires careful attention to sharding strategy, shard key selection, and query optimization. The concepts covered here lay the foundation to help you build a quality sharded database using technologies like Vitess and PlanetScale.
If you're interested in getting up and running with a sharded database, you can sign up for a PlanetScale account and try it out yourself by following our Sharding quickstart. If you'd like hands-on support or to explore a custom plan with consumption discounts, contact us.