Selecting the right shard key is critical to a well-performing sharded database.
Say that you decided to shard the message table based on a hash of the content of the message itself. Every message the database receives would be hashed, and each server would be designated to handle some range of the hashed results. This technique might work well for evenly distributing the data across many servers. However, it would also lead to poor performance for some types of queries. Since the shard key has nothing to do with the username or user ID, the messages sent by any single user would be spread across multiple shards. If a user tries to view their full message history, the query to collect all of their messages would require work across all of those shards. If instead we had decided to shard based on the message sender ID, all messages sent by a given user would live on one server.
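To make the scatter concrete, here is a minimal Python sketch. The four-shard setup, the shard_for helper, and the message data are all made-up assumptions for illustration, not a real schema:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(value: str) -> int:
    # Hash the value and map it onto one of the shards.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

messages = [
    {"sender_id": "user_42", "content": "gg"},
    {"sender_id": "user_42", "content": "rematch?"},
    {"sender_id": "user_42", "content": "see you at the tournament"},
]

# Sharding on the content typically scatters one user's history
# across multiple shards.
print({shard_for(m["content"]) for m in messages})

# Sharding on the sender ID keeps the whole history on one shard.
print({shard_for(m["sender_id"]) for m in messages})
```

Hashing the content usually lands the three messages on several different shards, while hashing the sender ID always routes them to the same one.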
The shard key you choose can also have a big impact on how the data is distributed. Generally, you want each shard to store approximately the same amount of data. You want to avoid data hotspots where, say, one shard is storing a terabyte while your three other shards are storing much less. This situation could occur if you decided to shard the player table based on rank. In GalaxySurfer, players can have any rank between 0 and 1000. You could configure the player table to be sharded across four shards, with ranges of 0-249, 250-499, 500-749, and 750-1000. But what if the vast majority of your players have a low rank (0-249) and there are progressively fewer players at higher rankings? One shard may hold 500 gigabytes, the next 200, the next 100, and the last only 25 if there are not many high-ranking players.
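A quick simulation shows how this kind of skew plays out. The shard boundaries and the rank distribution below are assumptions chosen to mimic the scenario above, not numbers from a real deployment:

```python
import random

SHARD_RANGES = [(0, 249), (250, 499), (500, 749), (750, 1000)]

def shard_for_rank(rank: int) -> int:
    for i, (low, high) in enumerate(SHARD_RANGES):
        if low <= rank <= high:
            return i
    raise ValueError(f"rank {rank} out of bounds")

random.seed(7)
counts = [0] * len(SHARD_RANGES)
for _ in range(1_000_000):
    # Most players sit at low ranks; high ranks thin out quickly.
    rank = min(int(random.expovariate(1 / 150)), 1000)
    counts[shard_for_rank(rank)] += 1

for (low, high), count in zip(SHARD_RANGES, counts):
    print(f"shard {low}-{high}: {count:,} players")
```

With this distribution, the first shard ends up holding the large majority of the rows while the last holds only a sliver.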
In addition to data hotspots, there can also be query hotspots: the data may be well distributed, yet one shard receives far more requests than the others. For example, if one shard happens to hold the rows for your most active players, that shard will be hammered with queries even though it stores no more data than its peers.
Generally, we are going to shard a table based on one or more of the columns in that table. When selecting the column(s) and the sharding strategy, there are a few guidelines to consider.
One consideration is the cardinality of the column. Choosing a column with a large number of unique values helps lead to an even data distribution.
You should also have a good understanding of your query set, and select a shard key that aligns with the most common queries sent to the database.
Finally, the shard key should be a value that changes rarely, if ever. If the value in the column you are sharding on can change later, the row would need to be moved to a different shard whenever it does.
Range-based sharding, the method used in the rank example above, is simple to understand but prone to hotspots. With this method, you specify the range of shard key values that each shard in your cluster is responsible for.
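The routing itself can be a small lookup over sorted boundaries. A sketch, assuming the same four rank shards as before:

```python
import bisect

# Shard i owns keys below BOUNDARIES[i]; the last shard owns
# everything from the final boundary up. These values are assumed.
BOUNDARIES = [250, 500, 750]

def range_shard(key: int) -> int:
    # bisect_right counts how many boundaries the key has passed,
    # which is exactly the index of the shard that owns it.
    return bisect.bisect_right(BOUNDARIES, key)

assert range_shard(0) == 0
assert range_shard(249) == 0
assert range_shard(250) == 1
assert range_shard(999) == 3
```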
Another option that generally makes for a solid choice is hash-based sharding. Here, one or more columns are run through a hash function to determine which shard a given record belongs to. The hash function must be deterministic: it must consistently produce the same result for a given input, or the same row could be routed to different shards at different times.
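A sketch of a deterministic hash router, assuming four shards and a string key. Note the use of a cryptographic digest rather than Python's built-in hash(), which is randomized per process for strings:

```python
import hashlib

def stable_shard(key: str, shard_count: int = 4) -> int:
    # A digest like SHA-256 is deterministic across processes and
    # restarts, so the same key always routes to the same shard.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

# Python's built-in hash() would NOT work here: string hashing is
# randomized per process, so a key could map to a different shard
# after a restart.
assert stable_shard("user_42") == stable_shard("user_42")
```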
Another option is a directory-based sharding scheme. In directory-based sharding, a separate service, or a table within your database, is used to look up which shard a given row lives on based on the shard key.
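A minimal sketch of the lookup, assuming the directory is an in-memory mapping. In practice it would be a database table or a dedicated service, and the shard names here are invented:

```python
# The directory maps each shard key to the shard that owns the row.
directory = {
    "user_42": "shard-a",
    "user_43": "shard-c",
    "user_44": "shard-a",
}

def directory_shard(shard_key: str) -> str:
    # Every query pays an extra lookup, but a row can be relocated
    # by updating a single directory entry.
    return directory[shard_key]

print(directory_shard("user_43"))  # shard-c
```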
Geographic sharding is a good solution if your users are globally distributed. Geosharding means selecting a key based on the geographical location of a user. When using this method, your shards are typically located within those geographical regions as well, reducing latency since the data is physically closer to the users.
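A sketch of region-to-shard routing, with made-up region codes and shard names:

```python
# Each region maps to a shard hosted in that geography.
REGION_SHARDS = {
    "na": "shard-us-east",
    "eu": "shard-eu-west",
    "apac": "shard-ap-south",
}

def geo_shard(user_region: str) -> str:
    # Fall back to a default shard for regions without a nearby cluster.
    return REGION_SHARDS.get(user_region, "shard-us-east")

print(geo_shard("eu"))  # shard-eu-west
```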
This is not a comprehensive list of sharding options, but hopefully it gives you a good sense of some of the options available to you.
PlanetScale uses Vitess, which has several built-in options for sharding, as well as the ability to configure custom sharding. For more information about sharding with Vitess, see the documentation page.