If you’ve used Google or YouTube, you’ve probably accessed sharded data. Sharding helps you scale out your database by storing partitions of your data across multiple servers instead of putting everything on a single giant one. This post will walk through how database sharding works, how to think about implementing your own sharded database, and some useful tools out there that can help, with a particular focus on MySQL and Postgres.
Scene: you’ve upsized your MySQL on RDS instance for the 3rd time this quarter and your CFO just put 30 minutes on your calendar to “chat budget.” It might be time to scale out instead of scaling up!  Read replicas in RDS seem straightforward enough, but reading data is only half of the problem. What is an overwhelmed developer to do?
Sharding — a term that probably originally came from a video game — is how you scale out relational databases. You’ve probably seen this table before, about how scaling out helps you take this users table, all stored on a single server:
And turn it into this users table, stored across 2 (or 1,000) servers:
But that’s only one type of sharding (row level, or horizontal). There are tons of different ways to split up your data across servers to best match how your business and data model works. Vertical sharding, for example, is when you split things at the schema or table level. More on this later!
Partitioning has existed – especially in OLAP setups – for a long time, primarily as a mechanism for improving query speed. Nightmares of sifting through HDFS partitions to find the missing snapshot pervade my sleep schedule... Anyway, sharding takes that concept and applies it to distributed systems: in addition to splitting up data into logical groups, let’s put those groups across multiple servers that talk to each other. Even Oracle does it!
For as long as relational databases have existed, they’ve been designed to run on a single server. Partially because of that, and partially because of fundamental laws of physics, sharding your data properly is, uh, not very easy.
To shard your database, you’ll need to do a few things:
- Decide on a sharding scheme — What data gets split up, and how? How is it organized?
- Organize your target infrastructure — How many servers are you sharding to? How much data will be on each one?
- Create a routing layer — How does your application know where to store new data, and query existing data?
- Planning and executing the migration — How do you migrate from a single database to many with minimal downtime?
There’s no hard and fast playbook for each; everyone’s data model and business constraints are different. Let’s dive in.
How you decide to split up your data into shards – also referred to as your partition strategy – should be a direct function of how your business runs, and where your query load is concentrated. For a B2B SaaS company where every user belongs to an organization, sharding by splitting up organization-level data probably makes sense. If you’re a consumer company, you may want to shard based on a random hash. Notion manually sharded their Postgres database by simply splitting on team ID. All of this is to say that sharding can be as simple or as complicated as you make it.
With that in mind, there are a few popular “algorithms” to decide which rows are stored together and on which servers:
- Hash based sharding (also known as key based) – Take a value from the row, hash it, and send buckets of hashes to the same server. Whichever column you choose to hash is your shard key.
- Range based sharding – Pick a column, create ranges, and allocate shards based on those ranges. Most useful for numerical columns that are (somewhat) randomly or evenly distributed.
- Directory based sharding – Pick a column, allocate shards manually, and maintain a lookup table so you know where each row is stored.
If your sharding scheme isn’t random (e.g. hash based), you can begin to see why query profiling and understanding how your load is distributed can be useful.
Imagine you’re Amazon, and you want to shard your MySQL database that stores customer orders. On the surface, there seems to be no meaningful clustering: sure, you’ve got customers who order a lot of stuff, but that volume (and the associated reads during the shopping process) are basically random. It might make sense to use hash based sharding, and use the order ID as the shard key.
A big part of your sharding scheme is considering which tables are stored together. Joins across databases in a distributed system are difficult and costly, so ideally all of the data you need to answer a particular query exists on the same physical machine. For Amazon, that means the
orders table and the
products table containing the products in the
orders table need to be physically colocated. This also requires incremental maintenance: if a customer makes a new order, the product data for that order needs to be included in the new shard so it can be read quickly later on.
Sharding maintenance is an oft underappreciated piece of scaling out your relational database. Depending on what your partition strategy is, you’ll likely end up with hotspots, where a particular server in your cluster is either storing too much data or handling too much throughput. In our Amazon example, it could be because a large business started ordering a metric-ton of stuff, and all of their data is on one server. Managing those hotspots, redistributing data and load, and reorganizing your partition strategy to prevent future issues is part of what you’re signing up for when you shard.
With your sharding scheme set, it’s time to decide on how many machines you want to store data on, and how big you need them to be. There’s no formula here; this decision depends on your budget, projections for future database load, cloud provider, etc.
A common approach is maximizing flexibility. Start with a small number of hosts, and add more as needed. To maintain an even distribution of shards across your servers, you’ll need to re-balance every time you add a host. This is why companies like to choose a number of shards that’s divisible by a lot of smaller numbers; it allows you to scale out the number of servers incrementally while maintaining that smooth, even distribution.
With your data distributed across multiple databases (imagine 20 of them), how does your application know which database to query? You need to build some sort of routing layer that decides. But how?
For those building sharding from scratch, the most common answer is in the application layer. You need to build logic into your application code that decides which database (and schema) to connect to for a particular query, conditional on the data inside that query and where it belongs in your sharding scheme. The logic looks something like:
if data.sharding_key in database_1.sharding_keys:…connect to database_1else if data.sharding_key in database_2.sharding_keys:…connect to database_2
Depending on how you’ve partitioned your data and the number of physical machines / databases you’re working with, this logic can be relatively simple and stored in a JSON blob, config file, etc. More commonly, teams will use some sort of key value store or a lookup table in a database. The important thing is to have the information that ties a piece of data to its destination encoded somewhere so your application knows where to issue the query.
Building this for the first time is actually not that difficult; it’s the operational maintenance that becomes the real problem over time. If you move shards from database to database, rebalance, add new machines, remove machines, change any database properties…you’ll need to update that application logic to account for it. ProxySQL isn’t a full fledged solution for this, but it could be classified as a rough “shard routing” service.
Once you’ve taken care of all of the above and have your physical servers running with empty databases on them, plus a plan for routing in your application logic, you’re faced with the age-old problem of how to migrate without (too much) downtime. Unlike a (potentially) more straightforward migration to a single new database provider, moving to sharding introduces a lot more things that can go wrong and in more ways.
Notion’s engineering team suggested a useful framework for thinking about the migration in their post about how they implemented sharding:
Double-write: Incoming writes get applied to both the old and new databases.
Backfill: Once double-writing has begun, migrate the old data to the new database.
Verification: Ensure the integrity of data in the new database.
Switch-over: Actually switch to the new database. This can be done incrementally, e.g. double-reads, then migrate all reads.
Each of these steps still introduces the possibility of downtime; it’s just a risk you’re going to have to take for changes at this scale.
Though many teams do build sharding for their database of choice from scratch, there is an ecosystem of tools, albeit perhaps less mature than the database software they’re built on.
Vitess was built at YouTube when they needed to shard MySQL, and is now available to you and me. It’s basically a layer on top of MySQL that gives you sharding, and a lot of other neat stuff related to really big workloads: connection pooling, dynamic re-sharding and balancing, and monitoring tools, among other things. For a technical overview of how Vitess improves on vanilla MySQL, check out their comparison here.
As far as I’m aware, Vitess is the most mature and the most popular OSS sharding layer for a relational database. It served all YouTube DB traffic for years, and is in production at Slack, GitHub, NewRelic, Pinterest, Square, etc.
Citus does what Vitess does for MySQL, but for Postgres (minus some more flashy features). It’s open source, designed as a Postgres extension, and can be run as a single node or several. It’s in production at Algolia, Heap, Cisco, and a few more. Their docs have good general advice for picking your sharding scheme, Citus or otherwise.
I suppose the more fundamental question is: why are you not using a database that does sharding for you? Over the past few years the so-called “serverless” database has gotten a lot more traction. Starting with the infamous Spanner paper, many have been thinking about how running a distributed system should be native to the database itself, the foremost of which has been CockroachDB. You can even run cloud Spanner on GCP.
You’re reading this blog on the PlanetScale website. They sell a shard-native (did I just coin this?) database built on MySQL and Vitess. I’m not a PlanetScale employee, but I am a big proponent of what they’re doing, specifically shifting the focus in databases to developer experience instead of infrastructure maintenance.
The question is starting to become: if you’re paying someone like AWS to run your database for you, why are you busy figuring out how to scale out that database? And I think that’s a good question the major cloud providers should be asking themselves.
 There is no shortage of opinion pieces on the web arguing against premature sharding. This post assumes an educated reader who can judge when scaling out is the right decision and when it isn’t.