Vitess has native support for sharding data across many MySQL instances. Not only that, it gives you the flexibility to customize sharding in a number of ways, whether that be simple hash-based sharding, custom sharding functions, geographic sharding, or something else. In this lesson, let's go over a simple example of how you can get a sharded cluster set up in Vitess.
In the vitess/examples/local directory, there is a sequence of scripts beginning with 3 that allows you to quickly get up and running with a local sharded database. These scripts assume that you have already executed the 100-level and 200-level scripts, so if you have not done so, run them all in sequence first.
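If you want to see where everything lives before starting, the scripts sit next to each other in that directory, and a quick listing makes the ordering obvious:

# The 100-level scripts bring up the initial cluster, the 200-level scripts
# move the customer tables into their own keyspace, and the 300-level scripts
# (covered here) shard that keyspace.
cd vitess/examples/local
ls [123]*.sh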
The 301_customer_sharded.sh script runs a sequence of vtctldclient commands that prepare the schema of your database for sharding. The first step is to create two new tables, customer_seq and order_seq. These tables are used to store sequences of IDs for the sharded customer and corder tables. When sharding data across multiple MySQL instances using a hash of an ID, it is useful to have designated tables that manage the ID sequences, and that is exactly what these will be used for.
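A sequence table in Vitess is just a regular table living in an unsharded keyspace (here, the original commerce keyspace from the earlier scripts), marked with a vitess_sequence comment and seeded with a starting row. The following is a rough sketch of what the 301 script sets up; the exact SQL and starting values may differ from the script:

# Sketch of the sequence tables backing the sharded IDs. They live in the
# unsharded commerce keyspace and are marked with the vitess_sequence comment.
vtctldclient ApplySchema --sql "
  CREATE TABLE customer_seq (id INT, next_id BIGINT, cache BIGINT, PRIMARY KEY (id)) COMMENT 'vitess_sequence';
  INSERT INTO customer_seq (id, next_id, cache) VALUES (0, 1000, 100);
  CREATE TABLE order_seq (id INT, next_id BIGINT, cache BIGINT, PRIMARY KEY (id)) COMMENT 'vitess_sequence';
  INSERT INTO order_seq (id, next_id, cache) VALUES (0, 1000, 100);
" commerce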
After getting this set up, two ApplyVSchema commands are run. These are used to tell Vitess that the two previously-created tables should be used for the auto-incrementing IDs, and to inform Vitess how we want the customer and corder tables to be sharded. In this example, both will be sharded based on a hash of the ID of each row. The resulting hash is used to determine which shard each row is sent to: half of the hash range goes to one shard (-80), and the other half goes to the other (80-). In our example we will only be creating two shards, but this same technique can be used for clusters that have hundreds or thousands of shards.
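For a concrete picture, the VSchema for the sharded keyspace looks roughly like this. It is a sketch based on the Vitess documentation rather than a verbatim copy of the script, so column names and exact layout may differ:

# Rough sketch of the sharded VSchema: a hash vindex on the ID column of each
# table, plus the sequence tables wired up for auto-increment.
vtctldclient ApplyVSchema --vschema '{
  "sharded": true,
  "vindexes": {
    "hash": { "type": "hash" }
  },
  "tables": {
    "customer": {
      "column_vindexes": [{ "column": "customer_id", "name": "hash" }],
      "auto_increment": { "column": "customer_id", "sequence": "customer_seq" }
    },
    "corder": {
      "column_vindexes": [{ "column": "customer_id", "name": "hash" }],
      "auto_increment": { "column": "order_id", "sequence": "order_seq" }
    }
  }
}' customer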
Finally, the last command updates the column type of the ID on the customer and corder tables.
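That last step is roughly an ALTER that widens the ID columns so they can hold the sequence-generated values; something along these lines (a sketch, so check the script for the exact statements):

# Sketch of the final schema change: make the ID columns BIGINT so they can
# store values handed out by the sequence tables.
vtctldclient ApplySchema --sql "
  ALTER TABLE customer MODIFY customer_id BIGINT NOT NULL;
  ALTER TABLE corder MODIFY order_id BIGINT NOT NULL;
" customer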
The next script is 302_new_shards.sh. This one is responsible for setting up the infrastructure for the sharded environment. If you take a peek inside this script, you can see that it spins up six instances of MySQL along with six corresponding VTTablets.
The first three instances are the primary, replica, and rdonly nodes for the first shard; the other three are the primary, replica, and rdonly nodes for the second shard. Two shards, each with a single primary being replicated to two other MySQL instances. This replication allows for better high availability and the ability to spread incoming queries across these nodes (with the correct use of the @replica syntax).
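If you want to see this topology for yourself, you can list the tablets and try a replica read through vtgate. The commands below follow the local example's defaults (vtgate's MySQL port is 15306), but treat them as a sketch:

# List the tablets registered for the customer keyspace; you should see a
# primary, replica, and rdonly tablet for each of the two new shards.
vtctldclient GetTablets --keyspace customer

# Route a read to a replica tablet by qualifying the keyspace with @replica.
mysql -h 127.0.0.1 -P 15306 -e 'use customer@replica; select * from customer;'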
The next step is to begin the sharding process from the unsharded keyspace to the sharded one. This is done by the 303_reshard.sh script. The key command in this script is:
vtctldclient Reshard --workflow cust2cust --target-keyspace customer create --source-shards '0' --target-shards '-80,80-'
This uses the vtctldclient command to run the Reshard workflow, and it specifies that we want to (re)shard the customer keyspace. The tables and data for this reshard are going to come from shard 0, which is the one that was created by the 200-level scripts. The data will be copied from there into the new shards, named -80 and 80-. The -80 and 80- represent the ID hash ranges each of the shards will store data for: all ID hashes that fall in the first half of the hash output will land on the -80 shard, and the rest will go to 80-. To learn more about sharded keyspaces, check out the Vitess docs.
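The copy phase runs in the background, and you can check on the workflow before doing anything else. Something like the following should report its state (the subcommand names here follow recent vtctldclient versions, so treat this as a sketch rather than a guaranteed invocation):

# Check on the resharding workflow while rows are being copied into the
# new -80 and 80- shards.
vtctldclient Reshard --workflow cust2cust --target-keyspace customer show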
303_reshard.sh copies the data into the new sharded keyspace, but traffic (queries) to the database is not routed to these new shards yet. To switch over the traffic, run 304_switch_reads.sh and 305_switch_writes.sh. After running these, the sharded nodes will handle queries to these two tables.
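Under the hood, those two scripts run the SwitchTraffic step of the same Reshard workflow, first for the read-only tablet types and then for the primary. Roughly (a sketch; check the scripts themselves for the exact flags):

# Roughly what 304_switch_reads.sh and 305_switch_writes.sh do: move replica
# and rdonly reads to the new shards first, then move primary (write) traffic.
vtctldclient Reshard --workflow cust2cust --target-keyspace customer SwitchTraffic --tablet-types=rdonly,replica
vtctldclient Reshard --workflow cust2cust --target-keyspace customer SwitchTraffic --tablet-types=primary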
Now that we have the new shards up and running, we no longer need the old unsharded keyspace for corder and customer. To shut down and remove the old nodes, run 306_down_shard_0.sh and 307_delete_shard_0.sh.
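The teardown boils down to stopping the shard-0 tablets and then removing the now-empty shard from the topology, roughly like this (the flags here are illustrative; the scripts are the source of truth):

# After the old tablets are stopped, remove shard 0 of the customer keyspace
# from the topology (it no longer serves any traffic).
vtctldclient DeleteShards --recursive customer/0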
You now have a horizontally sharded database. If you execute queries on the corder and customer tables, you should be able to see that some are handled by the first shard and others by the second.
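One way to see the split for yourself is to pin a query to each shard through vtgate using the keyspace/shard targeting syntax. The port and syntax below match the local example, but treat this as a sketch:

# Query each shard directly to see how the customer rows were divided between
# the -80 and 80- hash ranges.
mysql -h 127.0.0.1 -P 15306 -e 'use `customer/-80`; select count(*) from customer;'
mysql -h 127.0.0.1 -P 15306 -e 'use `customer/80-`; select count(*) from customer;'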