Horizontal Sharding

The DBAs you hired with massive troves of hipster cash are pinging you on Slack and are freaking out. With the amount of data that you’re loading up in your keyspaces, MySQL performance is starting to tank - it’s okay, you’re prepared for this! Although the query guardrails and connection pooling are cool features that Vitess can offer to a single unsharded keyspace, the real value comes into play with horizontal sharding.

Preparation

Before starting the resharding process, you need to make some decisions and prepare the system for horizontal resharding. Important note, this is something that should have been done before starting the vertical split. However, this is a good time to explain what normally would have been decided upon earlier the process.

Sequences

The first issue to address is the fact that customer and corder have auto-increment columns. This scheme does not work well in a sharded setup. Instead, Vitess provides an equivalent feature through sequences.

The sequence table is an unsharded single row table that Vitess can use to generate monotonically increasing ids. The syntax to generate an id is: select next :n values from customer_seq. The vttablet that exposes this table is capable of serving a very large number of such ids because values are cached and served out of memory. The cache value is configurable.

The VSchema allows you to associate a column of a table with the sequence table. Once this is done, an insert on that table transparently fetches an id from the sequence table, fills in the value, and routes the row to the appropriate shard. This makes the construct backward compatible to how MySQL’s auto_increment property works.

Since sequences are unsharded tables, they will be stored in the commerce database. The schema:

CREATE TABLE customer_seq (id int, next_id bigint, cache bigint, primary key(id)) comment 'vitess_sequence';
INSERT INTO customer_seq (id, next_id, cache) VALUES (0, 1000, 100);
CREATE TABLE order_seq (id int, next_id bigint, cache bigint, primary key(id)) comment 'vitess_sequence';
INSERT INTO order_seq (id, next_id, cache) VALUES (0, 1000, 100);

Note the vitess_sequence comment in the create table statement. VTTablet will use this metadata to treat this table as a sequence.

  • id is always 0
  • next_id is set to 1000: the value should be comfortably greater than the auto_increment max value used so far.
  • cache specifies the number of values to cache before vttablet updates next_id.

Larger cache values perform better, but will exhaust the values quicker since during reparent operations the new master will start off at the next_id value.

The VTGate servers also need to know about the sequence tables. This is done by updating the VSchema for commerce as follows:

{
  "tables": {
    "customer_seq": {
      "type": "sequence"
    },
    "order_seq": {
      "type": "sequence"
    },
    "product": {}
  }
}

Vindexes

The next decision is about the sharding keys, aka Primary Vindexes. This is a complex decision that involves the following considerations:

  • What are the highest QPS queries, and what are the where clauses for them?
  • Cardinality of the column; it must be high.
  • Do we want some rows to live together to support in-shard joins?
  • Do we want certain rows that will be in the same transaction to live together?

Using the above considerations, in our use case, we can determine that:

  • For the customer table, the most common where clause uses customer_id. So, it shall have a Primary Vindex.
  • Given that it has lots of users, its cardinality is also high.
  • For the corder table, we have a choice between customer_id and order_id. Given that our app joins customer with corder quite often on the customer_id column, it will be beneficial to choose customer_id as the Primary Vindex for the corder table as well.
  • Coincidentally, transactions also update corder tables with their corresponding customer rows. This further reinforces the decision to use customer_id as Primary Vindex.

NOTE: It may be worth creating a secondary lookup Vindex on corder.order_id. This is not part of the example. We will discuss this in the advanced section.

NOTE: For some use cases, customer_id may actually map to a tenant_id. In such cases, the cardinality of a tenant id may be too low. It’s also common that such systems have queries that use other high cardinality columns in their where clauses. Those should then be taken into consideration when deciding on a good Primary Vindex.

Putting it all together, we have the following VSchema for customer:

{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "customer": {
      "column_vindexes": [
        {
          "column": "customer_id",
          "name": "hash"
        }
      ],
      "auto_increment": {
        "column": "customer_id",
        "sequence": "customer_seq"
      }
    },
    "corder": {
      "column_vindexes": [
        {
          "column": "customer_id",
          "name": "hash"
        }
      ],
      "auto_increment": {
        "column": "order_id",
        "sequence": "order_seq"
      }
    }
  }
}

Note that we have now marked the keyspace as sharded. Making this change will also change how Vitess treats this keyspace. Some complex queries that previously worked may not work anymore. This is a good time to conduct thorough testing to ensure that all the queries work. If any queries fail, you can temporarily revert the keyspace as unsharded. You can go back and forth until you have got all the queries working again.

Since the primary vindex columns are BIGINT, we choose hash as the primary vindex, which is a pseudo-random way of distributing rows into various shards.

NOTE: For VARCHAR columns, use unicode_loose_md5. For VARBINARY, use binary_md5.

NOTE: All vindexes in Vitess are plugins. If none of the predefined vindexes suit your needs, you can develop your own custom vindex.

Now that we have made all the important decisions, it’s time to apply these changes:

./301_customer_sharded.sh

Create new shards

At this point, you have finalized your sharded VSchema and vetted all the queries to make sure they still work. Now, it’s time to reshard.

The resharding process works by splitting existing shards into smaller shards. This type of resharding is the most appropriate for Vitess. There are some use cases where you may want to spin up a new shard and add new rows in the most recently created shard. This can be achieved in Vitess by splitting a shard in such a way that no rows end up in the ‘new’ shard. However, it’s not natural for Vitess.

We have to create the new target shards:

./302_new_shards.sh

Shard 0 was already there. We have now added shards -80 and 80-. We’ve also added the CopySchema directive which requests that the schema from shard 0 be copied into the new shards.

Shard naming

What is the meaning of -80 and 80-? The shard names have the following characteristics:

  • They represent a range, where the left number is included, but the right is not.
  • Their notation is hexadecimal.
  • They are left justified.
  • A - prefix means: anything less than the RHS value.
  • A - postfix means: anything greater than or equal to the LHS value.
  • A plain - denotes the full keyrange.

What does this mean: -80 == 00-80 == 0000-8000 == 000000-800000

80- is not the same as 80-FF. This is why:

80-FF == 8000-FF00. Therefore FFFF will be out of the 80-FF range.

80- means: ‘anything greater than or equal to 0x80

A hash vindex produces an 8-byte number. This means that all numbers less than 0x8000000000000000 will fall in shard -80. Any number with the highest bit set will be >= 0x8000000000000000, and will therefore belong to shard 80-.

This left-justified approach allows you to have keyspace ids of arbitrary length. However, the most significant bits are the ones on the left.

For example an md5 hash produces 16 bytes. That can also be used as a keyspace id.

A varbinary of arbitrary length can also be mapped as is to a keyspace id. This is what the binary vindex does.

In the above case, we are essentially creating two shards: any keyspace id that does not have its leftmost bit set will go to -80. All others will go to 80-.

Applying the above change should result in the creation of six more vttablet instances.

At this point, the tables have been created in the new shards but have no data yet.

mysql --table < ../common/select_customer-80_data.sql
Using customer/-80
Customer
COrder
mysql --table < ../common/select_customer80-_data.sql
Using customer/80-
Customer
COrder

SplitClone

The process for SplitClone is similar to VerticalSplitClone. It starts the horizontal resharding process:

./303_horizontal_split.sh

This starts the following job “SplitClone -min_healthy_rdonly_tablets=1 customer/0”:

For large tables, this job could potentially run for many days, and can be restarted if failed. This job performs the following tasks:

  • Dirty copy data from customer/0 into the two new shards. But rows are split based on their target shards.
  • Stop replication on customer/0 rdonly tablet and perform a final sync.
  • Start a filtered replication process from customer/0 into the two shards by sending changes to one or the other shard depending on which shard the rows belong to.

Once SplitClone has completed, you should see this:

The horizontal counterpart to VerticalSplitDiff is SplitDiff. It can be used to validate the data integrity of the resharding process “SplitDiff -min_healthy_rdonly_tablets=1 customer/-80”:

NOTE: This example does not actually run this command.

Note that the last argument of SplitDiff is the target (smaller) shard. You will need to run one job for each target shard. Also, you cannot run them in parallel because they need to take an rdonly instance offline to perform the comparison.

NOTE: SplitDiff can be used to split shards as well as to merge them.

Cut over

Now that you have verified that the tables are being continuously updated from the source shard, you can cutover the traffic. This is typically performed in three steps: rdonly, replica and master:

For rdonly and replica:

./304_migrate_replicas.sh

For master:

./305_migrate_master.sh

During the master migration, the original shard master will first stop accepting updates. Then the process will wait for the new shard masters to fully catch up on filtered replication before allowing them to begin serving. Since filtered replication has been following along with live updates, there should only be a few seconds of master unavailability.

The replica and rdonly cutovers are freely reversible. Unlike the Vertical Split, a horizontal split is also reversible. You just have to add a -reverse_replication flag while cutting over the master. This flag causes the entire resharding process to run in the opposite direction, allowing you to Migrate in the other direction if the need arises.

You should now be able to see the data that has been copied over to the new shards.

mysql --table < ../common/select_customer-80_data.sql
Using customer/-80
Customer
+-------------+--------------------+
| customer_id | email              |
+-------------+--------------------+
|           1 | alice@domain.com   |
|           2 | bob@domain.com     |
|           3 | charlie@domain.com |
|           5 | eve@domain.com     |
+-------------+--------------------+
COrder
+----------+-------------+----------+-------+
| order_id | customer_id | sku      | price |
+----------+-------------+----------+-------+
|        1 |           1 | SKU-1001 |   100 |
|        2 |           2 | SKU-1002 |    30 |
|        3 |           3 | SKU-1002 |    30 |
|        5 |           5 | SKU-1002 |    30 |
+----------+-------------+----------+-------+

mysql --table < ../common/select_customer80-_data.sql
Using customer/80-
Customer
+-------------+----------------+
| customer_id | email          |
+-------------+----------------+
|           4 | dan@domain.com |
+-------------+----------------+
COrder
+----------+-------------+----------+-------+
| order_id | customer_id | sku      | price |
+----------+-------------+----------+-------+
|        4 |           4 | SKU-1002 |    30 |
+----------+-------------+----------+-------+

Clean up

After celebrating your second successful resharding, you are now ready to clean up the leftover artifacts:

./306_down_shard_0.sh

In this script, we just stopped all tablet instances for shard 0. This will cause all those vttablet and mysqld processes to be stopped. But the shard metadata is still present. We can clean that up with this command (after all vttablets have been brought down):

./307_delete_shard_0.sh

This command runs the following “DeleteShard -recursive customer/0”.

Beyond this, you will also need to manually delete the disk associated with this shard.

Next Steps

Feel free to experiment with your Vitess cluster! Execute the following when you are ready to teardown your example:

./401_teardown.sh