Introducing VDiff V2

Vitess is a solution that allows you to infinitely scale MySQL while providing clients and apps with a single logical view of the fleet of MySQL instances comprising any number of Keyspaces and Shards.

Vitess also provides the cluster and data management tools that make it possible to manage a massive cluster and perform complex workflows using VReplication, such as MoveTables, Reshard, and Materialize.

Why a Diff Tool? #

Data is typically one of the most critical assets within an organization. As such, an operator needs to be able to verify the correctness of this data, in particular as the data is moved around or otherwise transformed. For example, operators have wanted a way to verify data consistency after replicating data from one MySQL instance to another, or after dumping a table from one instance and loading it into another. However, even for a single table in these simplest of cases, performing a safe, reliable, lightweight, and performant online diff between two MySQL instances is a surprisingly difficult problem. Due to the challenges involved, there have been few attempted general solutions, the most notable being Percona Toolkit's pt-table-checksum.

With Vitess, the need for a data diff tool is even more pronounced because you'll be migrating data from your legacy systems into Vitess, migrating data across keyspaces, and performing a variety of other workflows. This is further complicated by the fact that these workflows may be done across MySQL versions and data centers, with differing schemas between the source and target, and over long time periods in which your data evolves. So it is critical to have a tool that can reliably perform a logical diff between the source and target of these workflows, in a timely manner, and without impacting production traffic.

VDiff #

Vitess already provided a solution, called VDiff, for diffing tables that are part of a VReplication workflow. I will walk through the basic algorithm, or flow, used for diffing each table in order to demonstrate the challenges involved and how we solved them in VDiff:

  • vtctld selects tablets in the source and target shards to use for the comparison — one per shard on each side
  • On the target tablets: stop the VReplication workflow for the VDiff operation, to "freeze" the state, and record the current GTID position in the VStream
  • On the source tablets:
    • wait for replication to catch up to at least where the target is (remember that the source instance may be a replica and the target a primary)
    • lock the table to get the current GTID_EXECUTED which gives us a logical point in time that will correspond to the read view in our upcoming transaction
    • issue START TRANSACTION WITH CONSISTENT SNAPSHOT
    • unlock the table, as we now have a consistent snapshot of the table data and the GTID metadata that are both at the same logical point in time with regards to the table we're diffing (see the sketch after this list)
  • On the target tablets:
    • start VReplication UNTIL we have reached that GTID_EXECUTED position in the VStream which matches the one we saved when setting up the read view on the source
    • issue START TRANSACTION WITH CONSISTENT SNAPSHOT (remember that the state is "frozen" on the target tablet) — now the target context is at the same logical point in time as the source for this table
  • On the source and target tablets: issue SELECT <cols> FROM <table> ORDER BY <pkcols> (for deterministic ordering and to avoid a filesort)
  • In vtctld: stream the results from those SELECTs, doing a merge sort from the shards, and compare the rows on both sides logically (as the schema may be different on either side), keeping a record of any differences seen; a simplified sketch of this merge-compare step appears below
  • On the target tablets: restart the VReplication workflow
  • On the source and target tablets: close the open transaction with a ROLLBACK
  • Finally, the vtctl client prints a report (to STDOUT) of the results
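
To make the source-side snapshot steps concrete, here is a minimal sketch in Go (not the actual Vitess code) of how a tablet could pin a consistent read view and capture the matching GTID position. The openSnapshot function and its parameters are illustrative. Note that two connections are required: in MySQL, starting a transaction on the session that holds the table lock would implicitly release that lock.

```go
package snapshot

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL driver for database/sql
)

// openSnapshot pins a consistent read view of a table and returns the GTID
// set that corresponds to it. The caller must ROLLBACK and close the
// returned connection when the diff of this table is complete.
func openSnapshot(ctx context.Context, db *sql.DB, table string) (*sql.Conn, string, error) {
	lockConn, err := db.Conn(ctx)
	if err != nil {
		return nil, "", err
	}
	defer lockConn.Close() // also releases the table lock on any error path

	// Block writes to the table while we capture its logical point in time.
	if _, err := lockConn.ExecContext(ctx, fmt.Sprintf("LOCK TABLES %s READ", table)); err != nil {
		return nil, "", err
	}
	var gtid string
	if err := lockConn.QueryRowContext(ctx, "SELECT @@GLOBAL.gtid_executed").Scan(&gtid); err != nil {
		return nil, "", err
	}

	// The snapshot must live on a second connection: beginning a transaction
	// on lockConn would implicitly release its table lock.
	snapConn, err := db.Conn(ctx)
	if err != nil {
		return nil, "", err
	}
	// The read view is created here, while writes are still blocked, so the
	// snapshot and the GTID set describe the same logical point in time.
	if _, err := snapConn.ExecContext(ctx, "START TRANSACTION WITH CONSISTENT SNAPSHOT"); err != nil {
		snapConn.Close()
		return nil, "", err
	}
	if _, err := lockConn.ExecContext(ctx, "UNLOCK TABLES"); err != nil {
		snapConn.Close()
		return nil, "", err
	}
	return snapConn, gtid, nil
}
```
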
For large tables, holding a transaction open on the source tablets can have a significant impact on normal query traffic, due to InnoDB MVCC needing to keep those older versions of rows around if they are updated after the transaction started (innodb_history_list_length). For this reason, I would recommend always using REPLICA tablets for VDiff operations whenever you can (when the source is an unmanaged tablet, such as when moving from RDS into Vitess, you may only have a PRIMARY tablet available). You can control that using the --tablet_types=REPLICA flag for the VDiff command. In v14+ the default was changed to --tablet_types=in_order:RDONLY,REPLICA,PRIMARY.
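
The comparison step itself is conceptually simple because each shard's SELECT returns rows in primary-key order, so the comparer only ever needs to look at the head of each stream. Below is a simplified, hypothetical sketch of that merge-compare for a single source stream and a single target stream; the real implementation merge-sorts across many shards and compares values in a type-aware way rather than as strings.

```go
package diff

import "slices"

// Row is one table row, delivered in primary-key order by the SELECT.
type Row struct {
	PK   int64    // simplified: a single integer primary key
	Cols []string // simplified: column values rendered as strings
}

// Report tallies the differences found between two ordered row streams.
type Report struct {
	MissingOnTarget int // rows present on the source but not the target
	ExtraOnTarget   int // rows present on the target but not the source
	Mismatched      int // rows present on both sides with differing values
	Matched         int
}

// diffStreams walks two primary-key-ordered channels in lockstep, always
// advancing the side with the smaller key, so each row is examined exactly
// once and nothing needs to be buffered in memory.
func diffStreams(source, target <-chan Row) Report {
	var r Report
	s, sOK := <-source
	t, tOK := <-target
	for sOK && tOK {
		switch {
		case s.PK < t.PK: // row exists only on the source
			r.MissingOnTarget++
			s, sOK = <-source
		case s.PK > t.PK: // row exists only on the target
			r.ExtraOnTarget++
			t, tOK = <-target
		default: // same primary key: compare the column values
			if slices.Equal(s.Cols, t.Cols) {
				r.Matched++
			} else {
				r.Mismatched++
			}
			s, sOK = <-source
			t, tOK = <-target
		}
	}
	// Drain whichever side still has rows; they have no counterpart.
	for sOK {
		r.MissingOnTarget++
		_, sOK = <-source
	}
	for tOK {
		r.ExtraOnTarget++
		_, tOK = <-target
	}
	return r
}
```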

The original version worked very well, but it suffered from some limitations that posed challenges in certain situations, such as when working with very large tables. For example, if you have over 1TiB of data that needs to be compared, the VDiff could take a week to complete. If during this time you hit any failure, such as one of the MySQL connections getting closed (e.g. due to wait_timeout or net_write_timeout), then you'd have to start the entire operation over again from scratch.

We processed feedback from Vitess users over the course of 2+ years as they used VDiff in production and a set of underlying issues started to become clear:

  • Fragility — any connection loss, process failure, failover, etc. would cause the VDiff to fail and need to be re-run
  • Synchronous command — the vtctl client command would block until the VDiff completed, which posed some challenges and required a stable machine where e.g. a tmux session could be used for the client call
  • VTCtld as the controller — the Vitess cluster management daemon is generally a lightweight process used to coordinate complex operations that span many Vitess components. It's not designed to be used for operations that span days and require the resources needed to compare 100s of GiBs of data
    • Network traffic — the tablets on each side of the VDiff streamed their data to the vtctld process which then compared the data. This generated a lot of network traffic which could become a bottleneck and impact overall network bandwidth and latency. Keep in mind that it's common for the data involved to reside in 3 or more failure domains / availability zones.
  • No progress reporting — the VDiff could run for days without any indication of overall progress
  • Execution time — the VDiff could take days or weeks to complete for very large tables, in large part because there was very little concurrency with a single vtctld process doing the bulk of the work

We set out to create a new version of VDiff that addressed all of these issues.

VDiff V2 #

We started by largely rearranging the existing VDiff code so that, instead of being managed and controlled by a single vtctld, it is managed and executed — in parallel — by the primary tablet in each shard on the target side. This offers parallelism while also reducing the amount of network traffic needed to perform the diff. The operation was also made asynchronous, with the VDiff Show client command gathering and reporting the results of the VDiff operation from each of the target shards involved.

We then made VDiffs resumable, so that if a failure occurs during the diff, the operation can be resumed from where it left off. This also makes it possible to do a rolling or incremental VDiff, where you may perform the VDiff immediately after a workflow completes and then again just before doing a cutover for added confidence, as there may be weeks between those two stages. From there we added support for auto-restarting a VDiff if any ephemeral/recoverable error occurs. This means that you can have process crashes, failovers, network issues, etc., and the VDiff will automatically recover and continue running.
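
The core idea behind resumability can be sketched in a few lines: persist the last primary key that was fully compared as the diff advances, and a restarted diff can continue from a WHERE clause instead of from the beginning. The vdiff_progress table, its columns, and the single-column pk below are hypothetical placeholders, not the actual state schema VDiff v2 uses.

```go
package progress

import (
	"context"
	"database/sql"
	"fmt"
)

// saveCheckpoint durably records the last primary key that was fully
// compared; it would be called periodically as the diff advances.
func saveCheckpoint(ctx context.Context, db *sql.DB, table string, lastPK int64) error {
	_, err := db.ExecContext(ctx,
		`INSERT INTO vdiff_progress (tbl, last_pk) VALUES (?, ?)
		 ON DUPLICATE KEY UPDATE last_pk = VALUES(last_pk)`,
		table, lastPK)
	return err
}

// resumeQuery rebuilds the streaming SELECT so that a restarted diff picks
// up after the checkpoint instead of re-reading the whole table.
func resumeQuery(table string, lastPK int64) string {
	return fmt.Sprintf("SELECT * FROM %s WHERE pk > %d ORDER BY pk", table, lastPK)
}
```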

We also added progress reporting so that you have some idea of how much work the VDiff has done, how much is left, and have an ETA for when it's likely to complete. This gives you greater peace of mind while a longer operation runs and better allows you to prepare for the next step once the VDiff completes.
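
Under the hood, this kind of reporting reduces to simple rate math over row counts, along the lines of the following sketch (the function and its inputs are illustrative, not VDiff's actual code):

```go
package progress

import "time"

// eta derives a completion percentage and a rough time remaining from the
// rows compared so far. The total row count would come from table
// statistics, so the estimate is approximate by nature.
func eta(rowsDone, totalRows int64, elapsed time.Duration) (pct float64, remaining time.Duration) {
	if rowsDone == 0 || totalRows == 0 {
		return 0, 0 // nothing to extrapolate from yet
	}
	pct = 100 * float64(rowsDone) / float64(totalRows)
	rowsPerSec := float64(rowsDone) / elapsed.Seconds()
	remaining = time.Duration(float64(totalRows-rowsDone) / rowsPerSec * float64(time.Second))
	return pct, remaining
}
```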

There were a variety of other minor improvements as well. In total, we hope that this new version addresses the major set of issues that users had and provides a solid base for us to continue making further improvements.

Conclusion #

Vitess VReplication offers a set of powerful features that allow users to manage data workflows when that data is spread across a large fleet of MySQL instances. VDiff then provides an invaluable tool for verifying the correctness of these complex operations, giving you confidence and peace of mind as you execute the data operations required to better meet your evolving business needs and objectives over time.

Please try out VDiff v2 in Vitess 15.0 — where it's marked as experimental — and provide feedback! We hope to mark it as GA/production-ready in the upcoming 16.0 release and your feedback is invaluable. Special shout out to Arthur Schreiber @ GitHub for providing a lot of great early feedback that's helping to make the feature better! ♥️

Happy data migrations! 🚀 🚀 🚀