Recoverable, failover agnostic migrations

Vitess’s managed schema changes offer failover-agnostic migrations when using the VReplication-based online strategy.

Normally, a schema migration is coupled to the MySQL server it operates on. A gh-ost or pt-online-schema-change migration, as well as a plain direct migration, can only complete on the same server where it started. Any form of failover, whether planned or unplanned, either breaks the migration or makes it obsolete.

Migrations using the online strategy are agnostic to server promotion. A migration can begin on one primary tablet and complete on another tablet that was promoted to primary during the migration. In large part, this is a direct result of how VReplication works.

online migrations automatically survive the following events, as long as no more than 10 minutes pass between the failure or demotion of the previous primary tablet and the promotion of the new primary tablet:

  • A planned failover (via PlannedReparentShard)
  • An emergency reparent (via EmergencyReparentShard)
  • An unexpected external reparent

Behavior and limitations

During a failover, whether planned or unplanned, an online migration’s VReplication stream is interrupted while copying or applying data. VReplication persists the state of the data transfer in the same transaction as the transferred data itself. As a result, any replica has a consistent view of the migration's state, even if that replica lags behind the primary.
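The idea of persisting progress in the same transaction as the data can be illustrated with a small, self-contained sketch. This is not Vitess code: it uses SQLite in place of MySQL, and the `target` and `progress` tables and the `copy_batch` helper are hypothetical names chosen for the example. It only shows why an interrupted batch leaves the copied rows and the recorded position in agreement.

```python
import sqlite3


def setup():
    """Create an in-memory database with a data table and a progress table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
    conn.execute("CREATE TABLE progress (last_id INTEGER NOT NULL)")
    conn.execute("INSERT INTO progress VALUES (0)")
    conn.commit()
    return conn


def copy_batch(conn, rows, fail=False):
    """Write one batch of rows and the new position in a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany("INSERT INTO target (id, val) VALUES (?, ?)", rows)
        if fail:
            # Simulated interruption after the rows were written but
            # before the position was recorded.
            raise RuntimeError("simulated failure mid-batch")
        conn.execute("UPDATE progress SET last_id = ?", (rows[-1][0],))


def state(conn):
    """Return (number of copied rows, last recorded position)."""
    copied = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
    last_id = conn.execute("SELECT last_id FROM progress").fetchone()[0]
    return copied, last_id
```

Because the rows and the position commit or roll back together, a process that resumes from `last_id` never skips rows and never re-applies rows that were already committed.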

When a replica tablet is promoted to primary, it notices the VReplication stream, which is expected to be active and running, and sets up the connections and processes needed to resume the stream's work. Some retries may take place as the stream re-evaluates its source of data.

The Online DDL Scheduler detects the running stream and identifies it as having been created by a different tablet. It assumes ownership of the stream and follows its progress until completion.
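The ownership handover can be pictured with the following minimal sketch. It is an illustrative model only, not the scheduler's actual implementation: the `Migration` class, its `owner_tablet` field, the tablet aliases, and the `adopt_if_foreign` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Migration:
    uuid: str
    owner_tablet: str  # hypothetical: alias of the tablet that currently owns the stream
    status: str        # e.g. "running", "complete", "failed"


def adopt_if_foreign(migration: Migration, this_tablet: str) -> bool:
    """Take ownership of a running migration started by another tablet.

    Returns True if ownership changed, False otherwise.
    """
    if migration.status != "running":
        return False  # nothing to adopt
    if migration.owner_tablet == this_tablet:
        return False  # already owned by this tablet
    migration.owner_tablet = this_tablet
    return True
```

For example, a newly promoted primary with the assumed alias `zone1-0000000102` calling `adopt_if_foreign` on a running migration owned by `zone1-0000000101` would become its owner, and a second call would be a no-op.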

The stream must be no more than 10 minutes stale; otherwise, the scheduler marks the migration as failed.
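The staleness rule amounts to a simple time comparison, sketched below. The function name, its parameters, and the idea of a single "last seen alive" timestamp are assumptions made for illustration; only the 10 minute threshold comes from the text above.

```python
from datetime import datetime, timedelta

# Threshold taken from the text above; everything else is illustrative.
MAX_STALENESS = timedelta(minutes=10)


def evaluate_stream(last_alive: datetime, now: datetime,
                    max_staleness: timedelta = MAX_STALENESS) -> str:
    """Return "failed" if the stream has been stale for too long, else "running"."""
    if now - last_alive > max_staleness:
        return "failed"
    return "running"
```

Under this sketch, a stream last seen alive 9 minutes before the check is still treated as running, while one last seen 11 minutes before is marked as failed.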

There is no limitation on the number of failovers an online migration can survive.

No user action is required. Immediately after a promotion or failover, the migration will appear to make no progress. It is likely to show progress again within 1 or 2 minutes of the promotion.

