Troubleshooting Distributed Atomic Transactions

Vitess supports atomic transactions across multiple shards of a sharded keyspace using 2PC (Two-Phase Commit). You can find more details on configuring your cluster for distributed transactions in the Distributed Transactions documentation.

Although distributed transactions are designed to run smoothly, issues may occasionally occur. This guide will help you troubleshoot and resolve these issues.

Finding Issues

There are several methods to determine whether something has gone wrong in distributed transactions:

  1. Monitor vttablet metrics - The vttablet process exposes various metrics that can be used to monitor the health of the system. The ones to watch are:
    1. CommitPreparedFail - This metric indicates the number of times a commit has failed after being prepared. It is categorized into two classes of errors: retryable and non-retryable. Non-retryable errors will require human intervention.
    2. RedoPreparedFail - Similar to CommitPreparedFail, this metric shows the number of times redoing a transaction has failed after preparation. Non-retryable errors here will also need human intervention.
  2. Unresolved-list CLI Command - This CLI command lists unresolved transactions in the system. You can check the state of these transactions to determine if any are stuck in a failed state.
  3. Transactions page in VTAdmin - The VTAdmin UI provides a page to view all unresolved transactions in the system. This view can be used to get the DTIDs for such transactions.
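As a rough illustration of the metric check in item 1, the sketch below scans an already-parsed metrics payload for non-retryable failures. The metric names CommitPreparedFail and RedoPreparedFail come from this guide. The payload shape (each metric as a mapping from error class to count) and the "NonRetryable" label are assumptions made for the example, so adjust them to match what your vttablet actually reports.

```python
import json

# Metric names are taken from this guide. The payload shape and the
# "NonRetryable" label below are assumptions for illustration only.
WATCHED_METRICS = ("CommitPreparedFail", "RedoPreparedFail")


def non_retryable_failures(metrics, label="NonRetryable"):
    """Return {metric_name: count} for watched metrics with a non-zero count under `label`."""
    found = {}
    for name in WATCHED_METRICS:
        # A missing or empty metric is treated as "no failures".
        per_class = metrics.get(name) or {}
        count = int(per_class.get(label, 0))
        if count > 0:
            found[name] = count
    return found


if __name__ == "__main__":
    # Hypothetical sample payload, not real vttablet output.
    sample = json.loads(
        '{"CommitPreparedFail": {"Retryable": 3, "NonRetryable": 1},'
        ' "RedoPreparedFail": {}}'
    )
    print(non_retryable_failures(sample))
```

Any metric returned by this function points to failures that will need human intervention, so it is a reasonable condition to alert on.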

Fixing Issues

Once a transaction has been identified as irreversibly failed, human intervention is required to correct the database state. You will need the DTID(s) from the previous step. Follow these steps:

  1. Query the redo_state and redo_statement tables on all participating shards using the DTID to get more information about the transaction.
  2. Each shard will display the status of the transaction on that shard. For shards where the transaction has failed, the information will include the writes that were part of the transaction.
  3. Using this information, rerun the relevant writes directly on each of the failing shards.
  4. After making the necessary changes, conclude the transaction by running the Conclude CLI command.
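Step 1 can be sketched as a small helper that builds the lookup queries for a given DTID. The table names redo_state and redo_statement come from this guide. The `_vt` database name, the `dtid` column, and the example DTID format are assumptions, so verify them against your own cluster before running anything.

```python
import re

# Conservative allow-list for the DTID so it can be embedded in a query string.
# The permitted character set is an assumption for this sketch.
_SAFE_DTID = re.compile(r"^[A-Za-z0-9_:.\-]+$")


def redo_queries(dtid, sidecar_db="_vt"):
    """Build the two inspection queries for one DTID.

    Assumes both tables live in `sidecar_db` and have a `dtid` column.
    """
    if not _SAFE_DTID.match(dtid):
        raise ValueError(f"unexpected characters in DTID: {dtid!r}")
    return [
        f"SELECT * FROM {sidecar_db}.redo_state WHERE dtid = '{dtid}'",
        f"SELECT * FROM {sidecar_db}.redo_statement WHERE dtid = '{dtid}'",
    ]


if __name__ == "__main__":
    # "ks:-80:42" is a made-up DTID used only to show the output shape.
    for query in redo_queries("ks:-80:42"):
        print(query)
```

Run the generated queries on the primary of each participating shard, then compare the returned statements with the data on that shard before reapplying any writes.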

If you do run into distributed transaction failures, please open a GitHub issue to allow further investigation and resolution of the problem.

