Troubleshooting Distributed Atomic Transactions

Vitess supports atomic transactions across multiple shards of a sharded keyspace using two-phase commit (2PC). For details on configuring your cluster for distributed transactions, see the Distributed Transactions documentation.

Although distributed transactions are designed to run smoothly, issues may occasionally occur. This guide will help you troubleshoot and resolve these issues.

Finding Issues

There are several methods to determine whether something has gone wrong in distributed transactions:

  1. Monitor vttablet metrics - The vttablet process exposes metrics that can be used to monitor the health of the system. Watch these two in particular:
    1. CommitPreparedFail - This metric indicates the number of times a commit has failed after being prepared. It is categorized into two classes of errors: retryable and non-retryable. Non-retryable errors will require human intervention.
    2. RedoPreparedFail - Similar to CommitPreparedFail, this metric shows the number of times redoing a transaction has failed after preparation. Non-retryable errors here will also need human intervention.
  2. Unresolved-list CLI Command - This CLI command lists unresolved transactions in the system. You can check the state of these transactions to determine if any are stuck in a failed state.
  3. Transactions page in VTAdmin - The VTAdmin UI provides a page to view all unresolved transactions in the system. This view can be used to get the DTIDs for such transactions.
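
As a rough illustration of the metrics check in item 1, the sketch below scans a metrics snapshot for non-retryable failures. It assumes the snapshot is a JSON object in which each metric maps failure-class labels to counts, and that the non-retryable class is labeled something like "NonRetryable". Both the shape and the label spelling are assumptions, so compare them against what your vttablet actually exports before relying on this.

```python
import json

# Metric names are taken from the list above. The label spelling for the
# non-retryable class is an assumption; adjust it to match your vttablet.
METRICS = ("CommitPreparedFail", "RedoPreparedFail")
NON_RETRYABLE = "nonretryable"


def _normalize(label: str) -> str:
    """Lowercase a label and drop punctuation so 'Non-Retryable' also matches."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def non_retryable_failures(snapshot: dict) -> dict:
    """Return {metric: count} for each metric with non-retryable failures > 0."""
    flagged = {}
    for metric in METRICS:
        by_label = snapshot.get(metric)
        if not isinstance(by_label, dict):
            continue  # metric missing or not broken down by failure class
        count = sum(
            value for label, value in by_label.items()
            if _normalize(label) == NON_RETRYABLE
        )
        if count > 0:
            flagged[metric] = count
    return flagged


# Hypothetical snapshot, for illustration only.
sample = json.loads(
    '{"CommitPreparedFail": {"Retryable": 3, "NonRetryable": 1},'
    ' "RedoPreparedFail": {"Retryable": 2}}'
)
print(non_retryable_failures(sample))  # {'CommitPreparedFail': 1}
```

Any metric returned by this function points to a transaction that needs human intervention, as described in the next section.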

Fixing Issues

Once a transaction has been identified as irreversibly failed, human intervention is required to correct the database state. You will need the DTID(s) from the previous step. Follow these steps:

  1. Obtain detailed information about the transaction using its DTID by running the Get-info CLI command.
  2. The command output will list the participating shards in the transaction, along with the status of the transaction on each shard. For shards where the transaction has failed, it will also display the writes that were part of the transaction.
  3. Using this information, rerun the relevant writes directly on each of the failing shards.
  4. After making the necessary changes, conclude the transaction by running the Conclude CLI command.

You can also find this information on the VTAdmin page by navigating to the transactions section and clicking on the transaction in question.
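
The CLI steps above could be scripted roughly as follows. This sketch only builds and prints the command lines; it does not run them. The `DistributedTransaction` subcommand group, the `--keyspace` and `--dtid` flags, the server address, and the keyspace name are all assumptions for illustration, so confirm the exact syntax with `vtctldclient --help` for your Vitess version.

```python
import shlex

# Assumed subcommand group name; verify against your vtctldclient version.
GROUP = "DistributedTransaction"


def build_command(server: str, action: str, **flags: str) -> list:
    """Build the argv list for one distributed-transaction CLI call."""
    argv = ["vtctldclient", "--server", server, GROUP, action]
    for name, value in flags.items():
        # Python keyword 'abandon_age' would become the flag '--abandon-age'.
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


def resolution_plan(server: str, keyspace: str, dtid: str) -> list:
    """Return the commands for the manual resolution procedure, in order."""
    return [
        # Find unresolved transactions and their DTIDs.
        build_command(server, "unresolved-list", keyspace=keyspace),
        # Inspect participating shards, per-shard status, and pending writes.
        build_command(server, "get-info", dtid=dtid),
        # The failed writes are rerun by hand on each failing shard here.
        # Only then is the transaction concluded.
        build_command(server, "conclude", dtid=dtid),
    ]


# Hypothetical server address, keyspace, and DTID placeholder.
for argv in resolution_plan("localhost:15999", "customer", "<dtid>"):
    print(shlex.join(argv))
```

Printing the commands rather than executing them is deliberate: rerunning the writes between `get-info` and `conclude` is a manual step, and concluding a transaction before the writes are applied would leave the shards inconsistent.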

If you do run into distributed transaction failures, please open a GitHub issue to allow further investigation and resolution of the problem.

