VTOrc is the automated fault detection and repair tool of Vitess.
It follows these high-level steps to find and fix problems:
- It queries the vttablets every instance-poll-time to gather information about the underlying MySQL instances.
- It also contacts the topology-server every topo-information-refresh-duration to learn the desired state of the topology.
- Based on the information collected, VTOrc checks whether there are any configuration issues on any of the MySQL instances and takes appropriate action.
- These issues are then fixed by issuing RPCs to the associated vttablets.
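The comparison step in the list above can be sketched as follows. This is a minimal illustration, not VTOrc's implementation: `TabletState`, `detect_issues`, the tablet aliases, and the two read-only checks are all hypothetical names and rules chosen for the example, and the polling intervals are left out so that only the "observed state versus desired state" logic is shown.

```python
from dataclasses import dataclass


@dataclass
class TabletState:
    """Observed state of one MySQL instance (hypothetical, simplified)."""
    alias: str
    read_only: bool


def detect_issues(observed: list[TabletState], desired_primary: str) -> list[str]:
    """Compare observed state with the desired topology and list the fixes needed.

    The two rules below are illustrative assumptions, not VTOrc's real analyses.
    """
    fixes = []
    for tablet in observed:
        should_be_primary = tablet.alias == desired_primary
        if should_be_primary and tablet.read_only:
            # The desired primary cannot accept writes.
            fixes.append(f"set {tablet.alias} read-write")
        elif not should_be_primary and not tablet.read_only:
            # A replica is accepting writes.
            fixes.append(f"set {tablet.alias} read-only")
    return fixes


# Observed state as it might be gathered from the tablets (made-up values).
observed = [
    TabletState("zone1-100", read_only=True),
    TabletState("zone1-101", read_only=False),
    TabletState("zone1-102", read_only=True),
]
# Desired state as it might be read from the topology-server (made-up value).
fixes = detect_issues(observed, desired_primary="zone1-100")
print(fixes)
```

In this sketch each returned string stands in for an RPC that would be sent to the corresponding tablet.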
Coordination among VTOrc instances and vtctld
Users are encouraged to run multiple instances of VTOrc monitoring the same cluster because VTOrc, like any other service, can fail for reasons outside its control. Users also run vtctld instances, which can be used to run commands that alter the desired topology (PlannedReparentShard) and durability requirements (SetKeyspaceDurabilityPolicy).
The natural question is how to ensure coordination between multiple VTOrc instances and vtctld.
We do so by using the existing central topology-server. Each of these services acquires a shard lock before it runs any fixes. This ensures that only one actor is trying to alter the cluster at any given point in time.
Another concern is recoveries run on stale data collected by VTOrc. Because VTOrc instances poll to load the information they use for fault detection, they can sometimes read outdated information. To prevent incorrect or unnecessary recoveries, every VTOrc instance refreshes the local information it needs for the fix after acquiring the shard lock.
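The lock-then-refresh pattern described above can be sketched as follows. This is a simplified model under stated assumptions: an in-process `threading.Lock` stands in for the shard lock held in the topology-server, and `shard_lock`, `recover`, `refresh`, `fix`, and the `commerce/0` shard name are hypothetical names introduced for the example, not Vitess APIs.

```python
import threading
from contextlib import contextmanager

# Stand-in for per-shard locks in the topology-server (assumption: a real
# deployment uses a distributed lock, not an in-process one).
_shard_locks: dict[str, threading.Lock] = {}
_registry_guard = threading.Lock()


@contextmanager
def shard_lock(keyspace_shard: str):
    """Hold the lock for one shard for the duration of the with-block."""
    with _registry_guard:
        lock = _shard_locks.setdefault(keyspace_shard, threading.Lock())
    with lock:
        yield


def recover(keyspace_shard, cached_problem, refresh, fix):
    """Run `fix` only if the polled problem is still present after a fresh read.

    `refresh` re-reads the current problem (or None) while the lock is held;
    `fix` applies the repair. Both are supplied by the caller.
    """
    with shard_lock(keyspace_shard):
        current_problem = refresh()
        if current_problem != cached_problem:
            return "skipped: stale data"
        fix(current_problem)
        return "fixed"


# Shared cluster state for the demo (made-up problem description).
state = {"problem": "primary is read-only"}


def refresh():
    return state["problem"]


def fix(problem):
    state["problem"] = None


# Two actors polled the same problem earlier; only the first repair runs.
first = recover("commerce/0", "primary is read-only", refresh, fix)
second = recover("commerce/0", "primary is read-only", refresh, fix)
print(first, "|", second)
```

The second call finds that the problem it polled earlier no longer matches the refreshed state, so it releases the lock without issuing a repair.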