VTOrc is the automated fault detection and repair tool of Vitess.

Example Usage

Start VTOrc as follows:

export TOPOLOGY_FLAGS="--topo_implementation etcd2 --topo_global_server_address localhost:2379 --topo_global_root /vitess/global"
export VTDATAROOT="/tmp"

vtorc \
  $TOPOLOGY_FLAGS \
  --log_dir $VTDATAROOT/tmp \
  --port 15000 \
  --recovery-period-block-duration "10m" \
  --instance-poll-time "1s" \
  --topo-information-refresh-duration "30s"
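
The command above can be narrowed to specific keyspaces with --clusters_to_watch. The following is a minimal sketch, not a recommended production setup: it reuses the topology settings from the example, takes the keyspace list "ks1,ks2/-80" from that flag's help text (the keyspaces themselves are hypothetical), and only prints the resulting command so it can be reviewed before anything is started.

```shell
#!/bin/sh
# Sketch only: assemble the VTOrc arguments and print them (dry run).
TOPOLOGY_FLAGS="--topo_implementation etcd2 --topo_global_server_address localhost:2379 --topo_global_root /vitess/global"
VTDATAROOT="/tmp"

# $TOPOLOGY_FLAGS is left unquoted on purpose so it splits into separate words.
# "ks1,ks2/-80" is the example value from the --clusters_to_watch help text;
# the keyspaces are hypothetical.
set -- $TOPOLOGY_FLAGS \
  --log_dir "$VTDATAROOT/tmp" \
  --port 15000 \
  --clusters_to_watch "ks1,ks2/-80"

# Print the full command for review; run `vtorc "$@"` instead to start VTOrc.
echo vtorc "$@"
```

Keeping the arguments in the positional parameters rather than in one long string means each flag and value stays a separate word when the command is eventually run.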

Options

The following command line options apply to VTOrc:

| Name | Type | Description |
| --- | --- | --- |
| --alsologtostderr | boolean | log to standard error as well as files |
| --audit-file-location | string | File location where the audit logs are to be stored |
| --audit-purge-duration | duration | Duration for which audit logs are held before being purged. Should be in multiples of days (default 168h0m0s) |
| --audit-to-backend | boolean | Whether to store the audit log in the VTOrc database |
| --audit-to-syslog | boolean | Whether to store the audit log in the syslog |
| --catch-sigpipe | boolean | catch and ignore SIGPIPE on stdout and stderr if specified |
| --clusters_to_watch | strings | Comma-separated list of keyspaces or keyspace/shards that this instance will monitor and repair. Defaults to all clusters in the topology. Example: "ks1,ks2/-80" |
| --config | string | config file name |
| --consul_auth_static_file | string | JSON File to read the topos/tokens from. |
| --grpc_auth_static_client_creds | string | When using grpc_static_auth in the server, this file provides the credentials to use to authenticate with server. |
| --grpc_compression | string | Which protocol to use for compressing gRPC. Default: nothing. Supported: snappy |
| --grpc_enable_tracing | boolean | Enable gRPC tracing. |
| --grpc_initial_conn_window_size | int | gRPC initial connection window size |
| --grpc_initial_window_size | int | gRPC initial window size |
| --grpc_keepalive_time | duration | After a duration of this time, if the client doesn't see any activity, it pings the server to see if the transport is still alive. (default 10s) |
| --grpc_keepalive_timeout | duration | After having pinged for keepalive check, the client waits for a duration of Timeout and if no activity is seen even after that the connection is closed. (default 10s) |
| --grpc_max_message_size | int | Maximum allowed RPC message size. Larger messages will be rejected by gRPC with the error 'exceeding the max size'. (default 16777216) |
| --grpc_prometheus | boolean | Enable gRPC monitoring with Prometheus. |
| -h, --help | boolean | display usage and exit |
| --instance-poll-time | duration | Timer duration on which VTOrc refreshes MySQL information (default 5s) |
| --keep_logs | duration | keep logs for this long (using ctime) (zero to keep forever) |
| --keep_logs_by_mtime | duration | keep logs for this long (using mtime) (zero to keep forever) |
| --lameduck-period | duration | keep running at least this long after SIGTERM before stopping (default 50ms) |
| --lock-timeout | duration | Maximum time for which a shard/keyspace lock can be acquired for (default 45s) |
| --log_backtrace_at | traceLocation | when logging hits line file:N, emit a stack trace (default :0) |
| --log_dir | string | If non-empty, write log files in this directory |
| --log_err_stacks | boolean | log stack traces for errors |
| --log_rotate_max_size | uint | size in bytes at which logs are rotated (glog.MaxSize) (default 1887436800) |
| --logtostderr | boolean | log to standard error instead of files |
| --onclose_timeout | duration | wait no more than this for OnClose handlers before stopping (default 10s) |
| --onterm_timeout | duration | wait no more than this for OnTermSync handlers before stopping (default 10s) |
| --pid_file | string | If set, the process will write its pid to the named file, and delete it on graceful shutdown. |
| --port | int | port for the server |
| --pprof | strings | enable profiling |
| --prevent-cross-cell-failover | boolean | Prevent VTOrc from promoting a primary in a different cell than the current primary in case of a failover |
| --purge_logs_interval | duration | how often to try to remove old logs (default 1h0m0s) |
| --reasonable-replication-lag | duration | Maximum replication lag on replicas which is deemed to be acceptable (default 10s) |
| --recovery-period-block-duration | duration | Duration for which a new recovery is blocked on an instance after running a recovery (default 30s) |
| --recovery-poll-duration | duration | Timer duration on which VTOrc polls its database to run a recovery (default 1s) |
| --remote_operation_timeout | duration | time to wait for a remote operation (default 15s) |
| --security_policy | string | the name of a registered security policy to use for controlling access to URLs - empty means allow all for anyone (built-in policies: deny-all, read-only) |
| --shutdown_wait_time | duration | Maximum time to wait for VTOrc to release all the locks that it is holding before shutting down on SIGTERM (default 30s) |
| --snapshot-topology-interval | duration | Timer duration on which VTOrc takes a snapshot of the current MySQL information it has in the database. Should be in multiples of hours |
| --sqlite-data-file | string | SQLite Datafile to use as VTOrc's database (default "file::memory:?mode=memory&cache=shared") |
| --stderrthreshold | severity | logs at or above this threshold go to stderr (default 1) |
| --tablet_manager_grpc_ca | string | the server ca to use to validate servers when connecting |
| --tablet_manager_grpc_cert | string | the cert to use to connect |
| --tablet_manager_grpc_concurrency | int | concurrency to use to talk to a vttablet server for performance-sensitive RPCs (like ExecuteFetchAs{Dba,AllPrivs,App}) (default 8) |
| --tablet_manager_grpc_connpool_size | int | number of tablets to keep tmclient connections open to (default 100) |
| --tablet_manager_grpc_crl | string | the server crl to use to validate server certificates when connecting |
| --tablet_manager_grpc_key | string | the key to use to connect |
| --tablet_manager_grpc_server_name | string | the server name to use to validate server certificate |
| --tablet_manager_protocol | string | Protocol to use to make tabletmanager RPCs to vttablets. (default "grpc") |
| --topo-information-refresh-duration | duration | Timer duration on which VTOrc refreshes the keyspace and vttablet records from the topology server (default 15s) |
| --topo_consul_lock_delay | duration | LockDelay for consul session. (default 15s) |
| --topo_consul_lock_session_checks | string | List of checks for consul session. (default "serfHealth") |
| --topo_consul_lock_session_ttl | string | TTL for consul session. |
| --topo_consul_watch_poll_duration | duration | time of the long poll for watch queries. (default 30s) |
| --topo_etcd_lease_ttl | int | Lease TTL for locks and leader election. The client will use KeepAlive to keep the lease going. (default 30) |
| --topo_etcd_tls_ca | string | path to the ca to use to validate the server cert when connecting to the etcd topo server |
| --topo_etcd_tls_cert | string | path to the client cert to use to connect to the etcd topo server, requires topo_etcd_tls_key, enables TLS |
| --topo_etcd_tls_key | string | path to the client key to use to connect to the etcd topo server, enables TLS |
| --topo_global_root | string | the path of the global topology data in the global topology server |
| --topo_global_server_address | string | the address of the global topology server |
| --topo_implementation | string | the topology implementation to use |
| --topo_k8s_context | string | The kubeconfig context to use, overrides the 'current-context' from the config |
| --topo_k8s_kubeconfig | string | Path to a valid kubeconfig file. When running as a k8s pod inside the same cluster you wish to use as the topo, you may omit this and the below arguments, and Vitess is capable of auto-discovering the correct values. https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod |
| --topo_k8s_namespace | string | The kubernetes namespace to use for all objects. Default comes from the context or in-cluster config |
| --topo_zk_auth_file | string | auth to use when connecting to the zk topo server, file contents should be <scheme>:<auth>, e.g., digest:user:pass |
| --topo_zk_base_timeout | duration | zk base timeout (see zk.Connect) (default 30s) |
| --topo_zk_max_concurrency | int | maximum number of pending requests to send to a Zookeeper server. (default 64) |
| --topo_zk_tls_ca | string | the server ca to use to validate servers when connecting to the zk topo server |
| --topo_zk_tls_cert | string | the cert to use to connect to the zk topo server, requires topo_zk_tls_key, enables TLS |
| --topo_zk_tls_key | string | the key to use to connect to the zk topo server, enables TLS |
| --v | value | log level for V logs |
| --version | boolean | print binary version |
| --vmodule | value | comma-separated list of pattern=N settings for file-filtered logging |
| --wait-replicas-timeout | duration | Duration for which to wait for replicas to respond when issuing RPCs (default 30s) |
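
Several of the options above are typically combined. The sketch below is illustrative only: it assumes the same etcd topology settings as the usage example, a hypothetical database path /tmp/vtorc.sqlite3 (on the assumption that --sqlite-data-file accepts a plain file path as well as the in-memory URI shown as its default), and a 72h audit retention chosen as a multiple of days. It prints the command rather than starting VTOrc.

```shell
#!/bin/sh
# Sketch only: combine persistence, auditing, and failover options (dry run).
TOPOLOGY_FLAGS="--topo_implementation etcd2 --topo_global_server_address localhost:2379 --topo_global_root /vitess/global"
VTDATAROOT="/tmp"

# Assumptions: the SQLite path is hypothetical, and 72h (3 days) is an
# arbitrary retention that satisfies the "multiples of days" note.
# Boolean flags are passed bare, without a value.
set -- $TOPOLOGY_FLAGS \
  --port 15000 \
  --sqlite-data-file "$VTDATAROOT/vtorc.sqlite3" \
  --audit-to-backend \
  --audit-purge-duration 72h \
  --prevent-cross-cell-failover

# Print the full command for review; run `vtorc "$@"` instead to start VTOrc.
echo vtorc "$@"
```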