Analyzing SQL statements in bulk

Introduction #

This document covers the way the VTexplain tool can be used to evaluate if Vitess is compatible with a list of SQL statements. Enabling the evaluation of if queries from an existing application that accesses a MySQL database are generally Vitess-compatible. If there are any issues identified they can be used to target any necessary application changes needed for a successful migration from MySQL to Vitess.

Prerequisites #

You can find a prebuilt binary version of the VTexplain tool in the most recent release of Vitess.

You can also build the vtexplain binary in your environment. To build this binary, refer to the build guide for your OS.

Overview #

To analyze multiple SQL queries and determine how, or if, Vitess executes each statement, follow these steps:

Gather the queries from your current MySQL database environment
Filter out specific queries
Populate fake values for your queries
Run the VTexplain tool via a script
Add your SQL schema
Add your VSchema to the output file
Run the VTexplain tool and capture the output
Check your output for errors

1. Gather the queries from your current MySQL database environment #

These queries should be most, if not all, of the queries that are sent to your current MySQL database over an extended period of time. You may need to record your queries for days or weeks depending on the nature of your application(s) and workload. You will need to normalize the queries you will be analyzing. Depending on the scope and complexity of your applications you may have a few hundred to thousands of distinct normalized queries. To obtain normalized queries you can use common MySQL monitoring tools like VividCortex, Monyog or PMM.

It is also possible to use the MySQL general query log feature to capture raw queries and then normalize it using post-processing.

2. Filter out specific queries #

Remove from your list any unsupported queries or queries from non-application sources. The following are examples of queries to remove are:

LOCK/UNLOCK TABLES - These likely come from schema management tools, which VTGate obviates.
FLUSH/PURGE LOGS - Vitess performs its own log management.
performance_schema queries - These queries are not supported by Vitess.
BEGIN/COMMIT - Vitess supports these statements, but VTexplain does not.

The following is an example pipeline to filter out these specific queries:

cat queries.txt \
 | grep -v performance_schema \
 | grep -v information_schema \
 | grep -v @@ \
 | grep -v "SELECT ? $" \
 | grep -v "PURGE BINARY" \
 | grep -v "^SET" \
 | grep -v "^EXPLAIN" \
 | grep -v ^query \
 | grep -v ^BEGIN \
 | grep -v ^COMMIT \
 | grep -v ^FLUSH \
 | grep -v ^LOCK \
 | grep -v ^UNLOCK \
 | grep -v mysql > queries_for_vtexplain.txt

3. Populate fake values for your queries #

Once the queries are normalized in prepared statement style, populate fake values to allow VTexplain to run properly. This is because vtexplain operates only on concrete (or un-normalized) queries. Doing this by textual substitution is shown below and typically requires some trial and error. An alternative is to use a MySQL monitoring tool. This tool sometimes has a feature where it can provide one concrete query example for every normalized query form, which is ideal for this purpose.

If you need to use textual substitution to obtain your concrete queries, the following is an example pipeline you can run:

cat queries.txt \
 | perl -p -e 's#\? = \?#1 = 1#g' \
 | perl -p -e 's#= \?#="1"#g' \
 | perl -p -e 's#LIMIT \?#LIMIT 1#g' \
 | perl -p -e 's#\> \?#> "1"#g' \
 | perl -p -e 's#IN \(\?\)#IN (1)#g' \
 | perl -p -e 's#\? AS ONE#1 AS ONE#g' \
 | perl -p -e 's#BINARY \?#BINARY \"1\"#g' \
 | perl -p -e 's#\< \?#< "2"#g' \
 | perl -p -e 's#, \?#, "1"#g' \
 | perl -p -e 's#VALUES \(...\)#VALUES \(1,2\)#g' \
 | perl -p -e 's#IN \(\.\.\.\)#IN \(1,2\)#g' \
 | perl -p -e 's#\- \? #\- 50 #g' \
 | perl -p -e 's#BETWEEN \? AND \?#BETWEEN 1 AND 10#g' \
 | perl -p -e 's#LIKE \? #LIKE \"1\" #g' \
 | perl -p -e 's#OFFSET \?#OFFSET 1#g' \
 | perl -p -e 's#\?, \.\.\.#\"1\", \"2\"#g' \
 | perl -p -e 's#\/ \? #\/ \"1\" #g' \
 | perl -p -e 's#THEN \? ELSE \?#THEN \"2\" ELSE \"3\"#g' \
 | perl -p -e 's#THEN \? WHEN#THEN \"4\" WHEN#g' \
 | perl -p -e 's#SELECT \? FROM#SELECT \"6\" FROM#g' \
 | perl -p -e 's#SELECT \?  AS#SELECT id AS#g' \
 | perl -p -e 's#\`DAYOFYEAR\` \(\?\)#DAYOFYEAR \("2020-01-20"\)#g' \
 | perl -p -e 's#YEAR \(\?\)#YEAR \("2020-01-01"\)#g' \
 | grep -v mysql > queries_for_vtexplain.txt

4. Run the VTexplain tool via a script #

In order to analyze every query in your list, create and run a script. The following is an example Python script that assumes a sharded database with 4 shards. You can adjust this script to match your individual requirements.

$ cat testfull.py
for line in open("queries_for_vtexplain.txt", "r").readlines():
    sql = line.strip()
    print("vtexplain -schema-file schema.sql -vschema-file vschema.json -shards 4 -sql '%s'" % sql)
x
$ python testfull.py > run_vtexplain.sh

An alternative method is to use the -sql-file option for vtexplain to pass the whole file to a single vtexplain invocation. This is much more efficient, but we have found that it can be easier to find errors if you perform one vtexplain invocation per SQL query.

If you choose to use the single invocation, it would look something like:

$ vtexplain -schema-file schema.sql -vschema-file vschema.json -shards 4 -sql-file queries_for_vtexplain.txt

5. Add your SQL schema to the output file #

Add your proposed SQL schema to the file created by the script (e.g. schema.sql). The following is an example SQL schema:

$ cat schema.sql
CREATE TABLE `user` (
  `user_id` bigint(20) NOT NULL,
  `name` varchar(128) DEFAULT NULL,
  `balance` decimal(13,4) DEFAULT NULL,
  PRIMARY KEY (`user_id`),
  KEY `balance` (`balance`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

6. Add your VSchema #

Add your VSchema to the file created by the script: in this example, the file is named schema.json. The following is an example VSchema to match the example SQL schema above.

$ cat vschema.json
{
    "ks1": {
        "sharded": true,
        "vindexes": {
            "hash": {
                "type": "hash"
            }
        },
        "tables": {
            "user": {
                "column_vindexes": [
                    {
                        "column": "user_id",
                        "name": "hash"
                    }
                ]
            }
        }
    }
}

Note that unlike the VSchema used in Vitess, e.g. in vtctlclient GetVSchema and vtctlclient ApplyVSchema, the format required by vtexplain differs slightly. There is an extra level of JSON objects at the top-level of the JSON format to allow you to have a single file that represents the VSchema for multiple Vitess keyspaces. In the above example, there is just a single keyspace called ks1.

7. Run the VTexplain tool and capture the output #

This step will generate the output you need to analyze to determine what queries may have issues with your proposed VSchema. It may take a long time to finish if you have a number of queries.

$ sh -x run_vtexplain.sh 2> vtexplain.output

8. Check your output #

Once you have your full output in vtexplain.output, use grep to search for the string "ERROR" to review any issues that VTExplain found.

Example: Scatted across shards #

In the following example, VTGate scatters the example query across both shards, and then aggregates the query results.

$ vtexplain -schema-file schema.sql -vschema-file vschema.json -shards 2 -sql 'SELECT * FROM user;'
----------------------------------------------------------------------
SELECT * FROM user

1 ks1/-40: SELECT * FROM user limit 10001
1 ks1/40-80: SELECT * FROM user limit 10001
1 ks1/80-c0: SELECT * FROM user limit 10001
1 ks1/c0-: SELECT * FROM user limit 10001
----------------------------------------------------------------------

This is not an error, but illustrates a few things about the query:

The query of this type will be scattered across all 4 the shards, given the schema and VSchema.
The phases of the scatter operation will occur in parallel. This is because the number 1 on the left-hand-side of the output indicates the ordering of the operations in time. The same number indicates parallel processing.
The implicit Vitess row limit of 10000 rows is also seen, even though that was not present in the original query.

Example: Query returns an error #

The following query produces an error because Vitess does not support the AVG function for scatter queries across multiple shards.

$ vtexplain -schema-file schema.sql -vschema-file vschema.json -shards 4 -sql 'SELECT AVG(balance) FROM user;'
ERROR: vtexplain execute error in 'SELECT AVG(balance) FROM user': unsupported: in scatter query: complex aggregate expression

Example: Targeting a single shard #

The following query only targets a single shard because the query supplies the sharding key.

$ vtexplain -schema-file schema.sql -vschema-file vschema.json -shards 2 -sql 'SELECT * FROM user WHERE user_id = 100;'
----------------------------------------------------------------------
SELECT * FROM user WHERE user_id = 100

1 ks1/80-c0: SELECT * FROM user WHERE user_id = 100 limit 10001
----------------------------------------------------------------------

<< Analyzing a SQL statement