Ideally, all textual data should be Unicode. The best character set to use today is utf8mb4 (4-byte UTF-8), the successor to utf8. However, legacy systems may carry non-UTF character sets specific to European, Chinese, or other languages.
VReplication supports copying and streaming across multiple character sets, and it can convert from one character set to another. An important use case is importing from an external data source that uses a non-UTF8 encoding into a UTF8-encoded Vitess cluster.
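As a rough illustration of what converting between character sets means at the byte level, the Python sketch below decodes bytes from one encoding and re-encodes them in another. It is not VReplication code: the helper name convert_bytes and the sample string are made up for this example, and the codec names are Python's ("latin-1", "utf-8"), which do not map one-to-one onto MySQL character set names.

```python
def convert_bytes(raw: bytes, from_charset: str, to_charset: str) -> bytes:
    """Decode raw bytes using from_charset, then re-encode them as to_charset."""
    text = raw.decode(from_charset)   # bytes -> Unicode string
    return text.encode(to_charset)    # Unicode string -> bytes in the target charset


# The same text occupies different bytes in each encoding.
latin1_bytes = "café".encode("latin-1")                        # 4 bytes
utf8_bytes = convert_bytes(latin1_bytes, "latin-1", "utf-8")   # 5 bytes
```

Reading the latin-1 bytes as if they were already UTF-8 would fail or corrupt the text, which is why the source character set has to be known before conversion.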
Unless told otherwise, VReplication assumes the stream’s source and target both use trivial character sets that do not require any special encodings. These are:
To be able to work with other character sets:
Verify that VReplication supports the specific character sets.
Tell VReplication which character sets it is converting from and to.
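The second step can be pictured as a per-column mapping from source to target character set. The Python sketch below only models that idea as a plain dictionary. The helper conversion_rule, the table and column names, and the field names match, convert_charset, and from_charset are assumptions made for illustration (only to_charset appears in this text), so check the actual rule definition in your Vitess version before relying on them.

```python
import json


def conversion_rule(table, conversions):
    """Describe per-column charset conversions for one table.

    conversions maps column name -> (source charset, target charset).
    Field names here are illustrative assumptions, not a confirmed Vitess schema.
    """
    return {
        "match": table,
        "convert_charset": {
            col: {"from_charset": src, "to_charset": dst}
            for col, (src, dst) in conversions.items()
        },
    }


# Hypothetical table "customer" with one latin1 column, "name".
rule = conversion_rule("customer", {"name": ("latin1", "utf8mb4")})
print(json.dumps(rule, indent=2))
```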
Right now, to_charset is not actually used in the code. The write works correctly whether or not to_charset is specified, and irrespective of its value. It “just works” because the data is held as utf8 on the Go side and is encoded, via the MySQL connector, onto the specific column. However, future implementations may require an explicit definition of to_charset.
As for the filter query, right now it is the user’s responsibility to identify non-UTF columns in the source table. In the future, Vitess should be able to auto-detect those and automatically select convert(col_name using utf8mb4) as col_name.
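Until such auto-detection exists, the filter query has to be written by hand. The Python sketch below shows one way to assemble it from a list of columns, wrapping only the columns the user has identified as non-UTF. The helper build_filter_query and the table and column names are hypothetical, and the sketch does not quote or escape identifiers.

```python
def build_filter_query(table, columns, non_utf_columns):
    """Build a select that converts the listed non-UTF columns to utf8mb4."""
    non_utf = set(non_utf_columns)
    unknown = non_utf - set(columns)
    if unknown:
        raise ValueError(f"columns not in table column list: {sorted(unknown)}")
    select_exprs = [
        f"convert({col} using utf8mb4) as {col}" if col in non_utf else col
        for col in columns
    ]
    return f"select {', '.join(select_exprs)} from {table}"


# Hypothetical table "customer" where only "name" uses a non-UTF charset.
query = build_filter_query("customer", ["id", "name", "email"], ["name"])
# query == "select id, convert(name using utf8mb4) as name, email from customer"
```

Columns that already use a UTF character set pass through unchanged, so only the identified columns pay the cost of conversion.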