Character set support

Supported character sets and configuration

Overview

Ideally, all textual data should be Unicode. The best character set to use today is utf8mb4 (4-byte UTF-8), the successor to utf8. However, legacy systems may carry non-UTF character sets specific to European, Chinese, or other languages.

VReplication supports copying & streaming across multiple character sets. Moreover, it supports conversion from one character set to another. An important use case is importing from an external data source that uses non-UTF8 encoding, into a UTF8-encoded Vitess cluster.

Unless told otherwise, VReplication assumes the stream’s source and target both use trivial character sets that do not require any special encodings. These are:

  • utf8
  • utf8mb4
  • ascii
  • binary

To be able to work with other character sets:

  • Verify VReplication supports the specific character sets.
  • Tell VReplication which character sets it is converting from and to.

Supported character sets

The list of supported character sets is dynamic and may grow. You will find it under CharacterSetEncoding in https://github.com/vitessio/vitess/blob/main/go/mysql/constants.go

The current list of supported character sets/encodings is:

  • ascii
  • binary
  • cp1250
  • cp1251
  • cp1256
  • cp1257
  • cp850
  • cp852
  • cp866
  • gbk
  • greek
  • hebrew
  • koi8r
  • latin1
  • latin2
  • latin5
  • latin7
  • utf8
  • utf8mb4
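
As a quick pre-flight check, the trivial list from the overview and the supported list above can be combined into a small lookup. This is only a sketch: the function name and the three labels are hypothetical, nothing here is part of Vitess, and the sets are a snapshot of this page rather than of constants.go.

```python
# Character sets VReplication treats as trivial (no conversion needed),
# copied from the overview list on this page.
TRIVIAL_CHARSETS = {"utf8", "utf8mb4", "ascii", "binary"}

# Supported character sets, copied from the list above. The authoritative
# list is CharacterSetEncoding in go/mysql/constants.go and may have grown.
SUPPORTED_CHARSETS = {
    "ascii", "binary", "cp1250", "cp1251", "cp1256", "cp1257",
    "cp850", "cp852", "cp866", "gbk", "greek", "hebrew", "koi8r",
    "latin1", "latin2", "latin5", "latin7", "utf8", "utf8mb4",
}


def classify_charset(name: str) -> str:
    """Return a hypothetical label: 'trivial', 'needs_conversion' or 'unsupported'."""
    name = name.lower()
    if name in TRIVIAL_CHARSETS:
        return "trivial"
    if name in SUPPORTED_CHARSETS:
        return "needs_conversion"
    return "unsupported"


for charset in ("utf8mb4", "latin1", "sjis"):
    print(charset, classify_charset(charset))
```

A column whose character set comes back as needs_conversion is one that requires both a convert(...) in the filter query and a convert_charset entry in the rule.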

Converting/encoding

  • In VReplication’s filter query, make sure to convert all non-trivial character sets to utf8mb4, like so:
select ..., convert(column_name using utf8mb4) as column_name, ...
  • In VReplication’s rule, add one or more convert_charset entries. Each entry is of the form:
convert_charset:{key:"<column_name>" value:{from_charset:"<charset_name>" to_charset:"<charset_name>"}}

Example

In this simplified example, we wish to stream from this source table:

create table source_names (
  id int,
  name varchar(64) charset latin1 collate latin1_swedish_ci,
  primary key(id)
)

And into this target table:

create table target_names (
  id int,
  name varchar(64) charset utf8mb4,
  primary key(id)
)

Note that we wish to convert the name column from latin1 to utf8mb4.

The rule would look like this:

keyspace:"commerce" shard:"0" filter:{
  rules:{
    match:"target_names" 
    filter:"select `id` as `id`, convert(`name` using utf8mb4) as `name` from `source_names`" 
    convert_charset:{key:"name" value:{from_charset:"latin1" to_charset:"utf8mb4"}}
  }
}
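
For tables with many columns, writing the filter query and the matching convert_charset entries by hand is error-prone. The following sketch generates the rules block of the example from a column list. The helper build_rule and its parameters are hypothetical and not part of Vitess; it only assembles text in the format shown above.

```python
def build_rule(target_table, source_table, columns, conversions,
               to_charset="utf8mb4"):
    """Assemble a rules:{...} block as text.

    columns:     ordered list of column names in the source table
    conversions: {column_name: from_charset} for non-trivial columns
    """
    select_items = []
    charset_entries = []
    for col in columns:
        if col in conversions:
            # Non-trivial column: convert in the query and declare it in the rule.
            select_items.append(f"convert(`{col}` using {to_charset}) as `{col}`")
            charset_entries.append(
                f'  convert_charset:{{key:"{col}" value:'
                f'{{from_charset:"{conversions[col]}" to_charset:"{to_charset}"}}}}'
            )
        else:
            select_items.append(f"`{col}` as `{col}`")

    filter_query = f"select {', '.join(select_items)} from `{source_table}`"
    lines = ["rules:{", f'  match:"{target_table}"', f'  filter:"{filter_query}"']
    lines.extend(charset_entries)
    lines.append("}")
    return "\n".join(lines)


print(build_rule("target_names", "source_names",
                 ["id", "name"], {"name": "latin1"}))
```

Called with the tables from this example, the generated filter and convert_charset lines have the same content as the hand-written rule above.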

Internal notes

Right now, to_charset is not actually used in the code. The write works correctly whether or not to_charset is specified, and irrespective of its value. It “just works” because the data is encoded from UTF-8 in the Go plane, via the MySQL connector, onto the specific column. However, future implementations may require an explicit definition of to_charset.
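
The asymmetry between from_charset and to_charset can be illustrated outside Vitess. The sketch below is not Vitess code: it uses Python's latin-1 codec as a stand-in for MySQL's latin1 (the two agree on the byte used here), and shows that the source bytes cannot be read without knowing the source encoding, whereas once the text is Unicode the target encoding is a separate, final step.

```python
# "café" as stored in a latin1 column: é is the single byte 0xE9.
source_bytes = b"caf\xe9"

# Without knowing the source character set, the bytes are not valid UTF-8.
try:
    source_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Knowing the source character set (the role of from_charset) recovers the text.
text = source_bytes.decode("latin-1")

# Writing to a UTF-8 target is then an ordinary encode step.
target_bytes = text.encode("utf-8")

print(decode_failed)   # True
print(text)            # café
print(target_bytes)    # b'caf\xc3\xa9'
```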

As for the filter query, right now it’s the user’s responsibility to identify non-UTF columns in the source table. In the future, Vitess should be able to auto-detect those and automatically select convert(col_name using utf8mb4) as col_name.
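
Until such auto-detection exists, one way to find the columns by hand is to read information_schema.columns on the source MySQL server. The sketch below assumes the rows have already been fetched with some MySQL driver (placeholder syntax depends on the driver). The function name is hypothetical, and the trivial set is taken verbatim from the overview list, so adjust it if your server reports other names.

```python
TRIVIAL_CHARSETS = {"utf8", "utf8mb4", "ascii", "binary"}

# Query to run against the source server. character_set_name is NULL
# for non-text columns such as int.
COLUMNS_QUERY = """
select column_name, character_set_name
from information_schema.columns
where table_schema = %s and table_name = %s
order by ordinal_position
"""


def non_trivial_columns(rows):
    """Map column name -> source charset for columns that need conversion.

    rows: iterable of (column_name, character_set_name) tuples, as
    returned by COLUMNS_QUERY.
    """
    result = {}
    for column_name, charset in rows:
        if charset is None:
            continue  # not a text column
        if charset.lower() in TRIVIAL_CHARSETS:
            continue  # no conversion needed
        result[column_name] = charset.lower()
    return result


# Rows as they would look for the source_names table in the example.
rows = [("id", None), ("name", "latin1")]
print(non_trivial_columns(rows))  # {'name': 'latin1'}
```

Each key in the result is a column that needs a convert(... using utf8mb4) in the filter query, and each value is the from_charset for its convert_charset entry.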

