We came across an issue with Tungsten replicator where it was incorrectly interpreting multi-byte characters. After some deeper investigation, it looks like it’s not an issue with Tungsten, but with the way data is being written to the binlogs by MySQL. The following illustrates the problem:
I created a table in mysql (Percona 5.6). The default character set of the server is latin1. It has row level logging.
create table mt_test(machine_name
varchar(64) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL) default charset=utf8;
I did a setnames and ran an insert of a right-hand upper quote (unicode hex=E28099):
mysql> set names utf8;
mysql> insert into mt_test select machine_name from play_wide where id=13130;
Here we confirm the value of the quote:
mysql> select machine_name, hex(substring(machine_name,8,1)) from mt_test;
±---------------------------------±---------------------------------+
| machine_name | hex(substring(machine_name,8,1)) |
±---------------------------------+
| Treessa’s MacBook Pro |E28099 |
±---------------------------------+ ----------------------------------+
For reference:
mysql> show variables like ‘%char%’;
±-------------------------±---------------------------+
| Variable_name | Value |
±-------------------------±---------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
±-------------------------±---------------------------+
But in the binlog I see the character misinterpreted as though it were latin1 where the unicode bytes get split into 3 separate characters:
/!\C utf8 //!/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/!/;
BEGIN
/!/;
at 1010181
#170317 19:13:39 server id 18913 end_log_pos 1010239 CRC32 0x061bfc35 Table_map: usage_data
.mt_test
mapped to number 96
at 1010239
#170317 19:13:39 server id 18913 end_log_pos 1010299 CRC32 0xe903ce6f Write_rows: table id 96 flags: STMT_END_F
INSERT INTO usage_data
.mt_test
SET
@1=‘Treessaâ<80><99>s MacBook Pro’
I’ve tried various permutations of configuration settings and table settings, but can’t seem to make the log write the correct value. Note that the OS locale is UTF8. Any ideas of how this can be controlled?