Character set conversion at the table level

I have a database that has historically been latin1. All of its tables are utf8 except one, which is still latin1, and I need to convert it to utf8. There are two approaches. Per the MySQL docs, both block DML operations. The second approach can run INPLACE and is faster, but it is still a DML blocker; the first works only with COPY, not INPLACE. These tests were done on Aurora MySQL 5.7.

  1. ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;
  2. ALTER TABLE tbl_name CHARACTER SET = utf8;

I tried both methods on two different clones. The CONVERT is a long process (1.5 hours on a 35M-row table in RDS), but it worked correctly: after the conversion, updating an existing value with a utf8 string behaved as expected. The second approach ran instantaneously (0.12 secs), understandably, since it only updates metadata, but it didn't do the right thing: when I updated an existing value with a utf8 string, it produced garbage. What's missing? latin1 is a subset of utf8, so the second approach should have worked. With the table definition now showing utf8, shouldn't future values be saved correctly? That doesn't seem to be the case.
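One way to see what each ALTER actually did is to inspect the column-level character set, which SHOW CREATE TABLE and information_schema both expose. A minimal sketch (mydb/mytable are placeholder names):

```sql
-- Show the table definition, including any per-column CHARACTER SET clauses
SHOW CREATE TABLE mytable;

-- Or query information_schema for the effective charset of each column.
-- After a metadata-only "CHARACTER SET = utf8", existing columns still
-- report latin1 here, while CONVERT TO changes them to utf8.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_NAME   = 'mytable';
```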

Additionally:

  1. If the first method is the only correct approach, what is the purpose of the second? Is it meant to be used only on a newly created table?
  2. If all existing characters are latin1 (1 byte each), which is a subset of utf8 (up to 3 bytes per character), what does the conversion do? Even in utf8, won't a character that was originally a single byte still be stored in 1 byte, so what does the conversion physically do?
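One way to observe what the conversion physically does is to compare byte lengths before and after. Plain ASCII characters are encoded identically in latin1 and utf8, but latin1 characters above 0x7F are re-encoded as two bytes in utf8. A sketch (the table name charset_demo is hypothetical):

```sql
-- Table created with latin1 as its character set
CREATE TABLE charset_demo (s VARCHAR(10)) DEFAULT CHARSET=latin1;
INSERT INTO charset_demo VALUES ('abc'), ('é');

-- In latin1, every character is one byte:
-- LENGTH() counts bytes, CHAR_LENGTH() counts characters,
-- so 'abc' -> 3 bytes / 3 chars and 'é' -> 1 byte / 1 char
SELECT s, LENGTH(s) AS bytes, CHAR_LENGTH(s) AS chars FROM charset_demo;

ALTER TABLE charset_demo CONVERT TO CHARACTER SET utf8;

-- After conversion, ASCII still takes one byte per character,
-- but 'é' has been re-encoded as two bytes (0xC3 0xA9):
-- 'abc' -> 3 bytes / 3 chars, 'é' -> 2 bytes / 1 char
SELECT s, LENGTH(s) AS bytes, CHAR_LENGTH(s) AS chars FROM charset_demo;
```

So for a purely ASCII table the row data is byte-identical, but CONVERT TO still rewrites the table (COPY) because any non-ASCII latin1 bytes must be re-encoded and the column metadata must change.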


Hi Venkat, I suggest you consider the gh-ost or pt-online-schema-change tools to convert the tables to utf8mb4. That way you won't block DML for the duration of the conversion.


Thanks igroene. This database has downstream binlog-based external replication within AWS, and I'm not sure what impact those tools would have on it. But I'm really curious: when would the second method ever be used? It seems that even for future data it does not honor the new character set.


Re: downstream replication, you'd need to run some tests to make sure nothing breaks; it's hard to say without knowing the details.
As for the second method, it is supposed to change the default character set for the table. I'm not sure what the effect is if you update a column in place, but I wouldn't mix data in two different charsets in the same table even if it were possible. I believe it may be meant to be used only on an empty table.


Yeah, that's what I thought too. Interestingly, AWS support (not Professional Services) also recommended the second approach. But well, with direct experience, now I know. 🙂

Thank you!


Hi Igroene,

I figured it out with a little test. The second method merely changes metadata: the table's default character set, which applies only to columns added in the future (not to existing columns or values). So if a new column is added, it takes utf8 as its character set. That explains why it doesn't do the right thing for existing data, as opposed to the first method.

This table was originally latin1, and I applied method 2. The table now shows utf8, but the original column is explicitly marked latin1, while a new column added after method 2 takes utf8. Mystery solved!

CREATE TABLE `mytest` (
  `a` varchar(100) CHARACTER SET latin1 DEFAULT NULL,
  `b` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
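For reference, a sketch of the statements that should reproduce the definition above, assuming the table started as latin1 with only column a:

```sql
-- Original table, created with latin1 as the table default
CREATE TABLE mytest (a VARCHAR(100) DEFAULT NULL) DEFAULT CHARSET=latin1;

-- Method 2: metadata-only change of the table default charset.
-- MySQL preserves the stored data by pinning the existing column
-- to its old charset with an explicit CHARACTER SET latin1 clause.
ALTER TABLE mytest CHARACTER SET = utf8;

-- A column added afterwards picks up the new table default (utf8)
ALTER TABLE mytest ADD COLUMN b VARCHAR(100) DEFAULT NULL;

-- Shows column a as latin1 and the table default as utf8
SHOW CREATE TABLE mytest;
```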

Thanks.


Thanks for posting this info, good to know!
