Character set conversion at the table level

I have a database that has historically been latin1. All of its tables are utf8 except one, which is still latin1, and I need to convert it to utf8. There are two approaches. Per the MySQL docs, both block DML operations. The second approach can run INPLACE and is faster, but it is still a DML blocker; the first works only with COPY, not INPLACE. These tests were done on Aurora MySQL 5.7.

  1. ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;
  2. ALTER TABLE tbl_name CHARACTER SET = utf8;

I tried both methods on two different clones. The CONVERT is a long process (1.5 hours on a 35M-row table in RDS), but it worked correctly: after the conversion, updating an existing value with a utf8 string behaved as expected. The second approach ran instantaneously (0.12 secs), understandably, since it only updates metadata, but it didn't do the right thing: when I updated an existing value with a utf8 string, it produced garbage. What's missing? latin1 is a subset of utf8, so the second approach should have worked. With the table definition now showing utf8, shouldn't future values be saved correctly? That doesn't seem to be the case.
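One way to see what each ALTER actually did is to inspect the column-level character set, which SHOW CREATE TABLE and information_schema both expose. A minimal sketch (mydb/mytable are placeholder names):

```sql
-- Show the table definition, including any per-column CHARACTER SET clauses
SHOW CREATE TABLE mytable;

-- Or query information_schema for the effective charset of each column.
-- After a metadata-only "CHARACTER SET = utf8", existing columns still
-- report latin1 here, while CONVERT TO changes them to utf8.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_NAME   = 'mytable';
```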

Additionally:

  1. If the first method is the only correct approach, what is the purpose of the second? Is it meant to be used only on a newly created table?
  2. If all existing characters are latin1 (1 byte each), which is a subset of utf8 (up to 3 bytes per character), what does the conversion do? Even in utf8, won't a character that was originally a single byte still be stored in 1 byte, so what does the conversion physically do?
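One way to observe what the conversion physically does is to compare byte lengths before and after. Plain ASCII characters are encoded identically in latin1 and utf8, but latin1 characters above 0x7F are re-encoded as two bytes in utf8. A sketch (the table name charset_demo is hypothetical):

```sql
-- Table created with latin1 as its character set
CREATE TABLE charset_demo (s VARCHAR(10)) DEFAULT CHARSET=latin1;
INSERT INTO charset_demo VALUES ('abc'), ('é');

-- In latin1, every character is one byte:
-- LENGTH() counts bytes, CHAR_LENGTH() counts characters,
-- so 'abc' -> 3 bytes / 3 chars and 'é' -> 1 byte / 1 char
SELECT s, LENGTH(s) AS bytes, CHAR_LENGTH(s) AS chars FROM charset_demo;

ALTER TABLE charset_demo CONVERT TO CHARACTER SET utf8;

-- After conversion, ASCII still takes one byte per character,
-- but 'é' has been re-encoded as two bytes (0xC3 0xA9):
-- 'abc' -> 3 bytes / 3 chars, 'é' -> 2 bytes / 1 char
SELECT s, LENGTH(s) AS bytes, CHAR_LENGTH(s) AS chars FROM charset_demo;
```

So for a purely ASCII table the row data is byte-identical, but CONVERT TO still rewrites the table (COPY) because any non-ASCII latin1 bytes must be re-encoded and the column metadata must change.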


Hi Venkat, I suggest you consider the gh-ost or pt-online-schema-change tools to convert the tables to utf8mb4. That way you won't block DML for the duration of the conversion.


Thanks igroene. This database has downstream binlog-based external replication within AWS, and I'm not sure what impact those tools would have on it. But I'm really curious: when would the second method ever be used? It seems that even for future data it does not honor the new character set.


Re: downstream replication, you'd need to run some tests to make sure nothing breaks; it's hard to say without knowing the details.
As for the second method, it is supposed to change the default character set for the table. I'm not sure what the effect is if you update a column in place, but I wouldn't mix data in two different charsets in the same table even if it were possible. I believe it may be meant to be used only on an empty table.


Yeah, that's what I thought too. Interestingly, AWS support (not Professional Services) also recommended the second approach. But well, with direct experience, now I know. 🙂

Thank you!


Hi Igroene,

I figured it out with a little test. The second method merely changes metadata: the table's default character set, which applies only to columns added in the future (not to existing columns or values). So if a new column is added, it takes utf8 as its character set. That explains why it doesn't do the right thing for existing data, as opposed to the first method.

This table was originally latin1, and I applied method 2. The table now shows utf8, but the original column is explicitly marked latin1, while a new column added after method 2 takes utf8. Mystery solved!

CREATE TABLE `mytest` (
  `a` varchar(100) CHARACTER SET latin1 DEFAULT NULL,
  `b` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
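For reference, a sketch of the statements that should reproduce the definition above, assuming the table started as latin1 with only column a:

```sql
-- Original table, created with latin1 as the table default
CREATE TABLE mytest (a VARCHAR(100) DEFAULT NULL) DEFAULT CHARSET=latin1;

-- Method 2: metadata-only change of the table default charset.
-- MySQL preserves the stored data by pinning the existing column
-- to its old charset with an explicit CHARACTER SET latin1 clause.
ALTER TABLE mytest CHARACTER SET = utf8;

-- A column added afterwards picks up the new table default (utf8)
ALTER TABLE mytest ADD COLUMN b VARCHAR(100) DEFAULT NULL;

-- Shows column a as latin1 and the table default as utf8
SHOW CREATE TABLE mytest;
```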

Thanks.


Thanks for posting this info, good to know!
