We are upgrading from MySQL 5.7 to MySQL 8. As part of the process, we would like to change utf8 (utf8mb3) to utt8mb4 collation utf8mb4_0900_ai_ci. There are two ways to do it:
- ALTER TABLE <table_name> CONVERT TO CHARACTER SET <charset_name> COLLATE <collation_name> ;
- ALTER TABLE <table_name> CHARACTER SET <charset_name> COLLATE <collation_name>;
Current understanding is that the (1) will physically convert the character sets to fit into the new character number of bytes (utf8mb3 → 3 bytes, utf8mb4 → 4 bytes) and collation while (2) will merely set character bytes usage (character set) and ordering (collation) for new rows created or updated but the existing rows will not be touched.
Is this understanding correct?
Hello @Venkat welcome back,
Thank you for sharing this observation and you’re right. The easy answer is “Yes” but let me write my understanding derived from mysql documentation.
ALTER TABLE <table_name> CONVERT TO CHARACTER SET <charset_name> COLLATE <collation_name>
will physically convert the character sets of the table and its columns to fit into the new character set’s number of bytes. It will also change the collation of all character columns.
ALTER TABLE <table_name> CHARACTER SET <charset_name> COLLATE <collation_name>
will only set the character set and collation for new rows that are created or updated in the table. It will not touch the existing rows’ character set or collation. Note that “DEFAULT” is implicit here and that’s why it just affects to new changes.
In addition to that,
we can also note that if you specify
CONVERT TO CHARACTER SET binary, the CHAR, VARCHAR, and TEXT columns will be converted to their corresponding binary string types (BINARY, VARBINARY, BLOB). This is important to ensure that the new column is long enough to store as many characters as the original column.
To avoid data type changes during character set conversion, it is recommended to use the
MODIFY clause instead of
CONVERT TO CHARACTER SET for individual columns. This allows you to change the character set and collation without altering the data type.
Thank you for confirming and the additional information. I didn’t know of the data type change when using convert. I then tried this on one table. I see the TEXT type columns were changed to MEDIUMTEXT.type columns but then it didn’t change the character set or collation for those types. Is this data type change is what you meant?
However, when I attempted alter table modify character set collate on another table with same structure. I see it did not change the data type of TEXT column but it did change the character set and collation for this column.
One additional think I noticed is that if we did alter table…character set … collate…;, it works and will have new character set and collation for future rows. However the metadata in the information_schema still holds the original character set and collation for the table columns. This will mislead people looking into information_schema for table/column definition. If we do the character set and collation at the column level, exactly as you specified, it correctly retains the data type and also reflects in the metadata.
So here is the summary:
- alter table convert – physically changes data but risk of changing data types, also exceeding the size unless using binary version (String) as you suggested.
- alter table character set – No change to physical data, but future data will be stored this way. The metadata in the information_schema does not reflect the character set change. Hence this must be avoided
- alter table modify column character set – Preferred way. Retains data type, updates the metadata for the character set, collation change, will no change existing physical data. The only caveat is to generate the alter table modify column statements as pretty much all column properties will have to be reconstructed from the metadata (type, size/scale, precision, NULL/NOT NULL, uniqueness etc).