Aurora MySQL 8: using pt-online-schema-change to change a column from INT to BIGINT consumes all RAM and crashes the instance

We have a very large InnoDB table that our application uses for logging. It's an integral part of the application, so it requires maintenance (we can't refactor it in the short term).

Table specs:
Number of rows: 871.49 million
Data: 872.37G
Index: 588.60G
Total Size: 1460.96G

We are trying to convert the id auto_increment INT column (the primary key) to BIGINT on our standalone Development Aurora MySQL server before attempting it on prod.
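
For reference, the invocation looks roughly like the sketch below; the endpoint, credentials, database, and table names are placeholders, and --dry-run would be swapped for --execute on a real run:

    # Sketch of the pt-online-schema-change invocation (placeholder names and credentials).
    pt-online-schema-change \
      --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
      --ask-pass \
      --dry-run \
      h=dev-aurora-endpoint,D=app_db,t=app_log,u=admin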

Dev Aurora MySQL:
Standalone
8.0.mysql_aurora.3.03.2
Instance Class: r6g.xlarge
vCPU: 4
RAM: 32GB

Prod Aurora MySQL:
1 Writer, 3 Readers
8.0.mysql_aurora.3.03.2
Instance Class: r6g.4xlarge
vCPU: 16
RAM: 128GB

The table in Dev is comparable to the table in Prod that is fed live data.

From going through the InnoDB-related server status variables over the days before and during the run-up to the crash, I saw the following changes (a sketch of how such daily snapshots can be collected follows the table).

Variable_name 13-Feb 14-Feb 15-Feb 16-Feb
Innodb_data_read 424,669,511,680 521,605,955,584 1,325,900,000,000 2,143,511,166,976
Innodb_buffer_pool_read_requests 1,739,278,367 6,113,415,351 10,143,000,000 12,956,725,512
Innodb_buffer_pool_write_requests 39,782,681 907,451,157 1,843,782,860 2,535,533,175
Innodb_rows_read 1,337,297,480 1,980,224,379 2,374,903,120 2,569,943,643
Innodb_rows_inserted 16,207,281 84,650,132 172,602,958 230,227,039
Innodb_pages_read 25,920,094 31,836,773 80,926,708 130,830,113
Innodb_buffer_pool_reads 1,493,007 5,806,384 52,082,385 101,936,041
Innodb_aurora_shm_read_requests 1,535,188 5,818,343 51,608,251 100,874,765
Innodb_pages_created 254,358 4,747,483 10,178,795 14,049,427
Innodb_system_rows_read 5,205,164 8,052,860 10,568,190 13,685,090
Innodb_logical_read_ahead_page_count 26,085,663 30,172,845 33,562,159 33,719,920
Innodb_rows_updated 597,391 4,325,410 4,325,586 4,325,645
Innodb_aurora_batched_read_requests 3,271,874 3,503,060 4,132,259 4,432,683
Innodb_aurora_shm_batched_read_requests 3,271,872 3,503,058 4,132,256 4,432,680
Innodb_system_rows_updated 87,913 131,890 176,091 218,020
Innodb_buffer_pool_bytes_dirty 2,473,984 2,375,680 2,375,680 2,408,448
Innodb_buffer_pool_bytes_data 20,941,979,648 20,942,012,416 20,942,000,000 20,942,028,800
Innodb_data_reads 189 1,090 5,650 5,650
Innodb_data_writes 189 1,090 5,650 5,650
Innodb_system_rows_inserted 7,697 9,157 10,733 11,791
Innodb_system_rows_deleted 6,986 8,217 9,634 10,618
Innodb_buffer_pool_read_ahead_evicted 5,741,504 5,741,888 5,741,969 5,741,969
Innodb_rows_deleted 21,470 21,885 21,893 21,907
Innodb_num_open_files 2,214 2,223 2,221 2,318
Innodb_row_lock_time 95 99 106 128
Innodb_row_lock_waits 8 11 13 15
Innodb_buffer_pool_pages_dirty 151 145 145 147
Innodb_buffer_pool_pages_data 1,278,197 1,278,199 1,278,200 1,278,200
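
Daily snapshots like the ones above can be captured with a simple loop such as the sketch below; the endpoint and user are placeholders, and it assumes credentials are supplied via ~/.my.cnf or similar:

    # Sketch: dump InnoDB status counters once a day into dated files.
    # Placeholder endpoint/user; assumes ~/.my.cnf provides the password.
    while true; do
        mysql -h dev-aurora-endpoint -u admin \
          -e "SHOW GLOBAL STATUS LIKE 'Innodb%';" \
          > "innodb_status_$(date +%F).txt"
        sleep 86400
    done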

Right now I'm assuming that MySQL is loading every query result into the buffer pool until the system crashes with an OOM. So far I've experimented with drastically lowering the Threads_running thresholds for critical-load and max-load, which helped keep the CPU from spiking, but we still hit an OOM after a couple of days. Then I set chunk-size=1000 and chunk-time=0.5. Same thing.
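
The knobs being varied are the options below; the Threads_running thresholds shown are illustrative rather than our exact values, and the connection details are the same placeholders as in the sketch above:

    # Throttle/chunk options tried so far (Threads_running thresholds are examples only):
    #   --max-load       pause copying while Threads_running exceeds the threshold
    #   --critical-load  abort the tool if Threads_running exceeds the threshold
    #   --chunk-size     rows copied per chunk
    #   --chunk-time     target seconds per copy chunk when the size adjusts dynamically
    pt-online-schema-change \
      --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
      --max-load Threads_running=10 \
      --critical-load Threads_running=25 \
      --chunk-size 1000 \
      --chunk-time 0.5 \
      --ask-pass --execute \
      h=dev-aurora-endpoint,D=app_db,t=app_log,u=admin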

I could add a reader node, but it seems to be the reading from the table that is consuming RAM. Would increasing chunk-time help avoid overwhelming the buffer pool?

Any suggestions? Let me know if you would like me to clarify anything.

This is not very much RAM. At 80%, your buffer pool is only about 25GB. The remaining ~7GB can easily be filled by other queries doing scans, sorts, joins, etc., or by the OS doing its own work. Typically the OOM killer in the Linux kernel kills MySQL once it starts using too much RAM outside the buffer pool. I would increase this dev instance to 64GB, which should give a buffer pool of about 51GB, leaving 13GB for everything else, nearly a 2x “memory buffer” against OOM.
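
To confirm what the buffer pool is actually set to on the dev instance (and therefore how much of the 32GB is left for everything else), a quick check like this works; the endpoint and user are placeholders:

    # Report the configured InnoDB buffer pool size in GiB.
    mysql -h dev-aurora-endpoint -u admin -p \
      -e "SELECT @@innodb_buffer_pool_size / POWER(1024,3) AS buffer_pool_gib;"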

Thanks, we will look into that as one of our options.

Can you make any recommendations about chunk-size and chunk-time?

I've never really changed them myself. The purpose of chunk-time is to let the chunk size grow as load on the MySQL server decreases and shrink as it increases, so the tool can get the job done faster when the server has headroom.
A static chunk-size may be more predictable, but will usually take longer overall.
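
As a rough illustration of the difference (flag values and connection details below are just examples, and my understanding is that an explicitly set --chunk-size is used as a fixed size):

    # Dynamic sizing: target roughly 0.5 seconds per copy chunk; the tool grows
    # the chunk when copies are fast and shrinks it when they slow down.
    pt-online-schema-change --chunk-time=0.5 \
      --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
      --ask-pass --execute h=dev-aurora-endpoint,D=app_db,t=app_log,u=admin

    # Static sizing: copy a fixed 5000 rows per chunk regardless of load.
    pt-online-schema-change --chunk-size=5000 \
      --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
      --ask-pass --execute h=dev-aurora-endpoint,D=app_db,t=app_log,u=admin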