Aurora MySQL 8: using pt-online-schema-change to change a column from INT to BIGINT consumes all RAM and crashes the instance

We have a very large InnoDB table that our application uses for logging. It's an integral part of the application, so it requires maintenance (we can't refactor it in the short term).

Table specs:
Number of rows: 871.49 million
Data: 872.37G
Index: 588.60G
Total Size: 1460.96G

We are trying to convert the id auto_increment INT column (the primary key) to BIGINT on our standalone Development Aurora MySQL server before attempting it on prod.
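
For reference, the invocation looks roughly like the sketch below; the endpoint, credentials, database, and table names are placeholders, and --dry-run would be swapped for --execute on a real run:

    # Sketch of the pt-online-schema-change invocation (placeholder names and credentials).
    pt-online-schema-change \
      --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
      --ask-pass \
      --dry-run \
      h=dev-aurora-endpoint,D=app_db,t=app_log,u=admin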

Dev Aurora MySQL:
Standalone
8.0.mysql_aurora.3.03.2
Instance Class: r6g.xlarge
vCPU: 4
RAM: 32GB

Prod Aurora MySQL:
1 Writer, 3 Readers
8.0.mysql_aurora.3.03.2
Instance Class: r6g.4xlarge
vCPU: 16
RAM: 128GB

The table in Dev is comparable to the table in Prod that is fed live data.

From going through the InnoDB-related server status variables over the days before and during the run-up to the crash, I saw the following changes (a sketch of how such daily snapshots can be collected follows the table).

Variable_name 13-Feb 14-Feb 15-Feb 16-Feb
Innodb_data_read 424,669,511,680 521,605,955,584 1,325,900,000,000 2,143,511,166,976
Innodb_buffer_pool_read_requests 1,739,278,367 6,113,415,351 10,143,000,000 12,956,725,512
Innodb_buffer_pool_write_requests 39,782,681 907,451,157 1,843,782,860 2,535,533,175
Innodb_rows_read 1,337,297,480 1,980,224,379 2,374,903,120 2,569,943,643
Innodb_rows_inserted 16,207,281 84,650,132 172,602,958 230,227,039
Innodb_pages_read 25,920,094 31,836,773 80,926,708 130,830,113
Innodb_buffer_pool_reads 1,493,007 5,806,384 52,082,385 101,936,041
Innodb_aurora_shm_read_requests 1,535,188 5,818,343 51,608,251 100,874,765
Innodb_pages_created 254,358 4,747,483 10,178,795 14,049,427
Innodb_system_rows_read 5,205,164 8,052,860 10,568,190 13,685,090
Innodb_logical_read_ahead_page_count 26,085,663 30,172,845 33,562,159 33,719,920
Innodb_rows_updated 597,391 4,325,410 4,325,586 4,325,645
Innodb_aurora_batched_read_requests 3,271,874 3,503,060 4,132,259 4,432,683
Innodb_aurora_shm_batched_read_requests 3,271,872 3,503,058 4,132,256 4,432,680
Innodb_system_rows_updated 87,913 131,890 176,091 218,020
Innodb_buffer_pool_bytes_dirty 2,473,984 2,375,680 2,375,680 2,408,448
Innodb_buffer_pool_bytes_data 20,941,979,648 20,942,012,416 20,942,000,000 20,942,028,800
Innodb_data_reads 189 1,090 5,650 5,650
Innodb_data_writes 189 1,090 5,650 5,650
Innodb_system_rows_inserted 7,697 9,157 10,733 11,791
Innodb_system_rows_deleted 6,986 8,217 9,634 10,618
Innodb_buffer_pool_read_ahead_evicted 5,741,504 5,741,888 5,741,969 5,741,969
Innodb_rows_deleted 21,470 21,885 21,893 21,907
Innodb_num_open_files 2,214 2,223 2,221 2,318
Innodb_row_lock_time 95 99 106 128
Innodb_row_lock_waits 8 11 13 15
Innodb_buffer_pool_pages_dirty 151 145 145 147
Innodb_buffer_pool_pages_data 1,278,197 1,278,199 1,278,200 1,278,200
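
Daily snapshots like the ones above can be captured with a simple loop such as the sketch below; the endpoint and user are placeholders, and it assumes credentials are supplied via ~/.my.cnf or similar:

    # Sketch: dump InnoDB status counters once a day into dated files.
    # Placeholder endpoint/user; assumes ~/.my.cnf provides the password.
    while true; do
        mysql -h dev-aurora-endpoint -u admin \
          -e "SHOW GLOBAL STATUS LIKE 'Innodb%';" \
          > "innodb_status_$(date +%F).txt"
        sleep 86400
    done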

Right now I'm assuming that MySQL is loading every query result into the buffer pool until the system crashes with an OOM. So far I've experimented with drastically lowering the Threads_running thresholds for critical-load and max-load, which helped keep the CPU from spiking, but we still hit an OOM after a couple of days. Then I set chunk-size=1000 and chunk-time=0.5. Same thing.
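
The knobs being varied are the options below; the Threads_running thresholds shown are illustrative rather than our exact values, and the connection details are the same placeholders as in the sketch above:

    # Throttle/chunk options tried so far (Threads_running thresholds are examples only):
    #   --max-load       pause copying while Threads_running exceeds the threshold
    #   --critical-load  abort the tool if Threads_running exceeds the threshold
    #   --chunk-size     rows copied per chunk
    #   --chunk-time     target seconds per copy chunk when the size adjusts dynamically
    pt-online-schema-change \
      --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
      --max-load Threads_running=10 \
      --critical-load Threads_running=25 \
      --chunk-size 1000 \
      --chunk-time 0.5 \
      --ask-pass --execute \
      h=dev-aurora-endpoint,D=app_db,t=app_log,u=admin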

I could add a reader node, but it seems to be the reading from the table that is consuming RAM. Would increasing chunk-time help avoid overwhelming the buffer pool?

Any suggestions? Let me know if you would like me to clarify anything.

This is not very much RAM. At 80%, your buffer pool is only about 25GB. The remaining ~7GB can easily be filled by other queries doing scans, sorts, joins, etc., or by the OS doing its own work. Typically the OOM killer in the Linux kernel kills MySQL once it starts using too much RAM outside the buffer pool. I would increase this dev instance to 64GB, which should give a buffer pool of about 51GB, leaving 13GB for everything else, nearly a 2x “memory buffer” against OOM.
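
To confirm what the buffer pool is actually set to on the dev instance (and therefore how much of the 32GB is left for everything else), a quick check like this works; the endpoint and user are placeholders:

    # Report the configured InnoDB buffer pool size in GiB.
    mysql -h dev-aurora-endpoint -u admin -p \
      -e "SELECT @@innodb_buffer_pool_size / POWER(1024,3) AS buffer_pool_gib;"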

Thanks, we will look into that as one of our options.

Can you make any recommendations about chunk-size and chunk-time?

I've never really changed them myself. The purpose of chunk-time is to let the chunk size grow as load on the MySQL server decreases and shrink as it increases, so the tool can get the job done faster when the server has headroom.
A static chunk-size may be more predictable, but will usually take longer overall.
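
As a rough illustration of the difference (flag values and connection details below are just examples, and my understanding is that an explicitly set --chunk-size is used as a fixed size):

    # Dynamic sizing: target roughly 0.5 seconds per copy chunk; the tool grows
    # the chunk when copies are fast and shrinks it when they slow down.
    pt-online-schema-change --chunk-time=0.5 \
      --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
      --ask-pass --execute h=dev-aurora-endpoint,D=app_db,t=app_log,u=admin

    # Static sizing: copy a fixed 5000 rows per chunk regardless of load.
    pt-online-schema-change --chunk-size=5000 \
      --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
      --ask-pass --execute h=dev-aurora-endpoint,D=app_db,t=app_log,u=admin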