We have a very large InnoDB table that our application uses for logging. It is an integral part of the application, so it requires maintenance (we can't refactor it in the short term).
Table specs:
Number of rows: 871.49 million
Data: 872.37G
Index: 588.60G
Total Size: 1460.96G
We are trying to convert the id column (an AUTO_INCREMENT INT primary key) to BIGINT on our standalone development Aurora MySQL instance before attempting the same change in prod.
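For reference, the target change is effectively the following (a sketch only; the schema and table names are placeholders, and whether the new column should be UNSIGNED is an open assumption on my part):

```sh
# Sketch of the target DDL, shown as a direct ALTER for clarity only --
# at this table size we are not running it in place.
# "mydb"/"app_log" are placeholder names; UNSIGNED is an assumption.
mysql -e "ALTER TABLE mydb.app_log
            MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;"
```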
Dev Aurora MySQL:
Standalone
8.0.mysql_aurora.3.03.2
Instance Class: r6g.xlarge
vCPU: 4
RAM: 32GB
Prod Aurora MySQL:
1 Writer, 3 Readers
8.0.mysql_aurora.3.03.2
Instance Class: r6g.4xlarge
vCPU: 16
RAM: 128GB
The table in Dev is comparable in size to the table in Prod, which is fed live data.
Going through the InnoDB counters over the days leading up to the crash (the OOM described below), I saw the following changes in the InnoDB-related status variables:
Variable_name | 13-Feb | 14-Feb | 15-Feb | 16-Feb |
---|---|---|---|---|
Innodb_data_read | 424,669,511,680 | 521,605,955,584 | 1,325,900,000,000 | 2,143,511,166,976 |
Innodb_buffer_pool_read_requests | 1,739,278,367 | 6,113,415,351 | 10,143,000,000 | 12,956,725,512 |
Innodb_buffer_pool_write_requests | 39,782,681 | 907,451,157 | 1,843,782,860 | 2,535,533,175 |
Innodb_rows_read | 1,337,297,480 | 1,980,224,379 | 2,374,903,120 | 2,569,943,643 |
Innodb_rows_inserted | 16,207,281 | 84,650,132 | 172,602,958 | 230,227,039 |
Innodb_pages_read | 25,920,094 | 31,836,773 | 80,926,708 | 130,830,113 |
Innodb_buffer_pool_reads | 1,493,007 | 5,806,384 | 52,082,385 | 101,936,041 |
Innodb_aurora_shm_read_requests | 1,535,188 | 5,818,343 | 51,608,251 | 100,874,765 |
Innodb_pages_created | 254,358 | 4,747,483 | 10,178,795 | 14,049,427 |
Innodb_system_rows_read | 5,205,164 | 8,052,860 | 10,568,190 | 13,685,090 |
Innodb_logical_read_ahead_page_count | 26,085,663 | 30,172,845 | 33,562,159 | 33,719,920 |
Innodb_rows_updated | 597,391 | 4,325,410 | 4,325,586 | 4,325,645 |
Innodb_aurora_batched_read_requests | 3,271,874 | 3,503,060 | 4,132,259 | 4,432,683 |
Innodb_aurora_shm_batched_read_requests | 3,271,872 | 3,503,058 | 4,132,256 | 4,432,680 |
Innodb_system_rows_updated | 87,913 | 131,890 | 176,091 | 218,020 |
Innodb_buffer_pool_bytes_dirty | 2,473,984 | 2,375,680 | 2,375,680 | 2,408,448 |
Innodb_buffer_pool_bytes_data | 20,941,979,648 | 20,942,012,416 | 20,942,000,000 | 20,942,028,800 |
Innodb_data_reads | 189 | 1,090 | 5,650 | 5,650 |
Innodb_data_writes | 189 | 1,090 | 5,650 | 5,650 |
Innodb_system_rows_inserted | 7,697 | 9,157 | 10,733 | 11,791 |
Innodb_system_rows_deleted | 6,986 | 8,217 | 9,634 | 10,618 |
Innodb_buffer_pool_read_ahead_evicted | 5,741,504 | 5,741,888 | 5,741,969 | 5,741,969 |
Innodb_rows_deleted | 21,470 | 21,885 | 21,893 | 21,907 |
Innodb_num_open_files | 2,214 | 2,223 | 2,221 | 2,318 |
Innodb_row_lock_time | 95 | 99 | 106 | 128 |
Innodb_row_lock_waits | 8 | 11 | 13 | 15 |
Innodb_buffer_pool_pages_dirty | 151 | 145 | 145 | 147 |
Innodb_buffer_pool_pages_data | 1,278,197 | 1,278,199 | 1,278,200 | 1,278,200 |
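(These are SHOW GLOBAL STATUS counters; a daily snapshot along the lines below reproduces the columns above. This is just a sketch with connection options omitted, not necessarily how they were originally collected.)

```sh
# Daily snapshot of the InnoDB status counters shown above (rough
# sketch; connection options omitted). Each file corresponds to one
# date column of the table.
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb%';" > innodb_status_$(date +%F).txt
```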
Right now I'm assuming that MySQL is loading every query result into the buffer pool until the system crashes with an OOM. So far I've experimented with drastically lowering the Threads_running thresholds for critical-load and max-load; that helped keep the CPU from spiking, but I still hit an OOM after a couple of days. Then I set chunk-size=1000 and chunk-time=0.5, with the same result.
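For context, a sketch of the schema-change invocation (pt-online-schema-change syntax, since the chunk-size/chunk-time/critical-load/max-load flags above are its options; the endpoint, schema, table, user, and Threads_running thresholds below are placeholders, not the real values):

```sh
# Rough sketch of the online schema change described above, in
# pt-online-schema-change syntax. Endpoint, schema, table, user, and
# the Threads_running thresholds are placeholders.
pt-online-schema-change \
  --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
  --chunk-size=1000 \
  --chunk-time=0.5 \
  --max-load Threads_running=10 \
  --critical-load Threads_running=20 \
  --execute \
  h=dev-aurora-endpoint,D=mydb,t=app_log,u=admin --ask-pass
```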
I could add a reader node, but it seems to be the reads against this table that are consuming the RAM. Would increasing chunk-time help keep the copy from overwhelming the buffer pool?
Any suggestions? Let me know if you would like me to clarify anything.