pt-osc makes the database unusable. It seems not to honor the --max-load parameter

Some related information to better explain the strange behaviour I observed:

Running: 5.6.29-log MySQL Community Server (GPL)

Some configuration params: [INDENT]innodb_file_per_table = 1
innodb-buffer-pool-size = 60G
innodb_read_io_threads=64
innodb_write_io_threads=64
innodb-flush-log-at-trx-commit = 2

gtid_mode = on
enforce_gtid_consistency = true
log_slave_updates=true[/INDENT] Server params: [INDENT]70 GB RAM
16 Cores
RAID10 SSD ( approx. 20k IOPS )[/INDENT] I also have a replica for data recovery and failover on which no queries are ever executed ( only the replication thread ); for this test I stopped replication on it.

I have a table with:

PRIMARY KEY (id_segment,id_job),

PARTITION BY RANGE (id_job)
SUBPARTITION BY HASH (id_job)
(PARTITION PART0 VALUES LESS THAN (160000)
(SUBPARTITION PART_01 ENGINE = InnoDB,
SUBPARTITION PART_02 ENGINE = InnoDB,
SUBPARTITION PART_03 ENGINE = InnoDB,
SUBPARTITION PART_04 ENGINE = InnoDB),



Occasionally this table receives spikes of high load: bulk inserts of 100 tuples at a time.

Ex:

INSERT INTO tbl_name
(a,b,c)
VALUES
(1,1,1),
(2,2,2),
...
(100,100,100);



After that phase it also receives a high number of connected threads updating single rows by primary key with no row concurrency ( max 150 updates/s ).

On top of that, there are approx. 100-200 users doing selects and updates.
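The Threads_running figures below were sampled every two seconds with a small loop along these lines ( just a sketch of my monitoring, not part of pt-osc; it assumes the mysql client picks up host/credentials from .my.cnf ):

while true; do
    echo "$(date '+%F %T') - Threads_running $(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_running'" | awk '{print $2}')"
    sleep 2
done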

Under normal behaviour and load, Threads_running looks like this:


[LIST]
[*]2016-03-15 17:08:59 - Threads_running 3
[*]2016-03-15 17:09:01 - Threads_running 3
[*]2016-03-15 17:09:03 - Threads_running 2
[*]2016-03-15 17:09:05 - Threads_running 7
[*]2016-03-15 17:09:07 - Threads_running 3
[*]2016-03-15 17:09:09 - Threads_running 2
[*]2016-03-15 17:09:11 - Threads_running 3
[*]2016-03-15 17:09:13 - Threads_running 5
[/LIST]






pt-osc stops executing at around 20% of its work because it exceeds the default Threads_running threshold ( 50 ). I tried raising that value to 100; the results were pretty much the same, but the abort just happened later.

So I tried launching it with --max-load lower than the default ( Threads_running=15 ) and set --critical-load to 400 ( insane ), because after some tries I found that this value is never reached.
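For reference, the invocation was roughly the following ( a sketch: the --alter value is a placeholder and host/credentials are omitted, picked up from .my.cnf in my case ):

# sketch of the pt-osc command; adjust the DSN and --alter for your case
pt-online-schema-change \
  --alter "..." \
  --max-load Threads_running=15 \
  --critical-load Threads_running=400 \
  --execute \
  D=db_name,t=seg_t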



The results under normal load are:


[LIST]
[*]2016-03-15 17:08:05 - Threads_running 3
[*]2016-03-15 17:08:07 - Threads_running 73
[*]2016-03-15 17:08:09 - Threads_running 126
[*]2016-03-15 17:08:11 - Threads_running 152
[*]2016-03-15 17:08:13 - Threads_running 158
[*]2016-03-15 17:08:15 - Threads_running 164
[*]2016-03-15 17:08:17 - Threads_running 170
[*]2016-03-15 17:08:19 - Threads_running 174
[*]2016-03-15 17:08:21 - Threads_running 178
[*]2016-03-15 17:08:23 - Threads_running 183
[*]2016-03-15 17:08:25 - Threads_running 141
[*]2016-03-15 17:08:27 - Threads_running 120
[*]2016-03-15 17:08:29 - Threads_running 141
[*]2016-03-15 17:08:31 - Threads_running 148
[/LIST]



The application obviously becomes unresponsive and MySQL goes under heavy load.

So I tried an experiment: I sent a SIGINT to pt-osc, and these are the results:


^C# Exiting on SIGINT.
Not dropping triggers because the tool was interrupted. To drop the triggers, execute:
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_del;
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_upd;
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_ins;
Not dropping the new table db_name._seg_t_new because the tool was interrupted. To drop the new table, execute:
DROP TABLE IF EXISTS db_name._seg_t_new;
db_name.seg_t was not altered.
=== Command terminated with exit status 1 (Tue Mar 15 17:08:58 2016) ===

[LIST]
[*]2016-03-15 17:08:59 - Threads_running 3
[*]2016-03-15 17:09:01 - Threads_running 3
[*]2016-03-15 17:09:03 - Threads_running 2
[/LIST]
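In practice, the tool's cleanup instructions above boil down to something like this before a retry ( just a sketch; run it against the master, credentials omitted ):

# drop the leftover pt-osc triggers and the half-built copy table
mysql -e "
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_del;
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_upd;
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_ins;
DROP TABLE IF EXISTS db_name._seg_t_new;"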



This makes me think that the triggers are not the problem. It seems the issue is in the queries the tool executes to copy data from one table to the other.
So I thought that the value of --max-load is somehow not honored.

Any idea/suggestion on how I can perform this ALTER TABLE?
Can I improve the command parameters to lower the load?
Are some parameters in the server configuration wrong?

Tell me if you need more info.

Thank you in advance.

I found the issue. It worked ( with difficulty ) after I lowered the mean load on the database.

I suppose it was memory swapping and/or MySQL buffer pool churn as it loaded/unloaded the indexes of these large partitions while the application kept writing to the newest partitions. Once the tool started copying data into the smaller, newer partitions, the load was much lower.
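For anyone hitting the same thing, a rough way to watch for swapping and buffer pool misses while the copy runs ( just a sketch; the sampling interval is arbitrary and mysql client credentials are assumed to come from .my.cnf ):

# si/so columns show swap-in/swap-out activity, sampled every 2 seconds
vmstat 2
# compare Innodb_buffer_pool_reads (pages read from disk) with Innodb_buffer_pool_read_requests (logical reads)
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'"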

84 hours of alter table in total.

Closed?? :slight_smile: