pt-osc makes the database unusable. It seems not to honor the --max-load parameter

Some related information to better explain the strange behaviour I observed:

Running: 5.6.29-log MySQL Community Server (GPL)

Some configuration params: [INDENT]innodb_file_per_table = 1
innodb-buffer-pool-size = 60G
innodb_read_io_threads=64
innodb_write_io_threads=64
innodb-flush-log-at-trx-commit = 2

gtid_mode = on
enforce_gtid_consistency = true
log_slave_updates=true[/INDENT] Server params: [INDENT]70 GB RAM
16 Cores
RAID10 SSD ( approx. 20k IOPS )[/INDENT] I also have a replica for data recovery and failover on which no queries are ever executed ( only the replication thread ); for this test I stopped replication on it.

I have a table with:

PRIMARY KEY (id_segment,id_job),

PARTITION BY RANGE (id_job)
SUBPARTITION BY HASH (id_job)
(PARTITION PART0 VALUES LESS THAN (160000)
(SUBPARTITION PART_01 ENGINE = InnoDB,
SUBPARTITION PART_02 ENGINE = InnoDB,
SUBPARTITION PART_03 ENGINE = InnoDB,
SUBPARTITION PART_04 ENGINE = InnoDB),



Occasionally this table receives spikes of high load: bulk inserts of 100 tuples at a time.

Ex:

INSERT INTO tbl_name
(a,b,c)
VALUES
(1,1,1),
(2,2,2),
...
(100,100,100);



After that phase it also receives a high number of connected threads updating single rows by primary key with no row concurrency ( max 150 updates/s ).

On top of that, there are approx. 100-200 users doing selects and updates.
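The Threads_running figures below were sampled every two seconds with a small loop along these lines ( just a sketch of my monitoring, not part of pt-osc; it assumes the mysql client picks up host/credentials from .my.cnf ):

while true; do
    echo "$(date '+%F %T') - Threads_running $(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_running'" | awk '{print $2}')"
    sleep 2
done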

Under normal behaviour and load, Threads_running looks like this:


[LIST]
[*]2016-03-15 17:08:59 - Threads_running 3
[*]2016-03-15 17:09:01 - Threads_running 3
[*]2016-03-15 17:09:03 - Threads_running 2
[*]2016-03-15 17:09:05 - Threads_running 7
[*]2016-03-15 17:09:07 - Threads_running 3
[*]2016-03-15 17:09:09 - Threads_running 2
[*]2016-03-15 17:09:11 - Threads_running 3
[*]2016-03-15 17:09:13 - Threads_running 5
[/LIST]






pt-osc stops executing at around 20% of its work because it exceeds the default Threads_running threshold ( 50 ). I tried raising that value to 100; the results were pretty much the same, but the abort just happened later.

So I tried launching it with --max-load lower than the default ( Threads_running=15 ) and set --critical-load to 400 ( insane ), because after some tries I found that this value is never reached.
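For reference, the invocation was roughly the following ( a sketch: the --alter value is a placeholder and host/credentials are omitted, picked up from .my.cnf in my case ):

# sketch of the pt-osc command; adjust the DSN and --alter for your case
pt-online-schema-change \
  --alter "..." \
  --max-load Threads_running=15 \
  --critical-load Threads_running=400 \
  --execute \
  D=db_name,t=seg_t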



The results under normal load are:


[LIST]
[*]2016-03-15 17:08:05 - Threads_running 3
[*]2016-03-15 17:08:07 - Threads_running 73
[*]2016-03-15 17:08:09 - Threads_running 126
[*]2016-03-15 17:08:11 - Threads_running 152
[*]2016-03-15 17:08:13 - Threads_running 158
[*]2016-03-15 17:08:15 - Threads_running 164
[*]2016-03-15 17:08:17 - Threads_running 170
[*]2016-03-15 17:08:19 - Threads_running 174
[*]2016-03-15 17:08:21 - Threads_running 178
[*]2016-03-15 17:08:23 - Threads_running 183
[*]2016-03-15 17:08:25 - Threads_running 141
[*]2016-03-15 17:08:27 - Threads_running 120
[*]2016-03-15 17:08:29 - Threads_running 141
[*]2016-03-15 17:08:31 - Threads_running 148
[/LIST]



The application obviously becomes unresponsive and MySQL goes under heavy load.

So I tried an experiment: I sent a SIGINT to pt-osc, and these are the results:


^C# Exiting on SIGINT.
Not dropping triggers because the tool was interrupted. To drop the triggers, execute:
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_del;
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_upd;
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_ins;
Not dropping the new table db_name._seg_t_new because the tool was interrupted. To drop the new table, execute:
DROP TABLE IF EXISTS db_name._seg_t_new;
db_name.seg_t was not altered.
=== Command terminated with exit status 1 (Tue Mar 15 17:08:58 2016) ===

[LIST]
[*]2016-03-15 17:08:59 - Threads_running 3
[*]2016-03-15 17:09:01 - Threads_running 3
[*]2016-03-15 17:09:03 - Threads_running 2
[/LIST]
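In practice, the tool's cleanup instructions above boil down to something like this before a retry ( just a sketch; run it against the master, credentials omitted ):

# drop the leftover pt-osc triggers and the half-built copy table
mysql -e "
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_del;
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_upd;
DROP TRIGGER IF EXISTS db_name.pt_osc_db_name_seg_t_ins;
DROP TABLE IF EXISTS db_name._seg_t_new;"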



This makes me think that the triggers are not the problem. It seems the issue is in the queries the tool executes to copy data from one table to the other.
So I thought that the value of --max-load is somehow not honored.

Any idea/suggestion on how I can perform this ALTER TABLE?
Can I improve the command parameters to lower the load?
Are some parameters in the server configuration wrong?

Tell me if you need more info.

Thank you in advance.

I found the issue. It worked ( with difficulty ) after I lowered the mean load on the database.

I suppose it was memory swapping and/or MySQL buffer pool churn as it loaded/unloaded the indexes of these large partitions while the application kept writing to the newest partitions. Once the tool started copying data into the smaller, newer partitions, the load was much lower.
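For anyone hitting the same thing, a rough way to watch for swapping and buffer pool misses while the copy runs ( just a sketch; the sampling interval is arbitrary and mysql client credentials are assumed to come from .my.cnf ):

# si/so columns show swap-in/swap-out activity, sampled every 2 seconds
vmstat 2
# compare Innodb_buffer_pool_reads (pages read from disk) with Innodb_buffer_pool_read_requests (logical reads)
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'"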

84 hours of alter table in total.

Closed?? :slight_smile: