pt-online-schema-change crashed MySQL within 5 minutes of starting the job

I have run pt-online-schema-change (ptosc) in an automated fashion probably over 50,000 times successfully, so to date this big-data shop has lived and died by the tool. The ptosc calls are wrapped in a bash script on CentOS 6.5.x in order to modify 13,000-plus identical schemas; each schema is a specific customer's database.

Yes, I simply love the tool, but today it took us down while running the syntax below.
You will notice the script also does plain online ALTERs when a table is empty, to avoid the slow ptosc startup (a rough sketch of that branch follows the alter variables below), but when the problem occurred it was performing a ptosc alter, not a direct one.

The script


cmd_to_execute="pt-online-schema-change --alter \"$alter_cmd\" D=$app,t=$alter_table --user $user_name -p$pass_word --execute --no-drop-old-table --no-drop-new-table --max-lag 60 --check-interval 120 --statistics --progress time,30 --print --chunk-size 1000 --chunk-time 0.5 --chunk-size-limit 4.0 --nocheck-replication-filters --max-load $max_load --critical-load $critical_load"

running_threads=0                      # script pause-control variables used to sleep the wrapper between ptosc alters
max_threads=250
acceptable_threads=50
max_load="Threads_running=300"         # ptosc pause-control variables
critical_load="Threads_running=1500"

Key alter variables

alter_cmd="MODIFY COLUMN site_id VARCHAR(512) DEFAULT NULL"
alter_table="tablename"
alter_column="site_id"
alter_column_type="varchar(512)"
alter_online_sub_cmd=",ALGORITHM=INPLACE, LOCK=NONE"
alter_copy_sub_cmd=",ALGORITHM=COPY, LOCK=SHARED"
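
Roughly how the wrapper chooses between a plain online ALTER and ptosc (a simplified sketch, not the production script; the row-count check and the bare mysql calls, which assume credentials in ~/.my.cnf, are illustrative only):

# Simplified sketch only -- not the production script.
row_count=$(mysql -N -e "SELECT COUNT(*) FROM \`$app\`.\`$alter_table\`")

if [ "$row_count" -eq 0 ]; then
    # Empty table: run the ALTER directly with ALGORITHM=INPLACE, LOCK=NONE
    # to avoid ptosc's slow startup (checks, trigger creation, etc.).
    mysql -e "ALTER TABLE \`$app\`.\`$alter_table\` ${alter_cmd}${alter_online_sub_cmd}"
else
    # Non-empty table: hand the work to pt-online-schema-change.
    eval "$cmd_to_execute"
fi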


Prior to the crash I had successfully executed the exact same ALTER on at least 6 customer databases of varying sizes, up to around 60 million rows. Today's execution was on a table with 400-600 million rows. The same table has a very high volume of inserts and reads, plus some updates and deletes, so the ptosc triggers were very busy.
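
For anyone picturing the trigger load: ptosc keeps the copy in sync through three AFTER triggers on the original table, roughly of the form below. This is a paraphrase from memory with placeholder table and column names, not the exact DDL the tool created here (the real DELETE trigger matches on the full composite primary key):

-- every application INSERT is mirrored into the new table
CREATE TRIGGER pt_osc_app_tablename_ins AFTER INSERT ON tablename
FOR EACH ROW REPLACE INTO _tablename_new (id, site_id /* , ... */) VALUES (NEW.id, NEW.site_id /* , ... */);

-- every UPDATE is replayed as a REPLACE into the new table
CREATE TRIGGER pt_osc_app_tablename_upd AFTER UPDATE ON tablename
FOR EACH ROW REPLACE INTO _tablename_new (id, site_id /* , ... */) VALUES (NEW.id, NEW.site_id /* , ... */);

-- every DELETE is replayed against the new table
CREATE TRIGGER pt_osc_app_tablename_del AFTER DELETE ON tablename
FOR EACH ROW DELETE IGNORE FROM _tablename_new WHERE _tablename_new.id = OLD.id;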

Within 5 to 10 minutes of the job starting, thread counts rose to 2,000-plus; minutes afterwards MySQL hung and then crashed itself (assertion failure in the log below). This was not a memory issue: memory usage has sat around 70% of the 384 GB of RAM for months and never rose above that level.

Note also that this system normally averages 3-25 running threads, sometimes spiking to 250 during high traffic; when we run truncates it will surge to 3,000 threads for the 10-30 seconds the truncates take and then immediately recover.
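
For completeness, the pause control between alters (the max_threads / acceptable_threads variables above) works along these lines; again a simplified sketch, not the exact production code:

# Simplified sketch of the between-alter pause logic.
get_running_threads() {
    mysql -N -e "SHOW GLOBAL STATUS LIKE 'Threads_running'" | awk '{print $2}'
}

running_threads=$(get_running_threads)
if [ "$running_threads" -ge "$max_threads" ]; then
    # Server too busy to start another alter: wait for load to fall back.
    while [ "$(get_running_threads)" -gt "$acceptable_threads" ]; do
        sleep 30
    done
fi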

I am using percona-toolkit-2.2.15-2.noarch on Percona Server 5.6.23-72.1.
mysql> show variables like '%version%';
+-------------------------+------------------------------------------------------+
| Variable_name           | Value                                                |
+-------------------------+------------------------------------------------------+
| innodb_version          | 5.6.23-72.1                                          |
| protocol_version        | 10                                                   |
| slave_type_conversions  |                                                      |
| version                 | 5.6.23-72.1-log                                      |
| version_comment         | Percona Server (GPL), Release 72.1, Revision 0503478 |
| version_compile_machine | x86_64                                               |
| version_compile_os      | Linux                                                |
+-------------------------+------------------------------------------------------+
7 rows in set (0.00 sec)

select id from db.tablename order by id desc limit 1;
-- 7,813,574,370 current auto-increment value
-- ptosc was estimating over a billion rows
-- I suspect the actual count is lower, around 400-600 million rows, since we truncate partitions older than 4 months, but the billion could be correct given the auto-increment is at 7.8 billion.
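
A quick way to sanity-check the row count per partition, using InnoDB's statistics rather than a full COUNT(*) (db/table names here are placeholders):

# Estimated rows per partition and in total (InnoDB statistics, not exact counts).
mysql -e "SELECT PARTITION_NAME, TABLE_ROWS FROM information_schema.PARTITIONS
          WHERE TABLE_SCHEMA='db' AND TABLE_NAME='tablename';"
mysql -N -e "SELECT SUM(TABLE_ROWS) FROM information_schema.PARTITIONS
             WHERE TABLE_SCHEMA='db' AND TABLE_NAME='tablename';"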

Is it possible that I have the chunk values set incorrectly and that this caused the server to hang? My intent was to copy 1,000 rows at a time and let the job run for about 8 days.
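(Rough numbers behind that intent: ~500 million rows at 1,000 rows per chunk is about 500,000 chunks; even at the --chunk-time target of 0.5 s per chunk that is only around 70 hours of pure copy time, so 8 days leaves plenty of headroom for throttling pauses. If I read the docs correctly, explicitly setting --chunk-size also disables dynamic chunk sizing, so the chunks should stay fixed at 1,000 rows.)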

Since there were hundreds of thousands of messages in the mysql-error.log, I have reduced it to what appeared meaningful.
I eliminated the warnings related to auto-increment inserts being written to the log.

2015-11-03 20:58:19 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 20:58:24 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 20:58:19 7f949b455700 TOO DEEP OR LONG SEARCH IN THE LOCK TABLE WAITS-FOR GRAPH, WE WILL ROLL BACK FOLLOWING TRANSACTION
2015-11-03 20:58:41 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 20:59:01 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 20:59:21 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 20:59:41 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 21:00:01 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 21:00:21 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 21:00:41 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 21:00:55 7f9955f53700 InnoDB: Assertion failure in thread 140296548857600 in file srv0srv.cc line 2128
2015-11-03 21:00:58 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2015-11-03 21:00:59 17172 [Warning] option 'innodb-sort-buffer-size': unsigned value 5368709120 adjusted to 67108864
2015-11-03 21:00:59 17172 [Note] Plugin 'FEDERATED' is disabled.
2015-11-03 21:00:59 17172 [Note] InnoDB: Using atomics to ref count buffer pool pages
2015-11-03 21:00:59 17172 [Note] InnoDB: The InnoDB memory heap is disabled
2015-11-03 21:00:59 17172 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2015-11-03 21:00:59 17172 [Note] InnoDB: Memory barrier is not used
2015-11-03 21:00:59 17172 [Note] InnoDB: Compressed tables use zlib 1.2.3
2015-11-03 21:00:59 17172 [Note] InnoDB: Using Linux native AIO
2015-11-03 21:00:59 17172 [Note] InnoDB: Using CPU crc32 instructions
2015-11-03 21:00:59 17172 [Note] InnoDB: Initializing buffer pool, size = 240.0G
2015-11-03 21:01:12 17172 [Note] InnoDB: Completed initialization of buffer pool
2015-11-03 21:01:14 17172 [Note] InnoDB: Highest supported file format is Barracuda.
2015-11-03 21:01:14 17172 [Note] InnoDB: Log scan progressed past the checkpoint lsn 115472918222496
2015-11-03 21:01:14 17172 [Note] InnoDB: Database was not shutdown normally!
2015-11-03 21:01:14 17172 [Note] InnoDB: Starting crash recovery.
2015-11-03 21:01:14 17172 [Note] InnoDB: Reading tablespace information from the .ibd files...
2015-11-03 21:01:14 17172 [Note] InnoDB: Restoring possible half-written data pages
2015-11-03 21:01:14 17172 [Note] InnoDB: from the doublewrite buffer...
2015-11-03 21:01:17 17172 [Note] InnoDB: Starting an apply batch of log records to the database...
2015-11-03 21:01:35 17172 [Note] InnoDB: 128 rollback segment(s) are active.
2015-11-03 21:01:35 17172 [Note] InnoDB: Waiting for purge to start
2015-11-03 21:01:35 17172 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.23-72.1 started; log sequence number 115473012723041
2015-11-03 21:01:35 17172 [Note] Recovering after a crash using /var/log/mysql/mysql-bin
2015-11-03 21:01:35 17172 [Note] Starting crash recovery...
2015-11-03 21:01:35 17172 [Note] Crash recovery finished.
2015-11-03 21:01:36 17172 [Note] RSA private key file not found: /database/production//private_key.pem. Some authentication plugins will not work.
2015-11-03 21:01:36 17172 [Note] RSA public key file not found: /database/production//public_key.pem. Some authentication plugins will not work.
2015-11-03 21:01:36 17172 [Note] Server hostname (bind-address): '*'; port: xxxx
2015-11-03 21:01:36 17172 [Note] IPv6 is available.
2015-11-03 21:01:36 17172 [Note] - '::' resolves to '::';
2015-11-03 21:01:36 17172 [Note] Server socket created on IP: '::'.
2015-11-03 21:01:36 17172 [Warning] 'user' entry 'root@control-db-xx' ignored in --skip-name-resolve mode.
2015-11-03 21:01:36 17172 [Warning] 'proxies_priv' entry '@ root@control-db-xx' ignored in --skip-name-resolve mode.
2015-11-03 21:01:36 17172 [Note] Slave I/O thread: connected to master 'xxxxxxx@xxx.xx.xx.xxx:xxx6', replication started in log 'mysql-bin.001503' at position 4876
2015-11-03 21:01:36 17172 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.001503' at position 4876, relay log '/var/log/mysql/relay-bin.009018' position: 283
2015-11-03 21:01:36 17172 [Note] Event Scheduler: Loaded 0 events
2015-11-03 21:01:36 17172 [Note] /usr/sbin/mysqld: ready for connections.

All help appreciated.
Again, I am suspicious of the chunking technique I used, as well as of the volume of trigger SQL.

Scott

Since the above problem #1 occurred, I have proved that ptosc builds bad chunking SQL for a monthly partitioned table whose primary key is a two-column composite: a single auto-increment id plus the month partition column.
In sum, ptosc generated the bad SQL query shown further below. Within five minutes of the job beginning, the entire system locked up as detailed previously; I immediately killed ptosc and manually removed the triggers, and the problem resolved itself. ptosc should have moved sequentially from partition p1 to p12, chunking within each partition, rather than addressing all partitions in one SQL statement. It appears either a share lock was taken on ALL rows with the first chunk, or the chunk was huge because of the bad month references in the WHERE clause. Given how immediately the problem hit on such a large system, I suspect an entire-table share lock rather than a lock on just one chunk.

I later used the ptosc --chunk-index and --chunk-index-columns options to have ptosc ignore the second column of the composite key, the monthly partition column (see the options below).
This solved the startup problem of locking up the entire system; the millions of chunks ran smoothly all night long, until the chunking moved into the current month's activity, where millions of records are inserted every hour. At about 4 AM the system CPU surged: user CPU normally sits around 1-3%, but with the chunk contention system CPU (not user) climbed towards 100% and stayed there for 3 hours.
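
Concretely, the only change was adding these two options to the ptosc command line above (values as I recall setting them; the chunk index is the table's PRIMARY key):

# Added to cmd_to_execute: chunk on only the first PK column (id),
# ignoring the month partition column.
--chunk-index PRIMARY --chunk-index-columns 1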

Problem #2 - my current problem
By 7 AM, when the chunking reached the current hour's inserts, the system DEADLOCKED. ptosc was definitely still chunking and the triggers were still firing.

At this stage I am looking for a means to run the chunking in READ UNCOMMITTED mode, and thus not take share locks on the data currently being inserted by normal business traffic.

All help greatly appreciated. My next steps while awaiting feedback are to restore the same large customer db onto another set of identical nodes and recreate these problems:
-- first without traffic, to prove the alter can actually finish
-- then with a massively parallel insert routine firing at the same table (a rough harness sketch follows this list), to prove the problem recreates
-- then hopefully with no share locks taken
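
For the parallel-insert step I am thinking of something as crude as the following (a throwaway sketch; worker counts, column list and values are placeholders, and in practice I would batch the inserts or use mysqlslap):

#!/bin/bash
# Throwaway harness: fire parallel inserts at the same table to simulate
# production write traffic while ptosc is chunking.
DB=app
TABLE=tablename
WORKERS=32
ROWS_PER_WORKER=1000000

for w in $(seq 1 "$WORKERS"); do
    (
        for i in $(seq 1 "$ROWS_PER_WORKER"); do
            mysql "$DB" -e "INSERT INTO $TABLE (site_id, month, day_of_week)
                            VALUES ('site-$w-$i', MONTH(NOW()), DAYOFWEEK(NOW()));"
        done
    ) &
done
wait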

Anxiously awaiting any feedback and thanks in advance.
Scott

The bad SQL generated is below.

INSERT LOW_PRIORITY IGNORE INTO `app`.`___table_new` (`id`, ... many other columns ..., `month`, `day_of_week`)
SELECT `id`, ... many other columns ..., `month`, `day_of_week`
FROM `app`.`table` FORCE INDEX(`PRIMARY`)
WHERE ((`id` > 1) OR (`id` = 1 AND `month` >= 1)) AND ((`id` < 1000) OR (`id` = 1000 AND `month` <= 1))
LOCK IN SHARE MODE /*pt-online-schema-change 39997 copy nibble*/;
-- 1 SIMPLE clicks range PRIMARY PRIMARY 9 6 Using where

SELECT /*!40001 SQL_NO_CACHE */ `id`, `id`, `month`
FROM `app`.`table` FORCE INDEX(`PRIMARY`)
WHERE ((`id` > 1) OR (`id` = 1 AND `month` >= 1))
ORDER BY `id`, `month` LIMIT 1000, 2 /*next chunk boundary*/;
-- 1 SIMPLE clicks range PRIMARY PRIMARY 9 531,150,840 Using where; Using index

As noted above, switching to --chunk-index / --chunk-index-columns fixed the startup locking of problem #1, but once the chunking reached the current hour's inserts the system again locked up (unresolved problem #2).

The much better SQL generated, but still run with LOCK IN SHARE MODE:

FORCE INDEX(`PRIMARY`)
WHERE ((`id` >= 0))
AND ((`id` <= 8500000000)) LOCK IN SHARE MODE /*pt-online-schema-change 6719 copy nibble*/;
-- note it estimates 8.5 billion rows
-- 1 SIMPLE clicks range PRIMARY PRIMARY 8 2 Using where

The chunk boundary query:
WHERE ((`id` > 1) OR (`id` = 1 AND `month` >= 1))
ORDER BY `id`, `month` LIMIT 1000, 2 /*next chunk boundary*/;
-- 1 SIMPLE clicks range PRIMARY PRIMARY 9 531,150,840 Using where; Using index (EXPLAIN estimates 500 million rows)