To date this big-data shop has lived and died by pt-online-schema-change: I have probably run ptosc over 50,000 times in automated fashion, successfully. The ptosc invocation is wrapped in a bash script on CentOS 6.5 x86_64 so that I can modify 13,000-plus identical schemas; each schema represents a specific customer's database.
Yes, I simply love the tool, but … today we died by it, running the identical syntax below.
You will notice the script also does direct online ALTERs when a table is empty, to avoid ptosc's slow startup (see the sketch after the variable list below), but it was performing a ptosc alter when the problem occurred.
The script
cmd_to_execute="pt-online-schema-change --alter \"$alter_cmd\" D=$app,t=$alter_table --user $user_name -p$pass_word --execute --no-drop-old-table --no-drop-new-table --max-lag 60 --check-interval 120 --statistics --progress time,30 --print --chunk-size 1000 --chunk-time 0.5 --chunk-size-limit 4.0 --nocheck-replication-filters --max-load $max_load --critical-load $critical_load"
running_threads=0 # these are script pause control variables to sleep the process between ptosc alters
max_threads=250
acceptable_threads=50
max_load="Threads_running=300" # these are ptosc pause control variables
critical_load="Threads_running=1500"
# key alter variables
alter_cmd="MODIFY COLUMN site_id VARCHAR(512) DEFAULT NULL"
alter_table="tablename"
alter_column="site_id"
alter_column_type="varchar(512)"
alter_online_sub_cmd=",ALGORITHM=INPLACE, LOCK=NONE"
alter_copy_sub_cmd=",ALGORITHM=COPY, LOCK=SHARED"
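Roughly, the wrapper ties these variables together as sketched below. This is a simplified stand-in for the real script, not its literal contents; the empty-table check, the mysql invocations, and the thread-settle loop are approximations of what the actual wrapper does.

# sketch: decide between a direct online ALTER (empty table) and ptosc
has_rows=$(mysql -u "$user_name" -p"$pass_word" -N -e \
  "SELECT EXISTS(SELECT 1 FROM ${app}.${alter_table} LIMIT 1)")

if [ "$has_rows" -eq 0 ]; then
  # empty table: skip the ptosc startup cost and alter in place
  mysql -u "$user_name" -p"$pass_word" -e \
    "ALTER TABLE ${app}.${alter_table} ${alter_cmd}${alter_online_sub_cmd}"
else
  # non-empty table: run the ptosc command assembled above
  eval "$cmd_to_execute"
fi

# pause control between alters: wait for Threads_running to settle
running_threads=$(mysql -u "$user_name" -p"$pass_word" -N -e \
  "SHOW GLOBAL STATUS LIKE 'Threads_running'" | awk '{print $2}')
while [ "$running_threads" -gt "$acceptable_threads" ]; do
  sleep 30
  running_threads=$(mysql -u "$user_name" -p"$pass_word" -N -e \
    "SHOW GLOBAL STATUS LIKE 'Threads_running'" | awk '{print $2}')
done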
Prior to the crash I had successfully executed the exact same ALTER on at least six customer databases of varying sizes, up to around 60 million rows. Today's execution was on a table with 400-600 million rows. The same table has very high volumes of inserts and reads, plus some updates and deletes, so the ptosc triggers were very busy.
Within 5 to 10 minutes of the job starting, thread counts rose to 2,000-plus, and minutes afterwards MySQL hung and then crashed itself. This was not a memory issue: memory use has sat around 70% of the 384 GB of RAM for months and never rose above that level.
Note also that this system normally averages 3-25 running threads, sometimes spiking to 250 during high traffic; when we do truncates it will surge to 3,000 threads for the 10-30 seconds the truncates take and then immediately recover.
I am using percona-toolkit-2.2.15-2.noarch on Percona Server 5.6.23-72.1.
mysql> show variables like '%version%';
+-------------------------+-------------------------------------------------------+
| Variable_name           | Value                                                 |
+-------------------------+-------------------------------------------------------+
| innodb_version          | 5.6.23-72.1                                           |
| protocol_version        | 10                                                    |
| slave_type_conversions  |                                                       |
| version                 | 5.6.23-72.1-log                                       |
| version_comment         | Percona Server (GPL), Release 72.1, Revision 0503478  |
| version_compile_machine | x86_64                                                |
| version_compile_os      | Linux                                                 |
+-------------------------+-------------------------------------------------------+
7 rows in set (0.00 sec)
select id from db.tablename order by id desc limit 1;
-- 7,813,574,370 is the current auto-increment value
-- ptosc was estimating over a billion rows
-- I suspect the real count is lower, around 400-600 million rows, since we truncate partitions older than 4 months, but over a billion could have been correct given that the auto-increment value is 7.8 billion.
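As a sanity check on that estimate, a query like the one below returns the optimizer's per-partition row estimates without a full COUNT(*). The db and tablename here are placeholders, and the numbers are estimates, not exact counts:

# sketch: estimated rows per partition from information_schema
mysql -N -e "
  SELECT PARTITION_NAME, TABLE_ROWS
  FROM information_schema.PARTITIONS
  WHERE TABLE_SCHEMA = 'db' AND TABLE_NAME = 'tablename';"

SHOW TABLE STATUS LIKE 'tablename' gives the same kind of estimate for the whole table.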
Is it possible that I have the chunk values wrong and that this is what caused the server to hang? My intent was to copy 1,000 rows at a time and let the job run for about 8 days.
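For reference, my back-of-the-envelope math on the chunking, assuming roughly 500 million rows and keeping in mind that with --chunk-time 0.5 the tool adjusts the chunk size so each chunk takes about half a second, meaning 1,000 rows is only the starting size:

# rough copy-time estimate: pure chunk time, ignoring lag pauses and throttling
rows=500000000           # assumed row count
chunk_rows=1000          # initial --chunk-size
chunk_seconds=0.5        # --chunk-time target
echo "$rows $chunk_rows $chunk_seconds" | \
  awk '{ printf "chunks=%d, copy time ~ %.1f days\n", $1/$2, ($1/$2)*$3/86400 }'
# -> chunks=500000, copy time ~ 2.9 days of pure copying

So the 8-day window was meant to leave plenty of room for --max-lag pauses and load throttling on top of the raw copy time.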
Since there were hundreds of thousands of messages in mysql-error.log, I have reduced it to what appeared meaningful. I eliminated the warnings about auto-increment inserts being written to the log.
2015-11-03 20:58:19 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 20:58:24 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 20:58:19 7f949b455700 TOO DEEP OR LONG SEARCH IN THE LOCK TABLE WAITS-FOR GRAPH, WE WILL ROLL BACK FOLLOWING TRANSACTION
2015-11-03 20:58:41 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 20:59:01 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 20:59:21 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 20:59:41 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 21:00:01 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 21:00:21 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 21:00:41 7f9955552700 INNODB MONITOR OUTPUT
2015-11-03 21:00:55 7f9955f53700 InnoDB: Assertion failure in thread 140296548857600 in file srv0srv.cc line 2128
2015-11-03 21:00:58 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2015-11-03 21:00:59 17172 [Warning] option 'innodb-sort-buffer-size': unsigned value 5368709120 adjusted to 67108864
2015-11-03 21:00:59 17172 [Note] Plugin 'FEDERATED' is disabled.
2015-11-03 21:00:59 17172 [Note] InnoDB: Using atomics to ref count buffer pool pages
2015-11-03 21:00:59 17172 [Note] InnoDB: The InnoDB memory heap is disabled
2015-11-03 21:00:59 17172 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2015-11-03 21:00:59 17172 [Note] InnoDB: Memory barrier is not used
2015-11-03 21:00:59 17172 [Note] InnoDB: Compressed tables use zlib 1.2.3
2015-11-03 21:00:59 17172 [Note] InnoDB: Using Linux native AIO
2015-11-03 21:00:59 17172 [Note] InnoDB: Using CPU crc32 instructions
2015-11-03 21:00:59 17172 [Note] InnoDB: Initializing buffer pool, size = 240.0G
2015-11-03 21:01:12 17172 [Note] InnoDB: Completed initialization of buffer pool
2015-11-03 21:01:14 17172 [Note] InnoDB: Highest supported file format is Barracuda.
2015-11-03 21:01:14 17172 [Note] InnoDB: Log scan progressed past the checkpoint lsn 115472918222496
2015-11-03 21:01:14 17172 [Note] InnoDB: Database was not shutdown normally!
2015-11-03 21:01:14 17172 [Note] InnoDB: Starting crash recovery.
2015-11-03 21:01:14 17172 [Note] InnoDB: Reading tablespace information from the .ibd files...
2015-11-03 21:01:14 17172 [Note] InnoDB: Restoring possible half-written data pages
2015-11-03 21:01:14 17172 [Note] InnoDB: from the doublewrite buffer...
2015-11-03 21:01:17 17172 [Note] InnoDB: Starting an apply batch of log records to the database...
2015-11-03 21:01:35 17172 [Note] InnoDB: 128 rollback segment(s) are active.
2015-11-03 21:01:35 17172 [Note] InnoDB: Waiting for purge to start
2015-11-03 21:01:35 17172 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.23-72.1 started; log sequence number 115473012723041
2015-11-03 21:01:35 17172 [Note] Recovering after a crash using /var/log/mysql/mysql-bin
2015-11-03 21:01:35 17172 [Note] Starting crash recovery...
2015-11-03 21:01:35 17172 [Note] Crash recovery finished.
2015-11-03 21:01:36 17172 [Note] RSA private key file not found: /database/production//private_key.pem. Some authentication plugins will not work.
2015-11-03 21:01:36 17172 [Note] RSA public key file not found: /database/production//public_key.pem. Some authentication plugins will not work.
2015-11-03 21:01:36 17172 [Note] Server hostname (bind-address): '*'; port: xxxx
2015-11-03 21:01:36 17172 [Note] IPv6 is available.
2015-11-03 21:01:36 17172 [Note]   - '::' resolves to '::';
2015-11-03 21:01:36 17172 [Note] Server socket created on IP: '::'.
2015-11-03 21:01:36 17172 [Warning] 'user' entry 'root@control-db-xx' ignored in --skip-name-resolve mode.
2015-11-03 21:01:36 17172 [Warning] 'proxies_priv' entry '@ root@control-db-xx' ignored in --skip-name-resolve mode.
2015-11-03 21:01:36 17172 [Note] Slave I/O thread: connected to master 'xxxxxxx@xxx.xx.xx.xxx:xxx6',replication started in log 'mysql-bin.001503' at position 4876
2015-11-03 21:01:36 17172 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.001503' at position 4876, relay log '/var/log/mysql/relay-bin.009018' position: 283
2015-11-03 21:01:36 17172 [Note] Event Scheduler: Loaded 0 events
2015-11-03 21:01:36 17172 [Note] /usr/sbin/mysqld: ready for connections.
All help appreciated.
Again, I am suspicious of the chunking technique I used, as well as the volume of trigger SQL.
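For completeness, this is roughly how I am cleaning up what the aborted run left behind. The trigger and table names below assume ptosc's default naming convention, so I am verifying them with SHOW TRIGGERS and SHOW TABLES before dropping anything:

# sketch: drop the leftover ptosc triggers and the copy table after the aborted run
# (names assume the default pt_osc_<db>_<table>_* and _<table>_new convention; verify first)
mysql -u "$user_name" -p"$pass_word" "$app" -e "
  DROP TRIGGER IF EXISTS pt_osc_${app}_${alter_table}_ins;
  DROP TRIGGER IF EXISTS pt_osc_${app}_${alter_table}_upd;
  DROP TRIGGER IF EXISTS pt_osc_${app}_${alter_table}_del;
  DROP TABLE IF EXISTS \`_${alter_table}_new\`;"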
…
Scott