Pt-online-schema-change causes MariaDB/MySQL database to crash for very large tables

I am currently trying to run pt-online-schema-change on a MariaDB (v 10.2.36) database, in an attempt to “shrink” InnoDB tables. The table itself is very large – weighing in at 450Gb.

I am running pt-online-schema-change (version 3.2.1) to attempt this, as follows:

nohup pt-online-schema-change D=my_database,t=very_large_table,h=127.0.0.1,u=root --alter=“ENGINE=InnoDB” --alter-foreign-keys-method=“auto” --execute &

So basically it spends a long parsing and creating a new version of the table, but then it eventually CRASHES the mySql database, with the following error.log message:

[ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

I have tried the above command on a smaller table (40Gb) and there are no problems at all, but in the case of the large table, I have not had a lot of luck…

Does anyone have any suggestions, perhaps some ‘magic’ command-line options to use?

Thanks a lot for your help,

Colin

@colin Unfortunately there are no “magic” options. FOREIGN KEYS create much more work for the database because additional locks on all rows in parent tables must be acquired. These metadata locks can cause long timeouts to occur. If possible, attempt to stop (or reduce as much as possible) all other DML while running this alter table.

I suggest reading the manual https://www.percona.com/doc/percona-toolkit/LATEST/pt-online-schema-change.html and pay careful attention to the ‘alter-foreign-keys-method’ parameter. Perhaps using drop-swap or none may help, but do be aware of the consequences.

Depending on how many FKs you have, you might get much faster results by dropping the FKs, running the ALTER, then adding the FKs back in manually.

Thanks for the helpful comment @matthewb , I took a careful look at the table and the corresponding documentation.

One interesting point, with regards to Foreign Keys, there are none within this table, let me show you what the format is:

CREATE TABLE TableA (

column1 int(11) NOT NULL,

column2 int(11) NOT NULL,

column3 int(11) NOT NULL,

column4 int(11) NOT NULL,

column5 varchar(25) NOT NULL DEFAULT ‘0’,

column6 varchar(2048) DEFAULT NULL,

lastUpdated timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

PRIMARY KEY (column1,column2,column3,column4,column5),

KEY rd_time (lastUpdated)

) ENGINE=InnoDB DEFAULT CHARSET=latin1

So, I might try using –alter-foreign-keys-method=none” to see if this will speed things up. Also, this composite Primary Key, is that something that I need to take into consideration?

Given the knowledge of the table structure, are there any other options that you might suggest?

Thanks,

Colin

If there are no foreign keys, just remove ‘–alter-foreign-keys-method’ that parameter completely; you don’t need it.

Remove the ‘nohup’ and the ‘&’. Why do you have those?

You might want to try a specific chunk size, –chunk-size=5000 to see if that helps with the locking issues.

Thanks very much for that @matthewb , I will try what you have suggested.

Btw, ‘nohup’ and ‘&’ are UNIX related and will not affect the operation of the command.

Cheers,

Colin

I know what nohup and & are for, I just understand why you have them there. Why are you wanting to background pt-osc? A better option would be to use screen or tmux

Hi @matthewb , I am running this on an AWS connected EC2 server and I cannot rely on the connection remaining open, hence I am using nohup , which is probably a little ‘old fashioned’.

Thanks for clueing me in, I will ask the Unix guru’s around here what I should use – screen, tmux or …?

I use screen. It absolutely helps with disconnections from SSH.

Yes, I just checked with our resident Unix guru and he totally agrees – screen!

Thanks for the tip!

Hi @matthewb ,

Unfortunately, the same problem as before – I’m out of ideas…:

Command output:

Copying MyDatabase.TableName: 94% 13:11 remain

Copying MyDatabase.TableName: 95% 12:05 remain

2020-11-24T13:07:27 Error dropping trigger: DBI connect(‘MyDatabase;host=127.0.0.1;mysql_read_default_group=client’,‘root’,…) failed: Can’t connect to MySQL server on ‘127.0.0.1’ (111 “Connection refused”) at /usr/bin/pt-online-schema-change line 2345.

Error log output:


END OF INNODB MONITOR OUTPUT

============================

InnoDB: ###### Diagnostic info printed to the standard error stream

2020-11-24 13:07:26 139635055142656 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

201124 13:07:26 [ERROR] mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 10.2.36-MariaDB-log

Hmm. It says error dropping trigger on the crash. Try --no-drop-triggers This has the side effect of keeping the old table around after the rename. If this works, you’ll have to manually drop the old table.

Good idea, I will set it up and attempt this, now.

ps: And I will use screen as well :wink:

Hi @matthewb ,

I ran the job last night using the --no-drop-triggers switch and it again ‘hung’ the database, (the output follows

  1. ).

    I am wondering if this has ever happened before, with an extra large table?

  2. Is there a switch that we could use to combat this timeout from occurring?
  3. Is there, perhaps at the mySql end, a configuration option which could remedy this problem?

Thanks,

Colin

Command output:

Copying MyDatabase.TableA: 96% 09:08 remain

Copying MyDatabase.TableA: 96% 08:03 remain

Not dropping triggers because --no-drop-triggers was specified. To drop the triggers, execute:

DROP TRIGGER IF EXISTS TableA.pt_osc_mydatabase_tablea_del

DROP TRIGGER IF EXISTS TableA.pt_osc_mydatabase_tablea_upd

DROP TRIGGER IF EXISTS TableA.pt_osc_mydatabase_tablea_ins

MyDatabasae.TableA was not altered.

(in cleanup) 2020-11-24T20:59:25 Error copying rows from `MyDatabase`.`TableA` to `MyDatabase`.`_TableA_new`: DBI connect('MyDatabase;host=127.0.0.1;mysql_read_default_group=client','root',...) failed: Can't connect to MySQL server on '127.0.0.1' (111 "Connection refused") at /usr/bin/pt-online-schema-change line 2345.

From the error log:

InnoDB: ###### Diagnostic info printed to the standard error stream

2020-11-24 20:59:24 140504169277184 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

201124 20:59:24 [ERROR] mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

at the DROP triggers stage, kill any long running SELECT transactions