pt-online-schema-change: Exiting due to errors while restoring triggers: DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction

Hi,

We encountered the error "Exiting due to errors while restoring triggers: DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction" when using pt-online-schema-change.

We are using pt-online-schema-change to add a new column to a production table, mydb.table1 (the real name is masked), which contains 60 million rows and has a self-defined trigger. After several hours the tool failed and the new table could not be swapped in. The log and error are below:

Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Altering `mydb`.`table1`...
No foreign keys reference `mydb`.`table1`; ignoring --alter-foreign-keys-method.
Creating new table...
CREATE TABLE `mydb`.`_table1_new` (
 **** masked for security **** 
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
Created new table mydb._table1_new OK.
Altering new table...
ALTER TABLE `mydb`.`_table1_new` ADD COLUMN testtest DATETIME(6)
Altered `mydb`.`_table1_new` OK.
2021-01-24T00:26:04 Creating triggers...
2021-01-24T00:26:04 Created triggers OK.
2021-01-24T00:26:04 Copying approximately 62489158 rows...
INSERT LOW_PRIORITY IGNORE INTO `mydb`.`_table1_new` **** masked for security ****  /*pt-online-schema-change 31268 copy nibble*/
SELECT /*!40001 SQL_NO_CACHE */ `pk` FROM `mydb`.`table1` FORCE INDEX(`PRIMARY`) WHERE ((`pk` >= ?)) ORDER BY `pk` LIMIT ?, 2 /*next chunk boundary*/
Copying `mydb`.`table1`:   0% 02:57:05 remain
Copying `mydb`.`table1`:   0% 02:47:04 remain
*** skip long process ***
Copying `mydb`.`table1`:   0% 02:49:56 remain
Copying `mydb`.`table1`:  99% 00:00 remain
2021-01-24T04:40:43 Copied rows OK.
2021-01-24T04:40:43 Adding original triggers to new table.
Exiting due to errors while restoring triggers: DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction [for Statement "LOCK TABLES `mydb`.`_table1_new` WRITE, `mydb`. `table1` WRITE;"] at /bin/pt-online-schema-change line 10109.
2021-01-24T04:41:45 Dropping triggers...
DROP TRIGGER IF EXISTS `mydb`.`pt_osc_mydb_table1_del`
DROP TRIGGER IF EXISTS `mydb`.`pt_osc_mydb_table1_upd`
DROP TRIGGER IF EXISTS `mydb`.`pt_osc_mydb_table1_ins`
2021-01-24T04:41:46 Dropped triggers OK.
Not dropping the new table `mydb`.`_table1_new` because --swap-tables failed.  To drop the new table, execute:
DROP TABLE IF EXISTS `mydb`.`_table1_new`;
# Event  Count
# ====== =====
# INSERT 20400
`mydb`.`table1` was not altered.
        (in cleanup) DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction [for Statement "LOCK TABLES `mydb`.`_table1_new` WRITE, `mydb`. `table1` WRITE;"] at /bin/pt-online-schema-change line 10109.
2021-01-24T04:41:46 Dropping triggers...
DROP TRIGGER IF EXISTS `mydb`.`pt_osc_mydb_table1_del`
DROP TRIGGER IF EXISTS `mydb`.`pt_osc_mydb_table1_upd`
DROP TRIGGER IF EXISTS `mydb`.`pt_osc_mydb_table1_ins`
2021-01-24T04:41:46 Dropped triggers OK.
Not dropping the new table `mydb`.`_table1_new` because --swap-tables failed.  To drop the new table, execute:
DROP TABLE IF EXISTS `mydb`.`_table1_new`;
# Event  Count
# ====== =====
# INSERT 20400
`mydb`.`table1` was not altered.


pt-online-schema-change version:
pt-online-schema-change 3.2.1

Our current setup:
2-node master-master; DB version:
Server version: 5.7.30-33-57-log Percona XtraDB Cluster (GPL), Release rel33, Revision 5dd6d59, WSREP version 31.43, wsrep_31.43

pt-online-schema-change command (we wrapped the command in a bash script; below is the pt-osc part only):
pt-online-schema-change --preserve-triggers --defaults-file="pt-online-schema-change_my.cnf" --max-load=Threads_running:200 --critical-load=Threads_running:210 --nocheck-unique-key-change D={DB},t={TABLE} --socket=/home/mysql/mysql.sock --print --statistics --execute --alter-foreign-keys-method=auto --alter "{CMD}\"" < /dev/null 2>&1 | tee -a /tmp/{date}.txt

The original trigger on mydb.table1:
CREATE TRIGGER after_table1_delete AFTER DELETE ON table1 FOR EACH ROW INSERT INTO table1_delete_log
SET PK = OLD.PK, modifiedTS = NOW(), createdTS = NOW();

Has anyone encountered the above error, and can you advise how to avoid it?

I checked the source code of pt-online-schema-change 3.2.1; line 11552 issues a LOCK TABLES command:

push @$sqls, "LOCK TABLES `$args{db}`.`$args{new_tbl}` WRITE, `$args{db}`. `$args{orig_tbl}` WRITE;";

Is that necessary and safe? I did not expect any table-level lock when using pt-online-schema-change.


@hokan, The lock is needed in this case to “move” the pre-existing triggers from the old table to the new table before swapping them, in order to maintain sync. Try to imagine the lock not existing. WRITEs are happening on old table, executing the trigger. Now ptosc needs to move the trigger. It drops the original trigger and adds it to new table. During that time, writes happen to old table and do not execute the trigger. You now have data inconsistency.
I would look in the ptosc manual at increasing the lock wait timeout. IIRC, this is set to 1 second for the session that ptosc uses. You could also drop the pre-existing trigger first, if your app can handle that, to remove the issue altogether.
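For example, the session timeout can be raised via pt-osc's --set-vars (which is where its session lock_wait_timeout comes from) and the retry behaviour tuned via --tries. A hedged sketch only; the timeout, retry counts, and table names are illustrative, not values tested against this workload:

```shell
# Sketch: raise the metadata-lock wait timeout for the pt-osc session only,
# and allow more retries with a longer pause on the trigger/swap steps.
# All numeric values here are illustrative assumptions.
pt-online-schema-change \
  --preserve-triggers \
  --set-vars "lock_wait_timeout=120" \
  --tries "create_triggers:10:5,drop_triggers:10:5,swap_tables:10:5" \
  --alter "ADD COLUMN testtest DATETIME(6)" \
  D=mydb,t=table1 \
  --execute
```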


@matthewb
Thanks for your quick reply and the detailed explanation of the reason.

I understand the lock is there to maintain sync.

Could you explain further? As I checked, the LOCK TABLES only happens in the "Adding original triggers to new table" step, and the next step is swap_tables. If writes happen on the original table during that window, is there still a chance we miss some data, since the trigger has been removed from the original table and copied to the shadow table that is about to be swapped in?

In my case I want to capture every delete on mydb.table1 and insert it into mydb.table1_delete_log, so after the trigger is dropped on mydb.table1 and before the tables are swapped, will I lose any delete actions on mydb.table1?

The source code I am referring to:

sub create_trigger_sql {
   my (%args) = @_;
   my @required_args = qw(trigger db new_tbl);
   foreach my $arg ( @required_args ) {
      die "I need a $arg argument" unless $args{$arg};
   }

    my $trigger = $args{trigger};
    my $suffix = $args{duplicate_trigger} ? random_suffix() : '';
    if (length("$trigger->{trigger_name}$suffix") > 64) {
        die "New trigger name $trigger->{trigger_name}$suffix is too long";
    }

    my $definer = $args{trigger}->{definer} | '';
    $definer =~ s/@/`@`/;
    $definer = "`$definer`" ;

    my $sqls = [];
    push @$sqls, "LOCK TABLES `$args{db}`.`$args{new_tbl}` WRITE, `$args{db}`. `$args{orig_tbl}` WRITE;";
    push @$sqls, '/*!50003 SET @saved_sql_mode = @@sql_mode */';
    push @$sqls, '/*!50003 SET @saved_cs_client      = @@character_set_client */ ;';
    push @$sqls, '/*!50003 SET @saved_cs_results     = @@character_set_results */ ;';
    push @$sqls, '/*!50003 SET @saved_col_connection = @@collation_connection */ ;';

    push @$sqls, "/*!50003 SET character_set_client  = $trigger->{character_set_client} */ ;";
    push @$sqls, "/*!50003 SET collation_connection  = $trigger->{collation_connection} */ ;";
    push @$sqls, "SET SESSION sql_mode = '$trigger->{sql_mode}'";
    
    push @$sqls, "DROP TRIGGER IF EXISTS `$args{db}`.`$trigger->{trigger_name}` " if ! $args{duplicate_trigger};

    push @$sqls, "CREATE DEFINER=$definer " 
               . "TRIGGER `$args{db}`.`$trigger->{trigger_name}$suffix` "
               . "$trigger->{action_timing} $trigger->{event_manipulation} ON $args{new_tbl}\n"
               . "FOR EACH ROW\n"
               . $trigger->{action_statement};

    push @$sqls, '/*!50003 SET sql_mode              = @saved_sql_mode */ ;';
    push @$sqls, '/*!50003 SET character_set_client  = @saved_cs_client */ ;';
    push @$sqls, '/*!50003 SET character_set_results = @saved_cs_results */';
    push @$sqls, '/*!50003 SET collation_connection  = @saved_col_connection */ ;';
    push @$sqls, 'UNLOCK TABLES';

   return $sqls;
    
}

For the 2 suggestions:

1. Increase lock_wait_timeout:
Since LOCK TABLES must acquire a metadata lock (MDL) of type SHARED_NO_READ_WRITE, I think a long lock_wait_timeout would have a big impact, as all DML on the table would be blocked behind the pending lock.

2. Drop the pre-existing trigger:
Unfortunately, the osc on that table takes more than an hour, so if we drop the pre-existing trigger first we would lose a large amount of data that should have been captured by it.

Just FYI, based on the error log, I think the timeout is controlled by lock_wait_timeout, whose default in pt-osc is 60 seconds:
https://www.percona.com/doc/percona-toolkit/3.0/pt-online-schema-change.html#cmdoption-pt-online-schema-change-set-vars

2021-01-24T04:40:43 Adding original triggers to new table.
Exiting due to errors while restoring triggers: DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction [for Statement "LOCK TABLES `mydb`.`_table1_new` WRITE, `mydb`. `table1` WRITE;"] at /bin/pt-online-schema-change line 10109.
2021-01-24T04:41:45 Dropping triggers...

@hokan, your analysis is exactly what I was thinking was the reason.

Solutions 1 and 2 seem to be your only choices. Solution 2 seems easier, and you can do it manually: before you start ptosc, drop the trigger, then dump all of the primary keys in this table to a text file. That should be very quick. Then run ptosc. After ptosc is done, dump all of the PKs from the new table to a new text file. Write a simple script to read the PKs from each file, determine the difference, and run a DELETE to remove the rows that should not be there.
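The simple diff script described above can be a small shell function around comm. A sketch under assumptions: file names and the /tmp scratch paths are illustrative, and whether you DELETE the resulting rows or log them depends on what you need to reconcile:

```shell
# pk_diff BEFORE AFTER: print PKs present in BEFORE but absent from AFTER,
# i.e. rows that disappeared while the migration ran with the trigger dropped.
# comm -23 prints lines unique to the first of two sorted files.
pk_diff() {
  sort "$1" > /tmp/pkd_before.$$
  sort "$2" > /tmp/pkd_after.$$
  comm -23 /tmp/pkd_before.$$ /tmp/pkd_after.$$
  rm -f /tmp/pkd_before.$$ /tmp/pkd_after.$$
}
```

Usage: `pk_diff pks_before.txt pks_after.txt > missed_rows.txt`, then act on that list however your reconciliation requires.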

Otherwise, you are stuck and ptosc isn’t going to work for you in this case.

Well, while ptosc is running, you might be able to add the trigger to the new table manually and just tell ptosc not to move the triggers. That might be another method.


@matthewb

I just wonder whether we still risk losing data in the gap between "Adding original triggers to new table" and swap_tables; it seems that issuing LOCK TABLES while exchanging the trigger between the old and new tables cannot help much?

Could we just remove the lock in the two lines below to avoid the "Exiting due to errors while restoring triggers: DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction" error I am facing?

line 11552: push @$sqls, "LOCK TABLES `$args{db}`.`$args{new_tbl}` WRITE, `$args{db}`. `$args{orig_tbl}` WRITE;";
....
line 11574: push @$sqls, 'UNLOCK TABLES';

Dunno if it’s related, but restoring triggers is an issue for me too, even when ptosc completes:
See:

https://jira.percona.com/browse/PT-1919

I could only avoid the issue by forcing the FK mode to rebuild_constraints and avoiding drop_swap.
Maybe that can help you.
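Forcing the FK mode instead of leaving it on auto is a single flag change. A hedged sketch (the table name and alter are illustrative, reused from earlier in this thread):

```shell
# Sketch: pin FK handling to rebuild_constraints so pt-osc never falls back
# to drop_swap (which skipped trigger restoration in the linked bug report).
# Names and the alter statement are illustrative assumptions.
pt-online-schema-change \
  --preserve-triggers \
  --alter-foreign-keys-method=rebuild_constraints \
  --alter "ADD COLUMN testtest DATETIME(6)" \
  D=mydb,t=table1 \
  --execute
```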


@hokan, I can’t say for certain about your situation. pt-osc has been the most popular tool for online changes for the past 8+ years, and I can’t imagine you would be the first person to hit such an issue with something as simple as a delete trigger. I’m not a developer on the toolkit team, so all I can tell you is that thousands of people use these tools all the time without loss of data.

I HIGHLY recommend that you test on your dev/testing infrastructure before doing production to make sure.

My original suggestion of increasing the lock wait timeout via pt-osc still stands as the best solution to your issue.


Here, locking tables while many connections are churning is likely difficult. MySQL's usual lock wait timeout is 50 seconds, which is short enough to fail when slow queries are ongoing. Maybe pt-osc adds a session override of the lock timeout in some places (search the source for "set innodb_lock_wait_timeout="), but it is unlikely to be infinite. Allowing a custom lock wait timeout on every connection made by ptosc should definitely be an implementation fix.

As for testing: in PT-1919 the issue occurred only in prod, because the behavior of ptosc changed with the volume of data. So read about all the heuristic behaviors the doc mentions; there could be more undocumented ones too. I just need to restate: --alter-foreign-keys-method=auto is dangerous (at least with 3.2.1), as it may choose drop_swap, which will NOT restore triggers.


There is no "maybe". I can confirm that this is the behavior of pt-osc: it always sets its own session lock wait timeout to 1s so as not to impact other aspects of MySQL. However, if your system is busy, pt-osc will always fail with such a short timeout. I would raise the timeout to 5s at minimum.


OK, it should keep the 1s for ordinary copy work, but for closing out the migration (the LOCK TABLES for the final trigger/FK/swap work) it could be more aggressive; otherwise I can't see how it would ever grab a broad table lock if it returns to the end of the request queue every time (maybe even with a growing timeout as failures occur).

Ptosc is used precisely for online changes that are hard to do while a table is busy. It would not be too harsh for ptosc to queue for a table lock for the regular timeout (or even longer) to do the important finalizing trigger/FK installation. At least it should be a parameter. --set-vars and --tries give a bit of control, but I don't see any distinction between low- and high-priority statements.


MySQL/InnoDB doesn’t give you the ability to distinguish high/low-priority statements, so there’s nothing that can be done from a user standpoint either. A timeout of 5s does not imply 5s of waiting (the lock may be acquired after 0.2s), so I don’t understand why you would need to split high/low statements anyway.


I’m not alluding to MySQL’s concept of priority; I’m proposing distinct timeout tolerances among ptosc statements.
Row copies and most throttled work can certainly use a 1s lock timeout.
But terminating the migration with a broad, hard-to-get full-table write lock should be willing to wait a lot longer, so that it stands a chance of acquiring the lock. Otherwise it is very unlikely that a table lock would be obtainable within 1s on a table that sees incessant concurrent queries.
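The growing-timeout idea above can be sketched as a retry loop that doubles the timeout after each failure. A policy illustration only, not pt-osc code; try_lock is a hypothetical command that attempts the lock with the given timeout and returns non-zero on failure:

```shell
# Sketch of a "growing timeout" retry policy (not actual pt-osc behavior).
# lock_with_backoff TRIES INITIAL_TIMEOUT MAX_TIMEOUT
# Calls the (assumed) try_lock TIMEOUT command, doubling the timeout after
# each failed attempt, capped at MAX_TIMEOUT. Returns 0 once the lock is held.
lock_with_backoff() {
  tries=$1; timeout=$2; max=$3
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if try_lock "$timeout"; then
      return 0
    fi
    timeout=$((timeout * 2))
    [ "$timeout" -gt "$max" ] && timeout=$max
    attempt=$((attempt + 1))
  done
  return 1   # all attempts timed out
}
```

With an initial timeout of 1s the attempts would wait up to 1s, 2s, 4s, and so on, so later attempts actually stand a chance of queueing long enough to win the lock.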


Hi @bob, that does make sense. I suggest you file a feature request at https://jira.percona.com so our engineering team can take a look and possibly implement such a suggestion.


Added: https://jira.percona.com/browse/PT-1928


FYI, it was identified that 3.3.0 doesn’t restore BEFORE triggers (in either FK mode, drop_swap or rebuild_constraints).
