pt-online-schema-change WSREP deadlock found

We use pt-osc (pt-online-schema-change) with Liquibase to apply schema changes. We have a small cluster with thousands of databases, each representing a tenant. This means that each time we need to apply a structural change, a single process takes a long time to iterate over all the databases.

Recently we started experimenting with scaling the number of Liquibase containers, with the idea of applying changes to more than one database at a time in a test environment (no other traffic). As soon as we went to applying changes to 5 databases simultaneously, we were consistently hit with one or more Liquibase processes failing with deadlocks. A blank tenant needs about ~10 structure patches to catch up, and the tables the patches are applied to are empty in this test case. Liquibase/pt-osc always fails on this exact step:

08:22:04.769 INFO  [liquibase.Liquibase]: Error creating triggers: DBD::mysql::db selectall_arrayref failed: WSREP detected deadlock/conflict and aborted the transaction. Try restarting the transaction [for Statement "SELECT TRIGGER_SCHEMA, TRIGGER_NAME, DEFINER, ACTION_STATEMENT, SQL_MODE,        CHARACTER_SET_CLIENT, COLLATION_CONNECTION, EVENT_MANIPULATION, ACTION_TIMING   FROM INFORMATION_SCHEMA.TRIGGERS  WHERE EVENT_MANIPULATION = ?    AND ACTION_TIMING = ?    AND TRIGGER_SCHEMA = ?    AND EVENT_OBJECT_TABLE = ?"] at /usr/bin/pt-online-schema-change line 11444.

While trying to debug this, we monitored the full processlist on the node, and no process held an active query for any prolonged amount of time (>1 s). We looked at the InnoDB engine status, and no deadlock was reported there either. We tried pt-deadlock-logger; again, no lock was logged. There are still some things we will try as time allows, but I wanted to ask the community if there is anything obvious I'm missing. The lock wait timeout is not configured in my.cnf, and the node reports the default value of 50 s.
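For reference, these are the plain SQL equivalents of the checks described above, all of which came up empty (run on the node receiving the ALTERs):

```sql
-- No query stayed active for longer than ~1 s:
SHOW FULL PROCESSLIST;

-- No "LATEST DETECTED DEADLOCK" section appeared in the output:
SHOW ENGINE INNODB STATUS\G

-- The lock wait timeout is at the 50 s default:
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
```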

This is on MySQL 5.7.40-31.63.1.el7, pt-osc 3.5.4-2.el7. The cluster runs 3 nodes, one of which is the writer node.

Can you confirm that you are running Liquibase against only 1 node within the cluster? Attempting to run multiple ALTERs at the same time on the same cluster from different nodes will not work.

I can confirm that all ALTER statements are run against a single node, the same node each time.
There are multiple Liquibase/pt-osc processes running at the same time, all against that single node.

You need to reduce the number of parallel processes. With too many processes, you will get deadlocks.

Yes, that's what we are experiencing. The weird part is that none of the standard tooling we've used (described above) reports or sees the deadlock, which is preventing us from improving that specific part of the process.

none of the standard tooling we’ve used (explained above) report or see the deadlock

Because it’s not an InnoDB deadlock; it’s a WSREP deadlock. :wink: The name is misleading: it isn’t so much a row-lock deadlock as an internal WSREP locking conflict.

That makes sense! Thank you. Are there any tools that help to diagnose / debug / look into the WSREP locks? We would like to understand the bottleneck a bit better.

Watch the counters for BFA/LCF (brute-force aborts, local certification failures). These will tell you whether you have transactions, originating on other nodes or the local node, that conflict with pt-osc’s queries as it moves the data. Make sure all writes are isolated to a single node, and make sure that node is also the one executing pt-osc. If you are executing pt-osc on node2 while the application is writing to node1, then it is completely understandable why you are getting BFAs.
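To watch those counters, poll the standard Galera status variables on the node executing pt-osc and look at the deltas over time (a minimal sketch; these are cumulative counters since node start, so it is the rate of increase that matters):

```sql
-- BFA: local transactions killed by higher-priority replicated transactions
SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';

-- LCF: local transactions that failed certification at commit time
SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';
```

If these counters climb while only the schema-change processes are running and nothing else is writing, the conflicts are between the parallel pt-osc runs themselves.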

Thanks, will look into those.