Pt-online-schema-change wsrep deadlock found

hristomitev · February 19, 2024, 2:28pm

we use ptosc with liquibase to apply changes to schema. We have a small cluster with thousands of databases, each representing a tenant. This means that each time we need to apply a structure change to the schema it would take one process a long time to go over all databases to apply the changes.

Recently we started experimenting with scaling the number of liquibase containers with the idea of applying changes to more than 1 database at a time in a test environment (no other traffic). As soon as we went to applying changes to 5 databases simultaneously we were consistently hit with one or more liquibase processes failing with deadlocks. A blank tenant would need to apply about ~10 structure patches to catch up. The tables that the patches are being applied are empty in the test case. Liquibase/ptosc always fails on this exact step:

08:22:04.769 INFO  [liquibase.Liquibase]: Error creating triggers: DBD::mysql::db selectall_arrayref failed: WSREP detected deadlock/conflict and aborted the transaction. Try restarting the transaction [for Statement "SELECT TRIGGER_SCHEMA, TRIGGER_NAME, DEFINER, ACTION_STATEMENT, SQL_MODE,        CHARACTER_SET_CLIENT, COLLATION_CONNECTION, EVENT_MANIPULATION, ACTION_TIMING   FROM INFORMATION_SCHEMA.TRIGGERS  WHERE EVENT_MANIPULATION = ?    AND ACTION_TIMING = ?    AND TRIGGER_SCHEMA = ?    AND EVENT_OBJECT_TABLE = ?"] at /usr/bin/pt-online-schema-change line 11444.

While trying to debug this, we were monitoring the full processlist on the node and no process hold an active query for any prolonged amount of time (>1s). We looked at the innodb engine status and there was no deadlock being reported as well. We tried with pt-deadlock-logger, again no lock was logged. There are still some things we will be trying out as time allows, but just wanted to ask the community if there is anything obvious that im missing. The lock wait timeout is not configured in my.cnf, and the node reports the value at default - 50s.

This is on mysql 5.7.40-31.63.1.el7, ptosc - 3.5.4-2.el7. The cluster is running 3 nodes with one of those being a writer node.

matthewb · February 21, 2024, 12:28am

Can you confirm that you are running liquidbase only against 1 node within a cluster? Attempting to run multiple ALTERs at the same time on the same cluster from different nodes will not work.

hristomitev · February 21, 2024, 6:39am

Can confirm that all alter statements are run against a single node, each time the same node.
There are multiple liquibase / ptosc processes running at the same time, all against a single node.

matthewb · February 21, 2024, 4:04pm

You need to reduce the number of parallel processes. With too many processes, you will get deadlocks.

hristomitev · February 21, 2024, 4:31pm

Yes, thats what we are experiencing. The weird part is that none of the standard tooling we’ve used (explained above) report or see the deadlock. Which is preventing us from trying to improve that specific bit in the process.

matthewb · February 21, 2024, 4:44pm

none of the standard tooling we’ve used (explained above) report or see the deadlock

Because it’s not an InnoDB deadlock; it’s a WSREP deadlock. The name is misleading as it isn’t so much as a row-lock-deadlock but more of an internal-wsrep-locking-conflict.

hristomitev · February 21, 2024, 5:15pm

That makes sense! Thank you. Are there any tools that help to diagnose / debug / look into the WSREP locks? We would like to understand the bottleneck a bit better.

matthewb · February 21, 2024, 5:31pm

Watch the counters for BFA/LCF (brute-force aborts, local-certification failures). This will tell you that you have transactions originating on other nodes, or local node, which are conflicting with pt-osc’s queries to move the data. Make sure all writes are isolated to a single node, and make sure that node is also the node executing the pt-osc. If you are executing pt-osc on node2 and application is writing to node1, then it is completely understandable why you are getting BFAs.

hristomitev · February 22, 2024, 7:51am

Thanks, will look into those.

Topic		Replies	Views
Using pt-online-schema-change with percona 5.7 Percona Toolkit	2	1211	November 29, 2017
WSREP detected deadlock/conflict and aborted the transaction Percona XtraDB Cluster 5.x	3	15605	December 11, 2017
pt-online-schema-change getting hanged Other MySQL® Questions	2	2526	September 17, 2012
Error Code: 1213. WSREP with concurrent DML when DDL is executed Percona XtraDB Cluster 5.x	3	1267	December 12, 2018
pt-online-schema-change pausing because {something}=0, bug? Percona Toolkit	1	1252	August 31, 2016

Pt-online-schema-change wsrep deadlock found

Related topics