PXC stops processing queries

Hello!
We run a Percona XtraDB Cluster with 3 nodes plus 2 garbd arbitrators, and roughly once a week we have to restart the full cluster because it stops executing queries. At first, restarting node2 alone was enough to get it working again, but that no longer helps and a full cluster restart is now required before queries are processed again.
So far we have captured logs with “wsrep_debug=1”, but we have not yet collected the output of “show processlist” and “show engine innodb status\G” during an incident. We are preparing to do so, and in the meantime we would like to ask:

  1. Is there anything else we can do/try to collect more information that would help troubleshoot this issue?

So far the logs show no “[ERROR]” or “[Warning]” entries before the problem occurs. The only noticeable thing is a flood (roughly 3 to 10 notes per minute) of:

[Note] [MY-000000] [WSREP] ha_rollback_trans(774530, FALSE) rolled back: <QUERRY>: XXLock wait timeout exceeded; try restarting transaction;
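
To put a number on that rate, something like the sketch below could count these notes per minute in a saved error log. It assumes each log line starts with an ISO-8601 timestamp (as MySQL 8.0 error logs normally do); the function name and the marker string are only illustrative choices, not anything provided by PXC.

```python
import re
from collections import Counter

# Assumption: every log line starts with an ISO-8601 timestamp,
# e.g. "2024-01-01T10:00:01.000000Z 12 [Note] ...".
TS_MINUTE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")

# Substring taken from the note quoted above.
MARKER = "Lock wait timeout exceeded"


def rollbacks_per_minute(lines):
    """Count lines containing MARKER, grouped by the minute of their timestamp."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = TS_MINUTE.match(line)
        if match:  # lines without a leading timestamp are skipped
            counts[match.group(1)] += 1
    return counts
```

Feeding it an open log file (`rollbacks_per_minute(open(path))`, with `path` pointing at the node's error log) gives a per-minute count, which makes it easier to compare the rate before a hang with the rate during normal catch-up.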

After bootstrapping the cluster, while it catches up, there are around 30 times more of these log entries per minute (for about 2 minutes) without any problems. Anyway, the second question for now:

  2. Is there anything else we could try instead of restarting the whole cluster?

Hey @Cirrus.pl,

This is a generic error from the InnoDB engine. It means query A started a transaction which locked some rows. While A still held those locks, query B started and tried to acquire a lock on the same rows. B waits for A to release them, and if that does not happen within innodb_lock_wait_timeout (50 seconds by default), B gives up and you get this error.

Please ensure that you have proper indexes, since InnoDB uses indexes to manage row locks. If you don’t have good indexes then you’ll see this issue a bunch.

Definitely try to get SHOW ENGINE INNODB STATUS, which will show transactions and how many rows each txn has locked.