Hello!
We have a Percona Cluster with 3 nodes + 2 garb (Galera Arbitrator), and roughly once a week we need to restart the full cluster because it stops executing queries. At first, restarting node2 alone was enough to make it work again, but that no longer helps, and a full restart is now required before the cluster starts processing queries again.
As of today we have managed to get logs with “wsrep_debug=1”, but we have not yet captured the output of “show processlist” and “show engine innodb status\G” during such an incident. We are preparing to do so, and we would like to ask:
- Is there anything else we can do or try to collect more information that would help troubleshoot this issue?
So far the logs have not shown anything labeled “[ERROR]” or “[Warning]” before the problem occurs. The only noticeable thing is a flood (from 3 to 10 notes per minute) of:
[Note] [MY-000000] [WSREP] ha_rollback_trans(774530, FALSE) rolled back: <QUERY>: XXLock wait timeout exceeded; try restarting transaction;
After bootstrapping the cluster, while it is catching up, there are around 30x more of these notes per minute (for about 2 minutes) without any problems. Anyway, the second question for now:
- Is there anything else we could try instead of restarting the whole cluster?
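
For anyone who wants to compare numbers: a rough sketch of how the per-minute rate of the “Lock wait timeout exceeded” notes quoted above could be counted from the error log. It assumes each log line starts with an ISO-style timestamp such as 2024-01-01T12:00:05.123456Z (the usual MySQL 8.0 error log layout, but not verified against this cluster), and the function name rollbacks_per_minute is purely illustrative.

```python
import re
from collections import Counter

# Assumption: every log line starts with a timestamp like
# 2024-01-01T12:00:05.123456Z; we keep only the part up to the minute.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}")
MARKER = "Lock wait timeout exceeded"


def rollbacks_per_minute(lines):
    """Count ha_rollback_trans lock-wait-timeout notes per minute.

    `lines` is any iterable of log lines (a list or an open file).
    Lines without a leading timestamp are skipped.
    """
    counts = Counter()
    for line in lines:
        # Only the WSREP rollback notes that mention the lock wait timeout.
        if "ha_rollback_trans" not in line or MARKER not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open error log file to rollbacks_per_minute and printing the sorted items gives one count per minute, which makes it easier to see whether the rate changes shortly before the cluster stops processing queries.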