Hi! My PXC 5.7 cluster is crash looping with the following error:
2021-11-22T08:57:27.796235Z 0 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.7.34-37 started; log sequence number 2264189188
2021-11-22T08:57:27.796281Z 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2021-11-22T08:57:27.797037Z 0 [Note] Plugin 'FEDERATED' is disabled.
2021-11-22T08:57:27.809704Z 0 [Note] InnoDB: Starting recovery for XA transactions...
2021-11-22T08:57:27.809727Z 0 [Note] InnoDB: Transaction 12760 in prepared state after recovery
2021-11-22T08:57:27.809731Z 0 [Note] InnoDB: Transaction contains changes to 1 rows
2021-11-22T08:57:27.809736Z 0 [Note] InnoDB: 1 transactions in prepared state after recovery
2021-11-22T08:57:27.809739Z 0 [Note] Found 1 prepared transaction(s) in InnoDB
2021-11-22T08:57:27.809753Z 0 [Warning] WSREP: Discovered discontinuity in recovered wsrep transaction XIDs. Truncating the recovery list to 0 entries
2021-11-22T08:57:27.809757Z 0 [Note] WSREP: Last wsrep seqno to be recovered 2656
2021-11-22T08:57:27.809852Z 0 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
2021-11-22T08:57:27.809862Z 0 [ERROR] Aborting
I don’t know how to proceed, since the cluster runs under the operator (installed with the pxc-operator chart) and the database itself was installed with the pxc-db chart.
So, what should I do? Why doesn’t the operator handle this case automatically? Why is only 1 replica of 3 crash looping while the others are OK? And why are all the HAProxy pods in front of the PXC instances unready (so there is no HA)?
Hello @Antoine,
I would manually destroy that pod and let the operator recreate it so that it forces a fresh SST from one of the other nodes. Yes, I think the operator should handle this. Can you please open a bug report at https://jira.percona.com with all the config files and other info?
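For reference, here is a rough sketch of what that could look like, written as a small Python helper that only builds (and by default just prints) the kubectl commands. The pod name cluster1-pxc-2 and the namespace pxc are made-up placeholders, and the PVC name datadir-<pod> is an assumption about how the data volume is named in your install, so check with kubectl get pods,pvc first. Also keep in mind that if the data directory sits on a persistent volume, deleting only the pod will probably bring it back with the same data and the same prepared transaction; the wipe_data option below is my assumption about what is needed to really get a full SST, and it permanently discards that node's local data.

```python
import subprocess


def resync_commands(pod, namespace, wipe_data=False):
    """Build the kubectl commands to recreate one PXC pod.

    With wipe_data=True the pod's PVC is deleted too, so the node should
    come back with an empty datadir. The PVC name "datadir-<pod>" is an
    assumption; verify it before running anything.
    """
    cmds = []
    if wipe_data:
        # --wait=false: the PVC stays in Terminating until the pod that
        # uses it is gone, so do not block on this step.
        cmds.append(["kubectl", "-n", namespace, "delete", "pvc",
                     f"datadir-{pod}", "--wait=false"])
    cmds.append(["kubectl", "-n", namespace, "delete", "pod", pod])
    return cmds


def run(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical pod and namespace names; dry run only.
    run(resync_commands("cluster1-pxc-2", "pxc", wipe_data=True))
```

Run it as-is to see the commands, and only switch dry_run to False once the names match your cluster and the other two nodes are confirmed healthy.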
I see the default value of innodb_flush_log_at_trx_commit is 0. Could this be the problem? I have a very long transaction each day (about an hour). I think the instance crashed during this one.
The other problem that worries me is that all the HAProxy pods in front of the 3 instances are in CrashLoopBackOff. Why? I would expect them to stay ready, since 2 of the 3 PXC instances are ready.