Cluster is stuck after running xtrabackup on one of its nodes

Hi everyone.

Today the whole cluster got stuck after running xtrabackup on the pxc3 node. After we discovered that the whole cluster was PAUSED due to flow control, we stopped the xtrabackup process on pxc3, but the problem was not resolved even after waiting a few minutes.
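
As a side note, a flow-control pause like this can be confirmed on any node with the standard wsrep status counters (a generic check, not output captured from our cluster at the time):

-- Fraction of time replication was paused since the last FLUSH STATUS
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
-- Number of flow-control pause messages this node has sent
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_sent';
-- Length of the local receive queue (a growing queue triggers flow control)
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';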

Then we decided to shut down pxc3 (we thought that particular node was the root cause of the problem), but this did not help either.

The only way out was to shut down all the nodes and bootstrap the cluster from scratch.

We are running the cluster in master-slave mode (pxc1 is the RW node, pxc2 and pxc3 are RO).
The WSREP provider options are as follows:
wsrep_provider_options="gcache.size=2G;gcs.fc_master_slave=YES;socket.ssl_key=server-key.pem;socket.ssl_cert=server-cert.pem;socket.ssl_ca=ca.pem;pc.bootstrap=YES"
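
The options actually in effect can be double-checked on a running node at runtime (a generic check, not from our setup):

-- Show the full provider option string as the running node sees it
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';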

Can anyone advise what the root cause of such a cluster denial of service could be and how to avoid such crashes in the future? Thank you.

[Screenshot: PMM PXC Cluster status screen]

Relevant logs from the servers are below:

pxc1

2022-02-19T11:36:01.882196Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074265, -110 (Connection timed out)
2022-02-19T11:36:13.909114Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074305, -110 (Connection timed out)
2022-02-19T11:36:19.097684Z 12635837 [Warning] [MY-000000] [Server] Too many connections
2022-02-19T11:36:19.101082Z 12635838 [Warning] [MY-000000] [Server] Too many connections
2022-02-19T11:36:19.104036Z 12635839 [Warning] [MY-000000] [Server] Too many connections
......

pxc2

2022-02-19T11:35:32.010248Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074253, -110 (Connection timed out)
2022-02-19T11:36:26.785571Z 687112 [Warning] [MY-010055] [Server] IP address 'xxx.xxx.130.62' could not be resolved: Name or service not known
2022-02-19T11:36:30.574214Z 1 [Note] [MY-011825] [InnoDB] *** Priority TRANSACTION:
TRANSACTION 74917290641, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 
MySQL thread id 1, OS thread handle 140237758478080, query id 88790267 Applying batch of row changes (update)
2022-02-19T11:36:30.574353Z 1 [Note] [MY-011825] [InnoDB] *** Victim TRANSACTION:
TRANSACTION 74917290530, ACTIVE 8 sec
, undo log entries 1
MySQL thread id 687016, OS thread handle 140177082525440, query id 88786684 xxx.xxx.130.18 dbconnection wsrep: replicating and certifying write set(-1)
commit
2022-02-19T11:36:30.574413Z 1 [Note] [MY-011825] [InnoDB] *** WAITING FOR THIS LOCK TO BE GRANTED:
2022-02-19T11:36:30.574430Z 1 [Note] [MY-011825] [InnoDB]  SQL1: 
2022-02-19T11:36:30.574450Z 1 [Note] [MY-011825] [InnoDB]  SQL2: commit
2022-02-19T11:36:30.574470Z 1 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2022-02-19T11:36:30.574488Z 1 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:

2022-02-19T11:36:30.574505Z 1 [Note] [MY-000000] [WSREP] Winning thread: 
   THD: 1, mode: high priority, state: exec, conflict: executing, seqno: 2032074496
   SQL: (null)

2022-02-19T11:36:30.574522Z 1 [Note] [MY-000000] [WSREP] Victim thread: 
   THD: 687016, mode: local, state: exec, conflict: certifying, seqno: -1
   SQL: commit

2022-02-19T11:36:30.574889Z 1 [Note] [MY-011825] [InnoDB] Killed transaction: ID: 0 in hit list - MySQL thread id 687016, OS thread handle 140177082525440, query id 88786684 xxx.xxx.130.18 dbconnection wsrep: replicating and certifying write set(-1) commit
2022-02-19T11:36:30.574923Z 1 [Note] [MY-011825] [InnoDB] WSREP seqnos, BF: 2032074496, victim: -1
2022-02-19T11:36:30.574942Z 1 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2022-02-19T11:36:30.574974Z 1 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:

2022-02-19T11:36:30.574990Z 1 [Note] [MY-000000] [WSREP] Winning thread: 
   THD: 1, mode: high priority, state: exec, conflict: executing, seqno: 2032074496
   SQL: (null)

2022-02-19T11:36:30.575004Z 1 [Note] [MY-000000] [WSREP] Victim thread: 
   THD: 687016, mode: local, state: exec, conflict: must_abort, seqno: -1
   SQL: commit

pxc3

2022-02-19T11:36:18.308462Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074244, -110 (Connection timed out)
2022-02-19T11:36:33.359968Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074282, -110 (Connection timed out)
2022-02-19T11:36:45.946583Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074305, -110 (Connection timed out)
2022-02-19T11:36:58.280825Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074331, -110 (Connection timed out)
2022-02-19T11:37:08.982176Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074351, -110 (Connection timed out)
2022-02-19T11:37:16.108177Z 568894 [Note] [MY-010914] [Server] Aborted connection 568894 to db: 'unconnected' user: 'master' host: 'localhost' (Got an error reading communication packets).
2022-02-19T11:42:50.268427Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.23-14.1).
2022-02-19T11:42:50.268534Z 0 [Note] [MY-000000] [WSREP] Received shutdown signal. Will sleep for 10 secs before initiating shutdown. pxc_maint_mode switched to SHUTDOWN

Please confirm the versions of xtrabackup and PXC. Also, please provide the exact command used to take the backup. A PXC node in the process of a backup should desync from the cluster, thereby turning off any flow control mechanism and preventing the very issue you are describing.
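
If the automatic desync does not happen for some reason, a node can also be desynced manually around the backup window; a rough sketch (the exact placement around your backup command is only an illustration):

-- Before starting the backup: stop this node from emitting flow control
SET GLOBAL wsrep_desync = ON;
-- ... run xtrabackup here ...
-- After the backup finishes and the node has caught up:
SET GLOBAL wsrep_desync = OFF;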


Hi, here they are:

mysql  Ver 8.0.23-14.1 for Linux on x86_64 (Percona XtraDB Cluster (GPL), Release rel14, Revision d3b9a1d, WSREP version 26.4.3)
xtrabackup version 8.0.26-18 based on MySQL server 8.0.26 Linux (x86_64) (revision id: 4aecf82)
xtrabackup --backup --parallel=8 --compress --compress-threads=8 --no-server-version-check --target-dir=/backup/full
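
While the backup runs, whether the node is actually desynced can be checked via the node state (a generic check, not something we captured at the time):

-- Reports "Donor/Desynced" while the node is desynced, "Synced" otherwise
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';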

Remove pc.bootstrap=YES from the wsrep_provider_options in your my.cnf. This parameter is only used during recovery situations.
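
With that removed, the provider options line would look roughly like this (everything else kept as in your current config; adjust to your actual values):

wsrep_provider_options="gcache.size=2G;gcs.fc_master_slave=YES;socket.ssl_key=server-key.pem;socket.ssl_cert=server-cert.pem;socket.ssl_ca=ca.pem"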

Also, your pxc1 is reporting too many connections; can you increase max_connections? Every log you showed also has network-timeout-related errors. Are you sure there are no firewall rules between the nodes?
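
For the connection limit, something along these lines (the value 500 is just an example; size it to your workload):

-- Check how close you have come to the current limit
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
-- Raise the limit at runtime (example value)
SET GLOBAL max_connections = 500;
-- And persist it in my.cnf under [mysqld]:
--   max_connections = 500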
