Cluster is stuck after running Xtrabackup on one of it's nodes

Oleksandr_Bezpiatov · February 19, 2022, 2:55pm

Hi everyone.

Today we had the whole cluster stuck after running xtrabackup on pxc3 node of our cluster. After we discovered that whole cluster is PAUSED due to flow control, we stopped xtrabackup process on pxc3, but the problem was not resolved even after waiting for few minutes.

Then we decided to shutdown pxc3 (thought that concrete node was the root cause of the problem), but this was also not helpful at all.

The only way was to shutdown all the nodes, and bootstrap cluster from ground zero.

We are running cluster in master-slave mode (pxc1 is RW node, pxc2-3 = RO).
WSREP provider options are next:
wsrep_provider_options="gcache.size=2G;gcs.fc_master_slave=YES;socket.ssl_key=server-key.pem;socket.ssl_cert=server-cert.pem;socket.ssl_ca=ca.pem;pc.bootstrap=YES"

Who can advice what could be the root cause of such cluster deny of service and how to avoid such crashes in future? Thank you.

PMM PXC Cluster status screen

pmm and server time has 2 hours diff

Screenshot from 2022-02-19 16-54-171792×970 130 KB

Significant logs from server below:

pxc1

2022-02-19T11:36:01.882196Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074265, -110 (Connection timed out)
2022-02-19T11:36:13.909114Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074305, -110 (Connection timed out)
2022-02-19T11:36:19.097684Z 12635837 [Warning] [MY-000000] [Server] Too many connections
2022-02-19T11:36:19.101082Z 12635838 [Warning] [MY-000000] [Server] Too many connections
2022-02-19T11:36:19.104036Z 12635839 [Warning] [MY-000000] [Server] Too many connections
......

pxc2

2022-02-19T11:35:32.010248Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074253, -110 (Connection timed out)
2022-02-19T11:36:26.785571Z 687112 [Warning] [MY-010055] [Server] IP address 'xxx.xxx.130.62' could not be resolved: Name or service not known
2022-02-19T11:36:30.574214Z 1 [Note] [MY-011825] [InnoDB] *** Priority TRANSACTION:
TRANSACTION 74917290641, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 
MySQL thread id 1, OS thread handle 140237758478080, query id 88790267 Applying batch of row changes (update)
2022-02-19T11:36:30.574353Z 1 [Note] [MY-011825] [InnoDB] *** Victim TRANSACTION:
TRANSACTION 74917290530, ACTIVE 8 sec
, undo log entries 1
MySQL thread id 687016, OS thread handle 140177082525440, query id 88786684 xxx.xxx.130.18 dbconnection wsrep: replicating and certifying write set(-1)
commit
2022-02-19T11:36:30.574413Z 1 [Note] [MY-011825] [InnoDB] *** WAITING FOR THIS LOCK TO BE GRANTED:
2022-02-19T11:36:30.574430Z 1 [Note] [MY-011825] [InnoDB]  SQL1: 
2022-02-19T11:36:30.574450Z 1 [Note] [MY-011825] [InnoDB]  SQL2: commit
2022-02-19T11:36:30.574470Z 1 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2022-02-19T11:36:30.574488Z 1 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:

2022-02-19T11:36:30.574505Z 1 [Note] [MY-000000] [WSREP] Winning thread: 
   THD: 1, mode: high priority, state: exec, conflict: executing, seqno: 2032074496
   SQL: (null)

2022-02-19T11:36:30.574522Z 1 [Note] [MY-000000] [WSREP] Victim thread: 
   THD: 687016, mode: local, state: exec, conflict: certifying, seqno: -1
   SQL: commit

2022-02-19T11:36:30.574889Z 1 [Note] [MY-011825] [InnoDB] Killed transaction: ID: 0 in hit list - MySQL thread id 687016, OS thread handle 140177082525440, query id 88786684 xxx.xxx.130.18 dbconnection wsrep: replicating and certifying write set(-1) commit
2022-02-19T11:36:30.574923Z 1 [Note] [MY-011825] [InnoDB] WSREP seqnos, BF: 2032074496, victim: -1
2022-02-19T11:36:30.574942Z 1 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2022-02-19T11:36:30.574974Z 1 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:

2022-02-19T11:36:30.574990Z 1 [Note] [MY-000000] [WSREP] Winning thread: 
   THD: 1, mode: high priority, state: exec, conflict: executing, seqno: 2032074496
   SQL: (null)

2022-02-19T11:36:30.575004Z 1 [Note] [MY-000000] [WSREP] Victim thread: 
   THD: 687016, mode: local, state: exec, conflict: must_abort, seqno: -1
   SQL: commit

pxc3

2022-02-19T11:36:18.308462Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074244, -110 (Connection timed out)
2022-02-19T11:36:33.359968Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074282, -110 (Connection timed out)
2022-02-19T11:36:45.946583Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074305, -110 (Connection timed out)
2022-02-19T11:36:58.280825Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074331, -110 (Connection timed out)
2022-02-19T11:37:08.982176Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074351, -110 (Connection timed out)
2022-02-19T11:37:16.108177Z 568894 [Note] [MY-010914] [Server] Aborted connection 568894 to db: 'unconnected' user: 'master' host: 'localhost' (Got an error reading communication packets).
2022-02-19T11:42:50.268427Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.23-14.1).
2022-02-19T11:42:50.268534Z 0 [Note] [MY-000000] [WSREP] Received shutdown signal. Will sleep for 10 secs before initiating shutdown. pxc_maint_mode switched to SHUTDOWN

matthewb · February 20, 2022, 1:20am

Please confirm versions of xtrabackup, and PXC. Also please provide the exact command used to take the backup. A PXC node in the process of a backup should desync from the cluster thereby turning off any flow control mechanism, preventing the very issue you are describing.

Oleksandr_Bezpiatov · February 20, 2022, 7:39am

Hi, here they are:

mysql  Ver 8.0.23-14.1 for Linux on x86_64 (Percona XtraDB Cluster (GPL), Release rel14, Revision d3b9a1d, WSREP version 26.4.3)

xtrabackup version 8.0.26-18 based on MySQL server 8.0.26 Linux (x86_64) (revision id: 4aecf82)

xtrabackup --backup --parallel=8 --compress --compress-threads=8 --no-server-version-check --target-dir=/backup/full

matthewb · February 22, 2022, 3:49pm

Remove this from your my.cnf. This parameter is only used during recovery situations.

Also, your PXC1 is reporting too many connections, can you increase max_connections? Every log you showed also has some network timeout related errors. You’re sure no firewall rules between the nodes?

Topic		Replies	Views
Percona XtraDB Cluster stucks when softfailed node returns to cluster Percona XtraDB Cluster 8.x mysql , percona , closed-no-reply	0	759	August 22, 2022
Node not joining cluster Percona XtraDB Cluster 5.x	1	2948	February 13, 2019
Cluster re-sync causes much lock Percona XtraDB Cluster 5.x mysql	2	882	June 23, 2021
New node is not able to join the cluster Percona XtraDB Cluster 8.x	1	1431	May 7, 2020
One node gets all queries stuck and collapse due to too many connections Percona XtraDB Cluster 5.x	8	3329	December 17, 2017

Cluster is stuck after running Xtrabackup on one of it's nodes

Related topics