Hi everyone.
Today we had the whole cluster stuck after running xtrabackup
on pxc3
node of our cluster. After we discovered that whole cluster is PAUSED due to flow control, we stopped xtrabackup
process on pxc3
, but the problem was not resolved even after waiting for few minutes.
Then we decided to shutdown pxc3
(thought that concrete node was the root cause of the problem), but this was also not helpful at all.
The only way was to shutdown all the nodes, and bootstrap cluster from ground zero.
We are running cluster in master-slave mode (pxc1 is RW node, pxc2-3 = RO).
WSREP provider options are next:
wsrep_provider_options="gcache.size=2G;gcs.fc_master_slave=YES;socket.ssl_key=server-key.pem;socket.ssl_cert=server-cert.pem;socket.ssl_ca=ca.pem;pc.bootstrap=YES"
Who can advice what could be the root cause of such cluster deny of service and how to avoid such crashes in future? Thank you.
PMM PXC Cluster status screen
- pmm and server time has 2 hours diff
Significant logs from server below:
pxc1
2022-02-19T11:36:01.882196Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074265, -110 (Connection timed out)
2022-02-19T11:36:13.909114Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074305, -110 (Connection timed out)
2022-02-19T11:36:19.097684Z 12635837 [Warning] [MY-000000] [Server] Too many connections
2022-02-19T11:36:19.101082Z 12635838 [Warning] [MY-000000] [Server] Too many connections
2022-02-19T11:36:19.104036Z 12635839 [Warning] [MY-000000] [Server] Too many connections
......
pxc2
2022-02-19T11:35:32.010248Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074253, -110 (Connection timed out)
2022-02-19T11:36:26.785571Z 687112 [Warning] [MY-010055] [Server] IP address 'xxx.xxx.130.62' could not be resolved: Name or service not known
2022-02-19T11:36:30.574214Z 1 [Note] [MY-011825] [InnoDB] *** Priority TRANSACTION:
TRANSACTION 74917290641, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT
MySQL thread id 1, OS thread handle 140237758478080, query id 88790267 Applying batch of row changes (update)
2022-02-19T11:36:30.574353Z 1 [Note] [MY-011825] [InnoDB] *** Victim TRANSACTION:
TRANSACTION 74917290530, ACTIVE 8 sec
, undo log entries 1
MySQL thread id 687016, OS thread handle 140177082525440, query id 88786684 xxx.xxx.130.18 dbconnection wsrep: replicating and certifying write set(-1)
commit
2022-02-19T11:36:30.574413Z 1 [Note] [MY-011825] [InnoDB] *** WAITING FOR THIS LOCK TO BE GRANTED:
2022-02-19T11:36:30.574430Z 1 [Note] [MY-011825] [InnoDB] SQL1:
2022-02-19T11:36:30.574450Z 1 [Note] [MY-011825] [InnoDB] SQL2: commit
2022-02-19T11:36:30.574470Z 1 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2022-02-19T11:36:30.574488Z 1 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:
2022-02-19T11:36:30.574505Z 1 [Note] [MY-000000] [WSREP] Winning thread:
THD: 1, mode: high priority, state: exec, conflict: executing, seqno: 2032074496
SQL: (null)
2022-02-19T11:36:30.574522Z 1 [Note] [MY-000000] [WSREP] Victim thread:
THD: 687016, mode: local, state: exec, conflict: certifying, seqno: -1
SQL: commit
2022-02-19T11:36:30.574889Z 1 [Note] [MY-011825] [InnoDB] Killed transaction: ID: 0 in hit list - MySQL thread id 687016, OS thread handle 140177082525440, query id 88786684 xxx.xxx.130.18 dbconnection wsrep: replicating and certifying write set(-1) commit
2022-02-19T11:36:30.574923Z 1 [Note] [MY-011825] [InnoDB] WSREP seqnos, BF: 2032074496, victim: -1
2022-02-19T11:36:30.574942Z 1 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2022-02-19T11:36:30.574974Z 1 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:
2022-02-19T11:36:30.574990Z 1 [Note] [MY-000000] [WSREP] Winning thread:
THD: 1, mode: high priority, state: exec, conflict: executing, seqno: 2032074496
SQL: (null)
2022-02-19T11:36:30.575004Z 1 [Note] [MY-000000] [WSREP] Victim thread:
THD: 687016, mode: local, state: exec, conflict: must_abort, seqno: -1
SQL: commit
pxc3
2022-02-19T11:36:18.308462Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074244, -110 (Connection timed out)
2022-02-19T11:36:33.359968Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074282, -110 (Connection timed out)
2022-02-19T11:36:45.946583Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074305, -110 (Connection timed out)
2022-02-19T11:36:58.280825Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074331, -110 (Connection timed out)
2022-02-19T11:37:08.982176Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed c824c5ee-1bca-11ec-b561-6f4444c8643f:2032074351, -110 (Connection timed out)
2022-02-19T11:37:16.108177Z 568894 [Note] [MY-010914] [Server] Aborted connection 568894 to db: 'unconnected' user: 'master' host: 'localhost' (Got an error reading communication packets).
2022-02-19T11:42:50.268427Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.23-14.1).
2022-02-19T11:42:50.268534Z 0 [Note] [MY-000000] [WSREP] Received shutdown signal. Will sleep for 10 secs before initiating shutdown. pxc_maint_mode switched to SHUTDOWN