SST donor being removed from cluster by donee?

Edward_Hibbert · February 3, 2018, 4:02am

I want to check whether this is a bug to report via the Issues, or something I have misconfigured.

I got into a situation where I had one server up (A) and needed to do SSTs to the other two (B, C) in the cluster. A was under a fair amount of load as the only server left running. Repeatedly, when trying to serve an SST to B, A entered the “WSREP has not yet prepared node for application use” state. C was down throughout.

This meant that the donor and hence live system was unusable, so initially I interrupted the SST. Eventually I left the system down and let the SST continue, but it eventually failed, also with:

180202 05:53:11 [01] …done
180202 05:53:11 Executing LOCK TABLES FOR BACKUP…
Error: failed to execute query LOCK TABLES FOR BACKUP: WSREP has not yet prepared node for application use
180202 05:53:11 >> log scanned up to (10865793398401)

Eventually I took other steps to reduce the load on A to the point where the SST could complete.

What I surmise is happening here is that A was under so much load that it wasn’t responsive, and then we had a split brain between A and B. I don’t know the logic well, so this may be simplistic or wrong. If so, though, it seems wrong - both A and B know that A has a working copy of the database and that A is serving to B which doesn’t.

I have a logfile from A for this (attached); an example is around 2018-02-02T03:10:47.399900Z

db2.ilovefreegle.org.zip (143 KB)

krunalbauskar · February 5, 2018, 5:52am

Hello,

I went over the said details and let me tell you that the said scenario is valid and probably something is wrong in configuration or run that is probably causing the problem.

I also went over the said log file. It has multiple run output and last output is in a half-cooked state so difficult to predict what happened to the last instance of SST try.

Anyway, let me put forth what I understood from the log file

It seems like you have a flaky connection among the nodes.

20180202 02:04:08.171 WSREP_SST: [INFO] Streaming GTID file before SST
20180202 02:04:18.697 WSREP_SST: [INFO] Streaming the backup to joiner at 212.110.173.53 4444
2018/02/02 02:04:18 socat[18106] E connect(3, AF=2 212.110.173.53:4444, 16): Connection refused

Your workload (fortunately, it is blocked due to pxc_strict_mode=ENFORCING) seems to take local locks.

2018-02-02T11:47:35.845518Z 1167017 [ERROR] WSREP: Percona-XtraDB-Cluster prohibits use of GET_LOCK with pxc_strict_mode = ENFORCING

Also, there is involvement of non-transactional table.
Your DONOR node seems to enter a state that doesn’t allow it to service query. From the log file you can find out that just before the message the node entered non-primary.
This is likely again due to flaky n/w connection.

2018-02-02T06:02:54.111673Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,33ea1383,12)
memb {
33ea1383,0
}
joined {
}
left {
}
partitioned {
aee07923,0
}
)
2018-02-02T06:02:54.370605Z 0 [Note] WSREP: Current view of cluster as seen by this node

With that all input I would first hunt to make n/w stable.

Edward_Hibbert · February 20, 2019, 7:30am

Very belatedly, I realise I never thanked you for this. What was happening, best I can tell, is that there were disk issues on the servers which were causing nodes to run very slowly, and therefore to drop network calls.

Speed up the servers, everything works nicely. Thanks for you time in looking at this.

lorraine.pocklington · February 20, 2019, 10:09am

On behalf of Krunal and the team, thanks for remembering.

It’s never too late and always appreciated.

Topic		Replies	Views
The donor node gets locked up during SST Percona XtraDB Cluster 5.x	1	625	July 25, 2016
Crashed Cluste, Nodes not Joinig Percona XtraDB Cluster 5.x percona	9	1144	September 8, 2022
SST Failure XtraDB Cluster 5.6.28 Percona XtraDB Cluster 5.x	1	679	November 9, 2016
Stalled on STT even when other nodes are down Percona XtraDB Cluster 5.x mysql , percona	2	1143	December 14, 2021
The document description conflict Percona XtraDB Cluster 8.x	1	603	February 14, 2023

SST donor being removed from cluster by donee?

Related topics