SST donor being removed from cluster by donee?

I want to check whether this is a bug to report via the Issues, or something I have misconfigured.

I got into a situation where I had one server up (A) and needed to do SSTs to the other two (B, C) in the cluster. A was under a fair amount of load as the only server left running. Repeatedly, when trying to serve an SST to B, A entered the “WSREP has not yet prepared node for application use” state. C was down throughout.

This meant that the donor and hence live system was unusable, so initially I interrupted the SST. Eventually I left the system down and let the SST continue, but it eventually failed, also with:

180202 05:53:11 [01] …done
180202 05:53:11 Executing LOCK TABLES FOR BACKUP…
Error: failed to execute query LOCK TABLES FOR BACKUP: WSREP has not yet prepared node for application use
180202 05:53:11 >> log scanned up to (10865793398401)

Eventually I took other steps to reduce the load on A to the point where the SST could complete.

What I surmise is happening here is that A was under so much load that it wasn’t responsive, and then we had a split brain between A and B. I don’t know the logic well, so this may be simplistic or wrong. If so, though, it seems wrong - both A and B know that A has a working copy of the database and that A is serving to B which doesn’t.

I have a logfile from A for this (attached); an example is around 2018-02-02T03:10:47.399900Z (143 KB)


I went over the said details and let me tell you that the said scenario is valid and probably something is wrong in configuration or run that is probably causing the problem.

I also went over the said log file. It has multiple run output and last output is in a half-cooked state so difficult to predict what happened to the last instance of SST try.

Anyway, let me put forth what I understood from the log file

  1. It seems like you have a flaky connection among the nodes.

20180202 02:04:08.171 WSREP_SST: [INFO] Streaming GTID file before SST
20180202 02:04:18.697 WSREP_SST: [INFO] Streaming the backup to joiner at 4444
2018/02/02 02:04:18 socat[18106] E connect(3, AF=2, 16): Connection refused

  1. Your workload (fortunately, it is blocked due to pxc_strict_mode=ENFORCING) seems to take local locks.

2018-02-02T11:47:35.845518Z 1167017 [ERROR] WSREP: Percona-XtraDB-Cluster prohibits use of GET_LOCK with pxc_strict_mode = ENFORCING

  1. Also, there is involvement of non-transactional table.

  2. Your DONOR node seems to enter a state that doesn’t allow it to service query. From the log file you can find out that just before the message the node entered non-primary.
    This is likely again due to flaky n/w connection.

2018-02-02T06:02:54.111673Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,33ea1383,12)
memb {
joined {
left {
partitioned {
2018-02-02T06:02:54.370605Z 0 [Note] WSREP: Current view of cluster as seen by this node

With that all input I would first hunt to make n/w stable.

Very belatedly, I realise I never thanked you for this. What was happening, best I can tell, is that there were disk issues on the servers which were causing nodes to run very slowly, and therefore to drop network calls.

Speed up the servers, everything works nicely. Thanks for you time in looking at this.

On behalf of Krunal and the team, thanks for remembering.

It’s never too late and always appreciated. :smiley: