Can't reconnect after server down

Hello, I'm trying to rejoin a server to a two-node cluster. When I try to connect I get this error:

    * Stale sst_in_progress file in datadir mysqld
    * Starting MySQL (Percona XtraDB Cluster) database server mysqld
    * State transfer in progress, setting sleep higher mysqld
    * The server quit without updating PID file (/var/run/mysqld/mysqld.pid).
      [fail]

My log file shows this:

2017-08-15T13:32:50.426706Z 0 [Note] WSREP: (39a8c13b, 'tcp://0.0.0.0:4567') connection to peer 39a8c13b with addr tcp://172.25.0.7:4567 timed out, no messages seen in PT3S
2017-08-15T13:32:50.426858Z 0 [Note] WSREP: (39a8c13b, 'tcp://0.0.0.0:4567') turning message relay requesting off
20170815 09:32:58.825 WSREP_SST: [INFO] ...Waiting for SST streaming to complete!
20170815 09:32:58.958 WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
20170815 09:32:58.961 WSREP_SST: [ERROR] xtrabackup_checkpoints missing. xtrabackup/SST failed on DONOR. Check DONOR log
20170815 09:32:58.962 WSREP_SST: [ERROR] ******************************************************
20170815 09:32:58.964 WSREP_SST: [ERROR] Cleanup after exit with status:2
2017-08-15T13:32:58.983723Z 0 [Warning] WSREP: 0.0 (pxc1): State transfer to 1.0 (pxc2) failed: -22 (Invalid argument)
2017-08-15T13:32:58.983749Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():765: Will never receive state. Need to abort.
2017-08-15T13:32:58.983781Z 0 [Note] WSREP: gcomm: terminating thread
2017-08-15T13:32:58.983798Z 0 [Note] WSREP: gcomm: joining thread
2017-08-15T13:32:58.983853Z 0 [Note] WSREP: gcomm: closing backend
2017-08-15T13:32:58.989127Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.25.0.7' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '3665' '' : 2 (No such file or directory)
2017-08-15T13:32:58.989185Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2017-08-15T13:32:58.989197Z 0 [ERROR] WSREP: SST script aborted with error 2 (No such file or directory)
2017-08-15T13:32:58.989220Z 0 [ERROR] WSREP: SST failed: 2 (No such file or directory)
2017-08-15T13:32:58.989247Z 0 [ERROR] Aborting

2017-08-15T13:32:58.989254Z 0 [Note] WSREP: Signalling cancellation of the SST request.
2017-08-15T13:32:58.989289Z 0 [Note] WSREP: SST request was cancelled
2017-08-15T13:32:58.989300Z 0 [Note] Giving 2 client threads a chance to die gracefully
2017-08-15T13:32:58.989382Z 2 [Note] WSREP: Closing send monitor...
2017-08-15T13:32:58.989404Z 2 [Note] WSREP: Closed send monitor.
2017-08-15T13:33:00.989467Z 0 [Note] WSREP: Waiting for active wsrep applier to exit
2017-08-15T13:33:00.989542Z 1 [Note] WSREP: rollbacker thread exiting
2017-08-15T13:33:01.989602Z 0 [Note] WSREP: Waiting for active wsrep applier to exit
2017-08-15T13:33:02.427468Z 0 [Note] WSREP: (39a8c13b, 'tcp://0.0.0.0:4567') connection to peer 36ab0f9d with addr tcp://172.25.0.14:4567 timed out, no messages seen in PT3S
2017-08-15T13:33:02.427654Z 0 [Note] WSREP: (39a8c13b, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.25.0.14:4567
2017-08-15T13:33:02.989761Z 0 [Note] WSREP: Waiting for active wsrep applier to exit
2017-08-15T13:33:03.927460Z 0 [Note] WSREP: (39a8c13b, 'tcp://0.0.0.0:4567') reconnecting to 36ab0f9d (tcp://172.25.0.14:4567), attempt 0
2017-08-15T13:33:03.989891Z 0 [Note] WSREP: Waiting for active wsrep applier to exit
2017-08-15T13:33:04.484122Z 0 [Note] WSREP: evs::proto(39a8c13b, LEAVING, view_id(REG,36ab0f9d,32)) suspecting node: 36ab0f9d
2017-08-15T13:33:04.484142Z 0 [Note] WSREP: evs::proto(39a8c13b, LEAVING, view_id(REG,36ab0f9d,32)) suspected node without join message, declaring inactive
2017-08-15T13:33:04.484172Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,36ab0f9d,32)
memb {
39a8c13b,0
}
joined {
}
left {
}
partitioned {
36ab0f9d,0
}
)

And when I view the log file on the donor, I don't see any attempt made to join the cluster. Any idea what I'm doing wrong here?

That's weird.
a. Do you have a network connectivity issue between the 2 nodes?
b. The DONOR node would generally have the failure captured in its logs. It is a bit odd that the DONOR isn't being contacted at all. Do you see the other node entering the DONOR state, or does it remain in the SYNCED state? Can you re-check the IP configuration?
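One way to check point (b) is to watch `wsrep_local_state_comment` on the other node while the joiner starts up. A minimal sketch below; since I can't query your server, the mysql client output is a hypothetical sample fed in as a string — replace it with a live `mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"` on the would-be donor:

```shell
# Hypothetical sample of the tab-separated output the mysql client
# prints for SHOW GLOBAL STATUS; replace with a live query.
status_sample=$(printf 'wsrep_local_state_comment\tSynced\n')

# Extract the state the donor reports. "Synced" here would mean it
# never entered the Donor/Desynced state, i.e. the joiner's SST
# request likely never reached it.
state=$(printf '%s\n' "$status_sample" | awk -F'\t' '$1 == "wsrep_local_state_comment" {print $2}')
echo "donor state: $state"
```

If the donor stays Synced the whole time while the joiner aborts, that points at the SST request never arriving (addressing/firewall on ports 4444/4567/4568) rather than a failure during the transfer itself.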

This occurred when the primary node ran out of disk space, which caused the Percona server to crash and stop responding. There are no connectivity issues between the 2 nodes; both are on the same gigabit switch and are also syncing files for a web service.

Also, the donor node has not updated its logs since the failure, but it continues to run.

Unfortunately, MySQL is generally not good at handling failures like running out of disk space. I assume you are able to restore your cluster after correcting that failure by providing more disk space.
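For what it's worth, the "Stale sst_in_progress file in datadir" warning in your startup output suggests leftovers from the aborted transfer are blocking the retry. A rough cleanup sketch, with the datadir path as an assumption (a scratch directory is used here so the example is safe to run; on the real joiner it would be your configured datadir, and you would stop mysqld and back things up first):

```shell
# Assumed datadir; on a real joiner this would be e.g. /var/lib/mysql,
# taken from your my.cnf. Using a scratch dir here for safety.
DATADIR=${DATADIR:-/tmp/pxc-demo-datadir}
mkdir -p "$DATADIR"
touch "$DATADIR/sst_in_progress"   # simulate the stale marker left by the failed SST

# With mysqld stopped, remove the leftover marker so the next
# state transfer can start clean:
rm -f "$DATADIR/sst_in_progress"
echo "stale marker removed"
```

After that, restarting the joiner should trigger a fresh SST attempt rather than tripping over the stale state.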

No, when I try to restore the cluster I get the errors above. I will end up taking the entire cluster down and starting from scratch, since none of the recovery procedures seem to work correctly yet (for me, anyway).
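Before wiping everything, it may be worth comparing each node's `grastate.dat` and bootstrapping from the node with the highest committed seqno (via the bootstrap mechanism your packaging provides, e.g. `/etc/init.d/mysql bootstrap-pxc`). A sketch of reading the seqno; the file contents below are hypothetical (made-up UUID and seqno), on a real node the file lives in the datadir:

```shell
# Hypothetical grastate.dat; on a real node: $DATADIR/grastate.dat
GRASTATE=/tmp/demo-grastate.dat
cat > "$GRASTATE" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   1234
EOF

# The node with the highest seqno (and not -1, which means an
# unclean shutdown) is the safest bootstrap candidate.
seqno=$(awk '/^seqno:/ {print $2}' "$GRASTATE")
echo "seqno: $seqno"
```

A seqno of -1 on every node means none of them shut down cleanly, in which case running mysqld with `--wsrep-recover` to extract the recovered position from the log is the usual next step before deciding which node to bootstrap.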