Percona Cluster -- new node fails to start after full SST

Hey, we have a cross-WAN setup in the works.

Site1 : Node1 Node2 Node3
Site2 : Node1

We went ahead and set the my.cnf for Site2:Node1 to say wsrep_sst_donor=Site1:Node3.
Note: initially we got stuck for a while trying to use an IP address; the value has to be the hostname or node name from Node3's config…oops. All good now though.
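For reference, here's roughly what the relevant my.cnf lines on Site2:Node1 look like (the node names for Site1:Node1 and Site1:Node2 are made up here; only the donor and joiner names are taken from the logs below, and the donor has to be referenced by its node name, not its IP):

[mysqld]
# gcomm list of all cluster members across both sites (first two names assumed)
wsrep_cluster_address = gcomm://balpercona1.bal.example.com,balpercona2.bal.example.com,balpercona3.bal.example.com,hqpercona1.hq.example.com
# this node's own name, as the rest of the cluster will see it
wsrep_node_name = hqpercona1.hq.example.com
# must match the donor's wsrep_node_name -- an IP address will not match
wsrep_sst_donor = balpercona3.bal.example.com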

OK, so Site2:Node1 starts the join process, as seen here:

151111 17:58:29 [Note] WSREP: Node 3 (hqpercona1.hq.example.com) requested state transfer from 'balpercona3.bal.example.com'. Selected 0 (balpercona3.bal.example.com)(SYNCED) as donor.
151111 17:58:29 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 269058800)
151111 17:58:29 [Note] WSREP: Requesting state transfer: success, donor: 0

This shows a standard JOINER status, and the SST on the node starting MySQL looks good.

OK, so over an hour later the SST completes…but the service fails to start.

151111 19:16:09 [Warning] WSREP: 0 (balpercona3.example.com): State transfer to 3 (hqpercona1.hq.example.com) failed: -1 (Operation not permitted)
151111 19:16:09 [ERROR] WSREP: gcs/src/gcs_group.cpp:long int gcs_group_handle_join_msg(gcs_group_t*, const gcs_recv_msg_t*)():717: Will never receive state. Need to abort.
151111 19:16:09 [Note] WSREP: gcomm: terminating thread
151111 19:16:09 [Note] WSREP: gcomm: joining thread
151111 19:16:09 [Note] WSREP: gcomm: closing backend
151111 19:16:09 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.16.52.11:4567 tcp://172.16.52.12:4567 tcp://172.16.52.13:4567 tcp://192.168.35.11:4567 tcp://192.168.35.12:4567 tcp://192.168.35.13:4567
151111 19:16:09 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 85b0c608 (tcp://172.16.52.12:4567), attempt 0
151111 19:16:10 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 6c4181be (tcp://192.168.35.11:4567), attempt 0
151111 19:16:10 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 85b0c608 (tcp://192.168.35.12:4567), attempt 0
151111 19:16:10 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 31225cf2 (tcp://192.168.35.13:4567), attempt 0
151111 19:16:11 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 6c4181be (tcp://192.168.35.11:4567), attempt 0
151111 19:16:11 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 85b0c608 (tcp://192.168.35.12:4567), attempt 0
151111 19:16:11 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 31225cf2 (tcp://192.168.35.13:4567), attempt 0
151111 19:16:13 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 6c4181be (tcp://192.168.35.11:4567), attempt 0
151111 19:16:13 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 85b0c608 (tcp://192.168.35.12:4567), attempt 0
151111 19:16:13 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 31225cf2 (tcp://192.168.35.13:4567), attempt 0
151111 19:16:14 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 6c4181be (tcp://172.16.52.11:4567), attempt 0
151111 19:16:14 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 31225cf2 (tcp://172.16.52.13:4567), attempt 0
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspecting node: 31225cf2
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspected node without join message, declaring inactive
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspecting node: 6c4181be
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspected node without join message, declaring inactive
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspecting node: 85b0c608
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspected node without join message, declaring inactive
151111 19:16:14 [Note] WSREP: gcomm: closed
151111 19:16:14 [Note] WSREP: /usr/sbin/mysqld: Terminated.
151111 19:16:14 mysqld_safe mysqld from pid file /var/lib/mysql/hqpercona1.hq.example.com.pid ended

So my question would be…why do you think the final operation is failing?

There are some logs missing here, and usually the best way to diagnose a problem like this is to have the full logs from all members of the cluster. An error like this can mean a lot of things, and it would be hard to come to a root cause without looking at the complete error logs from both the joiner and the donor.
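As a starting point, and assuming the default xtrabackup-v2 SST method with the datadir at /var/lib/mysql (both assumptions here), the SST script writes its own logs outside the main error log, and those usually show the real reason a transfer is reported as failed:

# on the donor (balpercona3) -- backup/streaming phase of the SST
grep -iE 'error|fatal' /var/lib/mysql/innobackup.backup.log
# on the joiner (hqpercona1) -- prepare/move phase after the data arrives
grep -iE 'error|fatal' /var/lib/mysql/innobackup.prepare.log /var/lib/mysql/innobackup.move.log
# and check the donor's main error log around 19:16 for the matching failure
# (path assumed: default hostname.err in the datadir)
grep '19:16' /var/lib/mysql/balpercona3.bal.example.com.err

Comparing the donor-side messages from the same timeframe against the joiner log you posted is usually what narrows "Operation not permitted" down to a concrete cause.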