Hey, we have a cross-WAN setup in the works.
Site1 : Node1 Node2 Node3
Site2 : Node1
We went ahead and set wsrep_sst_donor=Site1:Node3 in the my.cnf for Site2:Node1.
Note: Initially we got hung up for a while trying to use an IP address. It had to be the hostname or the node name from Node3’s config…oops. All good now though.
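For anyone hitting the same thing, here is a minimal sketch of the relevant my.cnf lines. It assumes the donor's wsrep_node_name is either unset (in which case it defaults to the hostname) or set to the hostname explicitly; the value shown is the donor name that appears in our logs, and everything else in both files is omitted.

```ini
# --- Site1:Node3 (donor) my.cnf ---
[mysqld]
# Optional: when unset, wsrep_node_name defaults to the server's hostname.
wsrep_node_name = balpercona3.bal.example.com

# --- Site2:Node1 (joiner) my.cnf ---
[mysqld]
# Must match the donor's wsrep_node_name; an IP address is not accepted here.
wsrep_sst_donor = balpercona3.bal.example.com
```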
OK, so Site2:Node1 starts the join process, as seen here:
151111 17:58:29 [Note] WSREP: Node 3 (hqpercona1.hq.example.com) requested state transfer from 'balpercona3.bal.example.com'. Selected 0 (balpercona3.bal.example.com)(SYNCED) as donor.
151111 17:58:29 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 269058800)
151111 17:58:29 [Note] WSREP: Requesting state transfer: success, donor: 0
This shows a standard JOINER status, and SST on the node starting mysql looks good.
OK, so over an hour later, SST completes…but the service fails to start.
151111 19:16:09 [Warning] WSREP: 0 (balpercona3.example.com): State transfer to 3 (hqpercona1.hq.example.com) failed: -1 (Operation not permitted)
151111 19:16:09 [ERROR] WSREP: gcs/src/gcs_group.cpp:long int gcs_group_handle_join_msg(gcs_group_t*, const gcs_recv_msg_t*)():717: Will never receive state. Need to abort.
151111 19:16:09 [Note] WSREP: gcomm: terminating thread
151111 19:16:09 [Note] WSREP: gcomm: joining thread
151111 19:16:09 [Note] WSREP: gcomm: closing backend
151111 19:16:09 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.16.52.11:4567 tcp://172.16.52.12:4567 tcp://172.16.52.13:4567 tcp://192.168.35.11:4567 tcp://192.168.35.12:4567 tcp://192.168.35.13:4567
151111 19:16:09 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 85b0c608 (tcp://172.16.52.12:4567), attempt 0
151111 19:16:10 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 6c4181be (tcp://192.168.35.11:4567), attempt 0
151111 19:16:10 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 85b0c608 (tcp://192.168.35.12:4567), attempt 0
151111 19:16:10 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 31225cf2 (tcp://192.168.35.13:4567), attempt 0
151111 19:16:11 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 6c4181be (tcp://192.168.35.11:4567), attempt 0
151111 19:16:11 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 85b0c608 (tcp://192.168.35.12:4567), attempt 0
151111 19:16:11 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 31225cf2 (tcp://192.168.35.13:4567), attempt 0
151111 19:16:13 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 6c4181be (tcp://192.168.35.11:4567), attempt 0
151111 19:16:13 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 85b0c608 (tcp://192.168.35.12:4567), attempt 0
151111 19:16:13 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 31225cf2 (tcp://192.168.35.13:4567), attempt 0
151111 19:16:14 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 6c4181be (tcp://172.16.52.11:4567), attempt 0
151111 19:16:14 [Note] WSREP: (b6dff4c3, 'tcp://0.0.0.0:4567') reconnecting to 31225cf2 (tcp://172.16.52.13:4567), attempt 0
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspecting node: 31225cf2
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspected node without join message, declaring inactive
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspecting node: 6c4181be
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspected node without join message, declaring inactive
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspecting node: 85b0c608
151111 19:16:14 [Note] WSREP: evs::proto(b6dff4c3, LEAVING, view_id(REG,31225cf2,60)) suspected node without join message, declaring inactive
151111 19:16:14 [Note] WSREP: gcomm: closed
151111 19:16:14 [Note] WSREP: /usr/sbin/mysqld: Terminated.
151111 19:16:14 mysqld_safe mysqld from pid file /var/lib/mysql/hqpercona1.hq.example.com.pid ended
So my question is…why do you think the final operation is failing?