Joining new node: State transfer failed: -32 (Broken pipe)

I have a cluster with 2 nodes operating out of one site. I’m attempting to add a third node in a different site. All nodes are reachable from each other and manual entries with node IPs have been added to each node’s /etc/hosts file for good measure.

When I attempt start xtradb on the new node, I get the following error message: State transfer to 2.0 (site-b-mysql-01) failed: -32 (Broken pipe) - after which the mysql launch aborts and node fails to join cluster. Partial log below (due to post character limit). Full log is attached.


2017-08-08T03:57:35.246411Z 0 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(00000000-0000-0000-0000-000000000000:-1)
2017-08-08T03:57:35.247116Z 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2017-08-08T03:57:35.247193Z 0 [Note] WSREP: Preparing to initiate SST/IST
2017-08-08T03:57:35.247208Z 0 [Note] WSREP: Starting replication
2017-08-08T03:57:35.247259Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2017-08-08T03:57:35.247703Z 0 [Note] WSREP: Using CRC-32C for message checksums.
2017-08-08T03:57:35.247931Z 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2017-08-08T03:57:35.248181Z 0 [Warning] WSREP: Fail to access the file (/var/lib/mysql//gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown
2017-08-08T03:57:35.248209Z 0 [Note] WSREP: Restoring primary-component from disk failed. Either node is booting for first time or re-booting after a graceful shutdown
2017-08-08T03:57:35.249548Z 0 [Note] WSREP: GMCast version 0
2017-08-08T03:57:35.250285Z 0 [Note] WSREP: (b63503ea, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-08-08T03:57:35.250320Z 0 [Note] WSREP: (b63503ea, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-08-08T03:57:35.251218Z 0 [Note] WSREP: EVS version 0
2017-08-08T03:57:35.251519Z 0 [Note] WSREP: gcomm: connecting to group 'stiu_mysql_cluster_a', peer '10.1.2.48:,10.1.2.49:,10.10.201.3:'
2017-08-08T03:57:35.254610Z 0 [Note] WSREP: (b63503ea, 'tcp://0.0.0.0:4567') connection established to b63503ea tcp://10.10.201.3:4567
2017-08-08T03:57:35.254662Z 0 [Warning] WSREP: (b63503ea, 'tcp://0.0.0.0:4567') address 'tcp://10.10.201.3:4567' points to own listening address, blacklisting
2017-08-08T03:57:35.291763Z 0 [Note] WSREP: (b63503ea, 'tcp://0.0.0.0:4567') connection established to 5dbd24b5 tcp://10.1.2.48:4567
2017-08-08T03:57:35.292311Z 0 [Note] WSREP: (b63503ea, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2017-08-08T03:57:35.293081Z 0 [Note] WSREP: (b63503ea, 'tcp://0.0.0.0:4567') connection established to 70236211 tcp://10.1.2.49:4567
2017-08-08T03:57:36.780672Z 0 [Note] WSREP: declaring 5dbd24b5 at tcp://10.1.2.48:4567 stable
2017-08-08T03:57:36.780768Z 0 [Note] WSREP: declaring 70236211 at tcp://10.1.2.49:4567 stable
2017-08-08T03:57:36.790072Z 0 [Note] WSREP: Node 5dbd24b5 state primary
2017-08-08T03:57:36.800489Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,5dbd24b5,5)
memb {
5dbd24b5,0
70236211,0
b63503ea,0
}
joined {
}
left {
}
partitioned {
}
)
2017-08-08T03:57:36.800545Z 0 [Note] WSREP: Save the discovered primary-component to disk
2017-08-08T03:57:37.253915Z 0 [Note] WSREP: gcomm: connected
2017-08-08T03:57:37.254197Z 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2017-08-08T03:57:37.254450Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2017-08-08T03:57:37.254568Z 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2017-08-08T03:57:37.254707Z 0 [Note] WSREP: Waiting for SST/IST to complete.
2017-08-08T03:57:37.254863Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: ba338369-7bed-11e7-b3a4-c3ff8c880ac1
2017-08-08T03:57:37.254913Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: ba338369-7bed-11e7-b3a4-c3ff8c880ac1 from 0 (site-a-mysql-01)
2017-08-08T03:57:37.254939Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: ba338369-7bed-11e7-b3a4-c3ff8c880ac1 from 1 (site-a-mysql-02)
2017-08-08T03:57:37.264381Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: ba338369-7bed-11e7-b3a4-c3ff8c880ac1 from 2 (site-b-mysql-01)
2017-08-08T03:57:37.264430Z 0 [Note] WSREP: Quorum results:
version = 4,
component = PRIMARY,
conf_id = 4,
members = 2/3 (primary/total),
act_id = 374,
last_appl. = -1,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = 3bcfe97a-7071-11e7-8acc-4f4c64a0dc77
2017-08-08T03:57:37.264456Z 0 [Note] WSREP: Flow-control interval: [173, 173]
2017-08-08T03:57:37.264481Z 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 374)
2017-08-08T03:57:37.264646Z 2 [Note] WSREP: State transfer required: 
Group state: 3bcfe97a-7071-11e7-8acc-4f4c64a0dc77:374
Local state: 00000000-0000-0000-0000-000000000000:-1
2017-08-08T03:57:37.264718Z 2 [Note] WSREP: New cluster view: global state: 3bcfe97a-7071-11e7-8acc-4f4c64a0dc77:374, view# 5: Primary, number of nodes: 3, my index: 2, protocol version 3
2017-08-08T03:57:37.264758Z 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-08-08T03:57:37.264934Z 0 [Note] WSREP: Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.10.201.3' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '4078' '' )
2017-08-08T03:57:38.006127Z 2 [Note] WSREP: Prepared SST/IST request: xtrabackup-v2|10.10.201.3:4444/xtrabackup_sst//1
2017-08-08T03:57:38.006176Z 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-08-08T03:57:38.006224Z 2 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-08-08T03:57:38.006248Z 2 [Note] WSREP: Assign initial position for certification: 374, protocol version: 3
2017-08-08T03:57:38.006422Z 0 [Note] WSREP: Service thread queue flushed.
2017-08-08T03:57:38.006531Z 2 [Note] WSREP: Check if state gap can be serviced using IST
2017-08-08T03:57:38.006578Z 2 [Note] WSREP: Local UUID: 00000000-0000-0000-0000-000000000000 != Group UUID: 3bcfe97a-7071-11e7-8acc-4f4c64a0dc77
2017-08-08T03:57:38.006716Z 2 [Note] WSREP: State gap can't be serviced using IST. Switching to SST
2017-08-08T03:57:38.016285Z 0 [Note] WSREP: Member 2.0 (site-b-mysql-01) requested state transfer from '*any*'. Selected 0.0 (site-a-mysql-01)(SYNCED) as donor.
2017-08-08T03:57:38.016314Z 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 374)
2017-08-08T03:57:38.016406Z 2 [Note] WSREP: Requesting state transfer: success, donor: 0
2017-08-08T03:57:38.016461Z 2 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(3bcfe97a-7071-11e7-8acc-4f4c64a0dc77:374)
2017-08-08T03:57:38.547212Z 0 [Warning] WSREP: 0.0 (site-a-mysql-01): State transfer to 2.0 (site-b-mysql-01) failed: -32 (Broken pipe)
2017-08-08T03:57:38.547255Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():765: Will never receive state. Need to abort.
2017-08-08T03:57:38.547295Z 0 [Note] WSREP: gcomm: terminating thread
2017-08-08T03:57:38.547363Z 0 [Note] WSREP: gcomm: joining thread
2017-08-08T03:57:38.547568Z 0 [Note] WSREP: gcomm: closing backend
2017-08-08T03:57:38.567738Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,5dbd24b5,5)
memb {
b63503ea,0
}
joined {
}
left {
}
partitioned {
5dbd24b5,0
70236211,0
}
)
2017-08-08T03:57:38.567819Z 0 [Note] WSREP: (b63503ea, 'tcp://0.0.0.0:4567') connection to peer b63503ea with addr tcp://10.10.201.3:4567 timed out, no messages seen in PT3S
2017-08-08T03:57:38.567966Z 0 [Note] WSREP: (b63503ea, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-08-08T03:57:38.568022Z 0 [Note] WSREP: Current view of cluster as seen by this node
view ((empty))
2017-08-08T03:57:38.568266Z 0 [Note] WSREP: gcomm: closed
2017-08-08T03:57:38.568297Z 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Terminated
2017-08-08T03:57:48.538521Z WSREP_SST: [ERROR] Removing /tmp/pxc_sst_vnXlFmmy/joiner_yaHbN7W4/xtrabackup_galera_info file due to signal
2017-08-08T03:57:48.542962Z WSREP_SST: [ERROR] Removing file due to signal
2017-08-08T03:57:48.547503Z WSREP_SST: [ERROR] ******************* FATAL ERROR ********************** 
2017-08-08T03:57:48.549724Z WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 143 143
2017-08-08T03:57:48.551860Z WSREP_SST: [ERROR] ****************************************************** 
2017-08-08T03:57:48.554339Z WSREP_SST: [ERROR] Cleanup after exit with status:32

I’m beat as to what is causing the “broken pipe”. Any help will be greatly appreciated.

Thank you,
Sourjya

xtradb-node-join-log.txt (12.2 KB)

I encountered this issue as well, and eventually figured out (by looking at the mysql logs of the primary) that I had forgotten to open the necessary ports on the joiner. The primary needs those ports open to it so that it can send over the DB backup - namely to port 4444.

Here are the commands for a centos 7 machine:

firewall-cmd --permanent --zone=public --add-port=3306/tcp
firewall-cmd --permanent --zone=public --add-port=4444/tcp 
firewall-cmd --permanent --zone=public --add-port=4567/tcp 
firewall-cmd --permanent --zone=public --add-port=4568/tcp 
firewall-cmd --reload

obviously, those won’t work if you’re using customized firewall zones (like for an internal network) so extrapolate however you need.