weird behavior of IST syncronization

Hello,

I have a xtradb cluster distributed between 3 nodes and one of the datacenters recently had couple internet outages which were causing node there to run out of sync. For a few days I were trying to figure out why IST sync were failing back to SST.
After bunch of tries, errors and starts problem were resolved on its own. Any ideas what it could be? The way I was doing a test were to block incoming and outgoing requests to rest of the cluster nodes with iptables

Failed IST sync:
018-01-10T04:34:07.381530Z 0 [Note] WSREP: Flow-control interval: [173, 173]
2018-01-10T04:34:07.381538Z 0 [Note] WSREP: Trying to continue unpaused monitor
2018-01-10T04:34:07.381546Z 0 [Note] WSREP: Shifting OPEN → PRIMARY (TO: 1687849)
2018-01-10T04:34:07.381688Z 1 [Note] WSREP: State transfer required:
Group state: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687849
Local state: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687846
2018-01-10T04:34:07.381722Z 1 [Note] WSREP: New cluster view: global state: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687849, view# 243: Primary, number of nodes: 3, my index: 2, protocol version 3
2018-01-10T04:34:07.381732Z 1 [Note] WSREP: Setting wsrep_ready to true
2018-01-10T04:34:07.381740Z 1 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-01-10T04:34:07.381746Z 1 [Note] WSREP: Setting wsrep_ready to false
2018-01-10T04:34:07.381755Z 1 [Note] WSREP: You have configured ‘xtrabackup-v2’ state snapshot transfer method which cannot be performed on a running server. Wsrep provider won’t be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
2018-01-10T04:34:07.381764Z 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-01-10T04:34:07.381777Z 1 [Note] WSREP: REPL Protocols: 7 (3, 2)
2018-01-10T04:34:07.381831Z 1 [Note] WSREP: Assign initial position for certification: 1687849, protocol version: 3
2018-01-10T04:34:07.382003Z 0 [Note] WSREP: Service thread queue flushed.
2018-01-10T04:34:07.382121Z 1 [Note] WSREP: Check if state gap can be serviced using IST
2018-01-10T04:34:07.382261Z 1 [Note] WSREP: IST receiver addr using tcp://publicip.com:4568
2018-01-10T04:34:07.382358Z 1 [Note] WSREP: IST receiver bind using tcp://dbb:4568
2018-01-10T04:34:07.382594Z 1 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.0.0.51:4568
2018-01-10T04:34:07.382624Z 1 [Note] WSREP: State gap can be likely serviced using IST. SST request though present would be void.
2018-01-10T04:34:07.426046Z 0 [Note] WSREP: Member 2.0 (publicip.com) requested state transfer from ‘any’. Selected 1.0 (dbC)(SYNCED) as donor.
2018-01-10T04:34:07.426086Z 0 [Note] WSREP: Shifting PRIMARY → JOINER (TO: 1687849)
2018-01-10T04:34:07.426226Z 1 [Note] WSREP: Requesting state transfer: success, donor: 1
2018-01-10T04:34:07.426270Z 1 [Note] WSREP: GCache history reset: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687846 → ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687849
2018-01-10T04:34:07.432080Z 1 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset
2018-01-10T04:34:07.440393Z 1 [Note] WSREP: Receiving IST: 3 writesets, seqnos 1687846-1687849
2018-01-10T04:34:07.445561Z 0 [Note] WSREP: (91457fd6, ‘tcp://0.0.0.0:4567’) connection to peer 00000000 with addr tcp://dbcpublicip.com:4567 timed out, no messages seen in PT3S
2018-01-10T04:34:07.576675Z 0 [ERROR] WSREP: got error while reading ist stream: asio.system:104
2018-01-10T04:34:07.576725Z 0 [ERROR] WSREP: IST didn’t contain all write sets, expected last: 1687849 last received: 1687846
2018-01-10T04:34:07.576876Z 1 [ERROR] WSREP: receiving IST failed, node restart required: IST receiver reported error: 71 (Protocol error)
at galera/src/ist.cpp:recv():558
2018-01-10T04:34:07.579924Z 0 [Note] WSREP: 1.0 (dbC): State transfer to 2.0 (publicip.com) complete.
2018-01-10T04:34:07.623296Z 0 [Note] WSREP: Member 1.0 (dbC) synced with group.
2018-01-10T04:34:07.627104Z 1 [Note] WSREP: Closing send monitor…
2018-01-10T04:34:07.627132Z 1 [Note] WSREP: Closed send monitor.
2018-01-10T04:34:07.627147Z 1 [Note] WSREP: gcomm: terminating thread
2018-01-10T04:34:07.627163Z 1 [Note] WSREP: gcomm: joining thread
2018-01-10T04:34:07.627300Z 1 [Note] WSREP: gcomm: closing backend

Successful IST sync:

2018-01-10T04:59:10.932164Z 1 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-01-10T04:59:10.932171Z 1 [Note] WSREP: Setting wsrep_ready to false
2018-01-10T04:59:10.932181Z 1 [Note] WSREP: You have configured ‘xtrabackup-v2’ state snapshot transfer method which cannot be performed on a running server. Wsrep provider won’t be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
2018-01-10T04:59:10.932205Z 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-01-10T04:59:10.932220Z 1 [Note] WSREP: REPL Protocols: 7 (3, 2)
2018-01-10T04:59:10.932251Z 1 [Note] WSREP: Assign initial position for certification: 1687902, protocol version: 3
2018-01-10T04:59:10.932387Z 0 [Note] WSREP: Service thread queue flushed.
2018-01-10T04:59:10.932506Z 1 [Note] WSREP: Check if state gap can be serviced using IST
2018-01-10T04:59:10.932630Z 1 [Note] WSREP: IST receiver addr using tcp://publicdomain.com:4568
2018-01-10T04:59:10.932706Z 1 [Note] WSREP: IST receiver bind using tcp://dbb:4568
2018-01-10T04:59:10.932903Z 1 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.0.0.51:4568
2018-01-10T04:59:10.932931Z 1 [Note] WSREP: State gap can be likely serviced using IST. SST request though present would be void.
2018-01-10T04:59:10.974374Z 0 [Note] WSREP: Member 2.0 (publicdomain.com) requested state transfer from ‘any’. Selected 1.0 (dbC)(SYNCED) as donor.
2018-01-10T04:59:10.974404Z 0 [Note] WSREP: Shifting PRIMARY → JOINER (TO: 1687902)
2018-01-10T04:59:10.974510Z 1 [Note] WSREP: Requesting state transfer: success, donor: 1
2018-01-10T04:59:10.974550Z 1 [Note] WSREP: GCache history reset: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687899 → ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687902
2018-01-10T04:59:10.978535Z 1 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset
2018-01-10T04:59:10.986850Z 1 [Note] WSREP: Receiving IST: 3 writesets, seqnos 1687899-1687902
2018-01-10T04:59:11.123203Z 0 [Note] WSREP: Receiving IST… 0.0% (0/3 events) complete.
2018-01-10T04:59:11.123483Z 0 [Note] WSREP: 1.0 (dbC): State transfer to 2.0 (publicdomain.com) complete.
2018-01-10T04:59:11.165109Z 0 [Note] WSREP: Member 1.0 (dbC) synced with group.
2018-01-10T04:59:11.203793Z 0 [Note] WSREP: Receiving IST…100.0% (3/3 events) complete.
2018-01-10T04:59:11.254440Z 1 [Note] WSREP: IST received: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687902
2018-01-10T04:59:11.296028Z 0 [Note] WSREP: 2.0 (publicdomain.com): State transfer from 1.0 (dbC) complete.
2018-01-10T04:59:11.296061Z 0 [Note] WSREP: SST leaving flow control
2018-01-10T04:59:11.296075Z 0 [Note] WSREP: Shifting JOINER → JOINED (TO: 1687902)
2018-01-10T04:59:11.337483Z 0 [Note] WSREP: Member 2.0 (publicdomain.com) synced with group.
2018-01-10T04:59:11.337514Z 0 [Note] WSREP: Shifting JOINED → SYNCED (TO: 1687902)
2018-01-10T04:59:11.337644Z 1 [Note] WSREP: Synchronized with group, ready for connections
2018-01-10T04:59:11.337671Z 1 [Note] WSREP: This node is synced, setting wsrep_ready to true

PXC needs stable n/w connectivity. Failing which PXC can recover but may fall back to SST if IST is abruptly interrupted.
It seems like you are likely going through the same scenario.

2018-01-10T04:34:07.576675Z 0 [ERROR] WSREP: got error while reading ist stream: asio.system:104
2018-01-10T04:34:07.576725Z 0 [ERROR] WSREP: IST didn’t contain all write sets, expected last: 1687849 last received: 1687846
2018-01-10T04:34:07.576876Z 1 [ERROR] WSREP: receiving IST failed, node restart required: IST receiver reported error: 71 (Protocol error)
at galera/src/ist.cpp:recv():558