"exception in PC" on node 1 -> whole 3 node cluster froze

I have a 3-node setup: nodes 1 and 2 are in datacenter A, node 3 is in datacenter B.
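
For reference, here is a minimal sketch of the kind of configuration each node runs; the IPs, paths and cluster name are placeholders, and the gmcast.segment values correspond to the two datacenters (they also show up as segment=1 / segment=2 in the state dump below):

# my.cnf fragment on node 1 (datacenter A); paths, IPs and names are placeholders
[mysqld]
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so
wsrep_cluster_name=my_cluster
wsrep_cluster_address=gcomm://10.0.1.11,10.0.1.12,10.0.2.11
wsrep_node_address=10.0.1.11
# nodes 1 and 2 use segment 1 (datacenter A), node 3 uses segment 2 (datacenter B)
wsrep_provider_options="gmcast.segment=1"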

Today, after a lot of partitioning and resyncing of the cluster, node 1 failed with the error shown below under "ERROR ON NODE 1".

Thereafter, the whole cluster froze, and nodes 2 and 3 kept logging messages like the following:
2014-02-16 12:57:47 2595 [Note] WSREP: Nodes 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d are still in unknown state, unable to rebootstrap new prim
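
If I understand the Galera documentation correctly, a node stuck in a non-primary component can be told to bootstrap a new primary component at runtime, roughly like this (I have not tried it yet and am not sure it is safe while node 1 is in this state):

-- run on ONE of the surviving nodes (2 or 3) while it is stuck in non-primary state;
-- this forces that node to form a new primary component, so use with care after a split
SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';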

Does anyone have any ideas on how to solve this issue?

Thanks!

Frank.

ERROR ON NODE 1:

2014-02-16 12:41:02 2504 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
pc::Proto{uuid=62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,start_prim=0,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=4,checksum=0,instances=
62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,prim=1,un=1,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
6811f0c9-9367-11e3-9044-cb32ea280bb8,prim=0,un=0,last_seq=1,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,40),to_seq=151199,weight=1,segment=2
bbce851d-9367-11e3-8a0e-9a5107ef8b9f,prim=1,un=1,last_seq=43,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
,state_msgs=
62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,prim=1,un=0,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
bbce851d-9367-11e3-8a0e-9a5107ef8b9f,prim=1,un=0,last_seq=43,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
}}
6811f0c9-9367-11e3-9044-cb32ea280bb8,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,prim=1,un=1,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,40),to_seq=151199,weight=1,segment=1
6811f0c9-9367-11e3-9044-cb32ea280bb8,prim=0,un=0,last_seq=1,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,40),to_seq=151199,weight=1,segment=2
bbce851d-9367-11e3-8a0e-9a5107ef8b9f,prim=1,un=1,last_seq=2,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,40),to_seq=151199,weight=1,segment=1
}}
bbce851d-9367-11e3-8a0e-9a5107ef8b9f,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,prim=1,un=0,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
bbce851d-9367-11e3-8a0e-9a5107ef8b9f,prim=1,un=0,last_seq=43,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
}}
,current_view=view(view_id(REG,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,49) memb {
62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,0
6811f0c9-9367-11e3-9044-cb32ea280bb8,0
bbce851d-9367-11e3-8a0e-9a5107ef8b9f,0
} joined {
6811f0c9-9367-11e3-9044-cb32ea280bb8,0
} left {
} partitioned {
}),pc_view=view(view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46) memb {
62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,1
bbce851d-9367-11e3-8a0e-9a5107ef8b9f,1
} joined {
} left {
} partitioned {
}),mtu=32636}
2014-02-16 12:41:02 2504 [Note] WSREP: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=bbce851d-9367-11e3-8a0e-9a5107ef8b9f,source_view_id=view_id(REG,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,49),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=3202335,node_list=()
} 116
2014-02-16 12:41:02 2504 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=3,user_type=255,order=1,seq=0,seq_range=-1,aru_seq=0,flags=4,source=6811f0c9-9367-11e3-9044-cb32ea280bb8,source_view_id=view_id(REG,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,49),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=3202238,node_list=()
}
state after handling message: evs::proto(evs::proto(62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d, OPERATIONAL, view_id(REG,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,49)), OPERATIONAL) {
current_view=view(view_id(REG,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,49) memb {
62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,0
6811f0c9-9367-11e3-9044-cb32ea280bb8,0
bbce851d-9367-11e3-8a0e-9a5107ef8b9f,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} node: {idx=2,range=[1,0],safe_seq=0} },
fifo_seq=3203297,
last_sent=0,
known={
62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,evs::node{operational=1,suspected=0,installed=1,fifo_seq=-1,}
6811f0c9-9367-11e3-9044-cb32ea280bb8,evs::node{operational=1,suspected=0,installed=1,fifo_seq=3202238,}
bbce851d-9367-11e3-8a0e-9a5107ef8b9f,evs::node{operational=1,suspected=0,installed=1,fifo_seq=3202337,}
}
}2014-02-16 12:41:02 2504 [ERROR] WSREP: exception from gcomm, backend must be restarted: msg_state == local_state: 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d node 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d prim state message and local states not consistent: msg node prim=1,un=0,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1 local state prim=1,un=1,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1 (FATAL)
at gcomm/src/pc_proto.cpp:validate_state_msgs():607
2014-02-16 12:41:02 2504 [Note] WSREP: Received self-leave message.
2014-02-16 12:41:02 2504 [Note] WSREP: Flow-control interval: [0, 0]
2014-02-16 12:41:02 2504 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2014-02-16 12:41:02 2504 [Note] WSREP: Shifting SYNCED → CLOSED (TO: 2988725)
2014-02-16 12:41:02 2504 [Note] WSREP: RECV thread exiting 0: Success
2014-02-16 12:41:02 2504 [Note] WSREP: New cluster view: global state: 5dd126ae-2944-11e3-9d8e-a65147a95bff:2988725, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
2014-02-16 12:41:17 2504 [Note] WSREP: applier thread exiting (code:0)
2014-02-16 16:17:15 2504 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
2014-02-16 16:17:15 2504 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
2014-02-16 16:25:03 2504 [Note] /usr/sbin/mysqld: Normal shutdown

Well, I have now moved my third node to the same datacenter as nodes 1 and 2 and hope that this works as a workaround. It is not a real solution, though; I thought XtraDB Cluster was capable of running with nodes distributed across datacenters.
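
An alternative I am wondering about, instead of moving the node, is relaxing the EVS timeouts to make the group communication more tolerant of the WAN link, roughly like this (the timeout values are guesses on my part, not validated):

# my.cnf fragment, same idea on all nodes; keep the per-node gmcast.segment value
wsrep_provider_options="gmcast.segment=1;evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;evs.keepalive_period=PT3S"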