Hello,
I have had a 3-node cluster running for 8 months. I’m using it only as Master / Backup / Backup behind HAProxy (my applications don’t handle multi-master writes very well). I’m happy with this configuration: I can update everything (kernel every week…), do some backup / restore for testing purposes on a backup node, etc.
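For context, the HAProxy backend is set up roughly like this (a simplified sketch, not my exact config; the clustercheck health check on port 9200 is the usual PXC pattern and is assumed here):

listen mysql-pxc
    bind 0.0.0.0:3306
    mode tcp
    option httpchk                                   # poll clustercheck over HTTP
    server node-a 10.0.0.61:3306 check port 9200
    server node-b 10.0.0.64:3306 check port 9200 backup
    server node-c 10.0.0.65:3306 check port 9200 backup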
Today I had my first crash:
- in the morning I lost node C, but my monitoring didn’t alert me and I was busy with another issue…
- in the afternoon, a little network latency hit, and node A and node B just stopped (with node C already gone, the brief partition left neither remaining node with quorum):
NODE A:
2018-09-17T12:36:30.074323Z 0 [Note] WSREP: (2e38e14b, 'tcp://0.0.0.0:4567') connection to peer b3803219 with addr tcp://10.0.0.65:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-09-17T12:36:30.074514Z 0 [Note] WSREP: (2e38e14b, 'tcp://0.0.0.0:4567') connection to peer 0dde40fe with addr tcp://10.0.0.64:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-09-17T12:36:30.074556Z 0 [Note] WSREP: (2e38e14b, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.0.0.64:4567 tcp://10.0.0.65:4567
2018-09-17T12:36:31.074793Z 0 [Note] WSREP: (2e38e14b, 'tcp://0.0.0.0:4567') reconnecting to 0dde40fe (tcp://10.0.0.64:4567), attempt 0
2018-09-17T12:36:31.075179Z 0 [Note] WSREP: (2e38e14b, 'tcp://0.0.0.0:4567') reconnecting to b3803219 (tcp://10.0.0.65:4567), attempt 0
2018-09-17T12:36:31.910630Z 0 [Note] WSREP: (2e38e14b, 'tcp://0.0.0.0:4567') connection established to 0dde40fe tcp://10.0.0.64:4567
2018-09-17T12:36:31.987476Z 0 [Note] WSREP: (2e38e14b, 'tcp://0.0.0.0:4567') connection established to b3803219 tcp://10.0.0.65:4567
2018-09-17T12:36:32.075490Z 0 [Note] WSREP: declaring node with index 0 suspected, timeout PT5S (evs.suspect_timeout)
2018-09-17T12:36:32.075548Z 0 [Note] WSREP: declaring node with index 2 suspected, timeout PT5S (evs.suspect_timeout)
2018-09-17T12:36:32.075566Z 0 [Note] WSREP: evs::proto(2e38e14b, OPERATIONAL, view_id(REG,0dde40fe,693)) suspecting node: 0dde40fe
2018-09-17T12:36:32.075607Z 0 [Note] WSREP: evs::proto(2e38e14b, OPERATIONAL, view_id(REG,0dde40fe,693)) suspected node without join message, declaring inactive
2018-09-17T12:36:32.075634Z 0 [Note] WSREP: evs::proto(2e38e14b, OPERATIONAL, view_id(REG,0dde40fe,693)) suspecting node: b3803219
2018-09-17T12:36:32.075644Z 0 [Note] WSREP: evs::proto(2e38e14b, OPERATIONAL, view_id(REG,0dde40fe,693)) suspected node without join message, declaring inactive
2018-09-17T12:36:32.077115Z 0 [Note] WSREP: (2e38e14b, 'tcp://0.0.0.0:4567') connection established to 0dde40fe tcp://10.0.0.64:4567
2018-09-17T12:36:32.082668Z 0 [Note] WSREP: (2e38e14b, 'tcp://0.0.0.0:4567') connection established to b3803219 tcp://10.0.0.65:4567
2018-09-17T12:36:33.075902Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,0dde40fe,693)
memb {
2e38e14b,0
}
joined {
}
left {
}
partitioned {
0dde40fe,0
b3803219,0
}
)
2018-09-17T12:36:33.076069Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,2e38e14b,694)
memb {
2e38e14b,0
}
joined {
}
left {
}
partitioned {
0dde40fe,0
b3803219,0
}
)
2018-09-17T12:36:33.076083Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-09-17T12:36:33.076135Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2018-09-17T12:36:33.076143Z 0 [Note] WSREP: Trying to continue unpaused monitor
2018-09-17T12:36:33.076150Z 0 [Note] WSREP: Received NON-PRIMARY.
2018-09-17T12:36:33.076158Z 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 293303616)
2018-09-17T12:36:33.076244Z 2 [Note] WSREP: New cluster view: global state: e94fa648-0d4e-11e7-b53b-cf35fa571ba4:293303616, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2018-09-17T12:36:33.076293Z 2 [Note] WSREP: Setting wsrep_ready to false
2018-09-17T12:36:33.076301Z 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-09-17T12:36:33.076275Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-09-17T12:36:33.076361Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2018-09-17T12:36:33.076369Z 0 [Note] WSREP: Trying to continue unpaused monitor
2018-09-17T12:36:33.076375Z 0 [Note] WSREP: Received NON-PRIMARY.
NODE B:
2018-09-17T12:36:29.910338Z 0 [Note] WSREP: (0dde40fe, 'tcp://0.0.0.0:4567') connection to peer 2e38e14b with addr tcp://10.0.0.61:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-09-17T12:36:29.910529Z 0 [Note] WSREP: (0dde40fe, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.0.0.61:4567
2018-09-17T12:36:30.910501Z 0 [Note] WSREP: (0dde40fe, 'tcp://0.0.0.0:4567') reconnecting to 2e38e14b (tcp://10.0.0.61:4567), attempt 0
2018-09-17T12:36:31.911475Z 0 [Note] WSREP: (0dde40fe, 'tcp://0.0.0.0:4567') connection established to 2e38e14b tcp://10.0.0.61:4567
2018-09-17T12:36:32.077385Z 0 [Note] WSREP: (0dde40fe, 'tcp://0.0.0.0:4567') connection established to 2e38e14b tcp://10.0.0.61:4567
2018-09-17T12:36:32.081116Z 0 [Note] WSREP: declaring b3803219 at tcp://10.0.0.65:4567 stable
2018-09-17T12:36:32.083570Z 0 [Note] WSREP: Node 0dde40fe state primary
2018-09-17T12:36:32.085508Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,0dde40fe,694)
memb {
0dde40fe,0
b3803219,0
}
joined {
}
left {
}
partitioned {
2e38e14b,0
}
)
2018-09-17T12:36:32.085528Z 0 [Note] WSREP: Save the discovered primary-component to disk
2018-09-17T12:36:32.085769Z 0 [Note] WSREP: forgetting 2e38e14b (tcp://10.0.0.61:4567)
2018-09-17T12:36:32.085803Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2018-09-17T12:36:32.086112Z 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 4e82208f-ba76-11e8-9d76-8360ec68fa9d
2018-09-17T12:36:34.912390Z 0 [Note] WSREP: cleaning up 2e38e14b (tcp://10.0.0.61:4567)
2018-09-17T12:36:35.412545Z 0 [Note] WSREP: (0dde40fe, 'tcp://0.0.0.0:4567') connection to peer b3803219 with addr tcp://10.0.0.65:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-09-17T12:36:36.412788Z 0 [Note] WSREP: (0dde40fe, 'tcp://0.0.0.0:4567') reconnecting to b3803219 (tcp://10.0.0.65:4567), attempt 0
2018-09-17T12:36:37.079189Z 0 [Note] WSREP: (0dde40fe, 'tcp://0.0.0.0:4567') connection established to 2e38e14b tcp://10.0.0.61:4567
2018-09-17T12:36:37.590590Z 0 [Note] WSREP: declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2018-09-17T12:36:37.590668Z 0 [Note] WSREP: evs::proto(0dde40fe, GATHER, view_id(REG,0dde40fe,694)) suspecting node: b3803219
2018-09-17T12:36:37.590691Z 0 [Note] WSREP: evs::proto(0dde40fe, GATHER, view_id(REG,0dde40fe,694)) suspected node without join message, declaring inactive
2018-09-17T12:36:38.090835Z 0 [Note] WSREP: declaring node with index 1 inactive (evs.inactive_timeout)
2018-09-17T12:36:39.091407Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,0dde40fe,694)
memb {
0dde40fe,0
}
joined {
}
left {
}
partitioned {
b3803219,0
}
)
2018-09-17T12:36:39.091515Z 0 [Note] WSREP: declaring 2e38e14b at tcp://10.0.0.61:4567 stable
2018-09-17T12:36:39.091724Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-09-17T12:36:39.091755Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2018-09-17T12:36:39.091770Z 0 [Note] WSREP: Trying to continue unpaused monitor
2018-09-17T12:36:39.091783Z 0 [Note] WSREP: Received NON-PRIMARY.
2018-09-17T12:36:39.091797Z 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 293303616)
2018-09-17T12:36:39.091884Z 2 [Note] WSREP: New cluster view: global state: e94fa648-0d4e-11e7-b53b-cf35fa571ba4:293303616, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2018-09-17T12:36:39.091903Z 2 [Note] WSREP: Setting wsrep_ready to false
2018-09-17T12:36:39.091931Z 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
To recover, I just had to restart node A with /etc/init.d/mysql bootstrap-pxc after editing grastate.dat (node A and node B had the same file; node C’s was different). The cluster then came back with 2 nodes and I restarted node C.
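For the record, the recovery went roughly like this on node A (a sketch; it assumes PXC 5.7+, where grastate.dat carries the safe_to_bootstrap flag, and the version line is illustrative):

cat /var/lib/mysql/grastate.dat
# GALERA saved state
# version: 2.1
# uuid:    e94fa648-0d4e-11e7-b53b-cf35fa571ba4
# seqno:   -1
# safe_to_bootstrap: 0    <- set to 1 on node A only, then:
/etc/init.d/mysql bootstrap-pxc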
What is the best solution for me? Just adding a new node, or is Galera Arbitrator an option?
A fourth node would follow the same setup process as the others, while Galera Arbitrator would be a new thing to manage, though it wouldn’t consume disk space for data.
What happens if I reboot / crash / lose the Galera Arbitrator?
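From what I’ve read, the arbitrator would run on a third machine as something like this (a sketch based on the garbd documentation; the group name must match wsrep_cluster_name and is shown here as a placeholder):

garbd --group my_pxc_cluster \
      --address "gcomm://10.0.0.61:4567,10.0.0.64:4567" \
      --log /var/log/garbd.log \
      --daemon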
Thanks
Yathus