XtraDBCluster 1 Node Crash

aycelen · January 31, 2025, 7:27am

I had 3 cluster databases.
One node went down unceremoniously.
We couldn’t understand why.
It could possibly be because of the table_open_files value (currently the value is 2000).
The database size is 7.2 TB.
When I want to include this node again, there is a delay in the connected software services because the surviving node 1 is looking at node 3.
We have currently distributed the services over 2 nodes. However, the backup was also working on the crashed Node 3. That’s why I have to bring this back soon.

cluster-node-3 donor is cluster-node-2
cluster-node-2 donor is cluster-node-1
cluster-node-1 donor is cluster-node-3

Node 1 and Node 2 is avaliable and node 3 mysql is not working now can you help?
We try to bosstrap=true but we got 2 nodes so it doesnt work.

Using Mysql 5.7 .

Thanks for your time.

Vinodh_Krishnaswamy · January 31, 2025, 8:12am

Hi @aycelen,

We try to bosstrap=true but we got 2 nodes so it doesnt work.

You haven’t mentioned the error that got node3 down. Also, why are you not able to join node3 to the cluster? What error are you getting? You don’t need to bootstrap to join the third node in the pxc cluster unless you want to sync all other nodes from the source of truth node. We need more info here to assist.

Regards,
Vinodh Guruji

aycelen · January 31, 2025, 9:11am

Hi Vinodh,

Thanks for interest.

I can’t mentioned cuz we dont know yet. I am very new to mysql. I can say that I know almost nothing. When I check the logs I can’t find anything and we didn’t have any errorlogs file

There is Node-3 logs

[Note] WSREP: New cluster view: global state: 6a***********--:81, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
[Note] WSREP: Setting wsrep_ready to false
[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
[Note] WSREP: New cluster view: global state: 6a--:81, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
[Note] WSREP: Setting wsrep_ready to false
[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
[Note] WSREP: New cluster view: global state: 6a--:81, view# -1: non-Primary, number of nodes: 2, my index: 0, protocol version 3
[Note] WSREP: Setting wsrep_ready to false
[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
[Note] WSREP: State transfer required:
Group state: 6a--:58
Local state: 6a--:81
[Note] WSREP: REPL Protocols: 9 (4, 2)
[Note] WSREP: REPL Protocols: 9 (4, 2)
[Note] WSREP: New cluster view: global state: 6a--:58,
view# 100: Primary, number of nodes: 3, my index: 1, protocol version 3
[Note] WSREP: Setting wsrep_ready to true
[Warning] WSREP: Gap in state sequence. Need state transfer.
[Note] WSREP: Setting wsrep_ready to false
[Note] WSREP: You have configured ‘xtrabackup-v2’ state snapshot transfer method which cannot be performed on a running server.
Wsrep provider won’t be able to fall back to it if other means
of state transfer are unavailable. In that case you will need to restart the server.
[Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 2 → 2) (Increment: 3 → 3)
[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
[Note] WSREP: Assign initial position for certification: 58, protocol version: 4
[Note] WSREP: Service thread queue flushed.
[Note] WSREP: Check if state gap can be serviced using IST
[Note] WSREP: IST receiver addr using tcp://IP2:4568
[Note] WSREP: Prepared IST receiver, listening at: tcp://IP2:4568
[Note] WSREP: State gap can be likely serviced using IST. SST request though present would be void.
[Note] WSREP: Member 1.0 (cluster-node-3) requested state transfer from ‘cluster-node-2’. Selected 2.0 (cluster-node-1)(SYNCED) as donor.
[Note] WSREP: Shifting PRIMARY → JOINER (TO: 56350)
[Note] WSREP: Requesting state transfer: success, donor: 2
[Note] WSREP: GCache history reset: 6a--:81 → 6a--:********58
31253 [Note] Aborted connection 2931253 to db: ‘unconnected’ user: ‘obss’ host: ‘ip’ (Got an error reading communication packets)
[Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 26842852480 bytes
[Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 1/493 locked buffers
[Note] WSREP: Receiving IST: 1577 writesets, seqnos 5638081881-********58
[Warning] WSREP: 2.0 (cluster-node-1): State transfer to 1.0 (cluster-node-3) failed: -110 (Connection timed out)
[ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():811: Will never receive state. Need to abort.
[Note] WSREP: gcomm: terminating thread
[Note] WSREP: gcomm: joining thread
[Note] WSREP: gcomm: closing backend
[Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,4666,107)
memb {
96728195,0
}
joined {
}
left {
}
partitioned {
4666,0
ad*e74,0
}
)
[Note] WSREP: Current view of cluster as seen by this node
view ((empty))
[Note] Aborted connection 2931241 to db: ‘unconnected’ user: ‘pmm’ host: ‘127.0.0.1’ (Got an error writing communication packets)
2025-01-29T08:42:11.731132Z 0 [Note] WSREP: gcomm: closed
2025-01-29T08:42:11.731248Z 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.

matthewb · January 31, 2025, 3:28pm

Can you please provide all logs from node3, so we can help determine why it crashed?

No! No! When you bootstrap you will create a BRAND NEW CLUSTER! You do not want this! You already have a running cluster. The ONLY time you bootstrap is when all nodes are down and you need to start the cluster.

Remove this config. You should allow any node to receive from any node.

Check your network. Node3 and node1 are having issues talking to each other. Make sure 3306, 4444, 4567, and 4568 ports are open.

aycelen · February 4, 2025, 7:31am

2025-01-29T06:28:34.362220Z 2931041 [Note] Access denied for user ‘root’@‘10.100.11.112’ (using password: NO)
2025-01-29T06:28:36.819107Z 2931042 [Note] Access denied for user ‘root’@‘10.100.11.112’ (using password: YES)
2025-01-29T07:08:51.871881Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) connection to peer 46feb866 with addr tcp://node-2-ip:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout), socket stats: rtt: 5351 rttvar: 9104 rto: 1664000 lost: 1 last_data_recv: 3284 cwnd: 1 last_queued_since: 284117525 last_delivered_since: 3283166268 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2025-01-29T07:08:51.873551Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) connection to peer ade62e74 with addr tcp://node-1-ip:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout), socket stats: rtt: 17475 rttvar: 22223 rto: 1760000 lost: 1 last_data_recv: 3284 cwnd: 1 last_queued_since: 22816 last_delivered_since: 3286786749 send_queue_length: 1 send_queue_bytes: 212 segment: 0 messages: 1
2025-01-29T07:08:51.873636Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://node-1-ip:4567 tcp://node-2-ip:4567
2025-01-29T07:08:53.370598Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to ade62e74 (tcp://node-1-ip:4567), attempt 0
2025-01-29T07:08:53.379307Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 0
2025-01-29T07:08:53.381841Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) connection established to ade62e74 tcp://node-1-ip:4567
2025-01-29T07:08:53.382290Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) connection established to 46feb866 tcp://node-2-ip:4567
2025-01-29T07:08:56.871621Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
2025-01-29T07:17:28.554882Z 2930789 [Note] Aborted connection 2930789 to db: ‘kep’ user: ‘aycelen’ host: ‘10.100.2.37’ (Got an error writing communication packets)
2025-01-29T07:42:21.292505Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) connection to peer ade62e74 with addr tcp://node-1-ip:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout), socket stats: rtt: 17009 rttvar: 22366 rto: 1760000 lost: 2 last_data_recv: 3000 cwnd: 1 last_queued_since: 31974112 last_delivered_since: 3001063341 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2025-01-29T07:42:21.292949Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) connection to peer 46feb866 with addr tcp://node-2-ip:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout), socket stats: rtt: 19765 rttvar: 20934 rto: 1760000 lost: 2 last_data_recv: 3000 cwnd: 1 last_queued_since: 32725 last_delivered_since: 3001969509 send_queue_length: 1 send_queue_bytes: 212 segment: 0 messages: 1
2025-01-29T07:42:21.293044Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://node-1-ip:4567 tcp://node-2-ip:4567
2025-01-29T07:42:22.293397Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to ade62e74 (tcp://node-1-ip:4567), attempt 0
2025-01-29T07:42:22.293814Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 0
2025-01-29T07:42:23.294335Z 0 [Note] WSREP: declaring node with index 0 suspected, timeout PT5S (evs.suspect_timeout)
2025-01-29T07:42:23.294471Z 0 [Note] WSREP: declaring node with index 2 suspected, timeout PT5S (evs.suspect_timeout)
2025-01-29T07:42:23.294550Z 0 [Note] WSREP: evs::proto(96728195, OPERATIONAL, view_id(REG,46feb866,103)) suspecting node: 46feb866
2025-01-29T07:42:23.294600Z 0 [Note] WSREP: evs::proto(96728195, OPERATIONAL, view_id(REG,46feb866,103)) suspected node without join message, declaring inactive
2025-01-29T07:42:23.294683Z 0 [Note] WSREP: evs::proto(96728195, OPERATIONAL, view_id(REG,46feb866,103)) suspecting node: ade62e74
2025-01-29T07:42:23.294733Z 0 [Note] WSREP: evs::proto(96728195, OPERATIONAL, view_id(REG,46feb866,103)) suspected node without join message, declaring inactive
2025-01-29T07:42:23.795966Z 0 [Note] WSREP: declaring node with index 0 inactive (evs.inactive_timeout)
2025-01-29T07:42:23.796059Z 0 [Note] WSREP: declaring node with index 2 inactive (evs.inactive_timeout)
2025-01-29T07:42:24.296454Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,46feb866,103)
memb {
96728195,0
}
joined {
}
left {
}
partitioned {
46feb866,0
ade62e74,0
}
)
2025-01-29T07:42:24.296658Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,96728195,104)
memb {
96728195,0
}
joined {
}
left {
}
partitioned {
46feb866,0
ade62e74,0
}
)
2025-01-29T07:42:24.297722Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2025-01-29T07:42:24.298610Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2025-01-29T07:42:24.298628Z 0 [Note] WSREP: Received NON-PRIMARY.
2025-01-29T07:42:24.298643Z 0 [Note] WSREP: Shifting SYNCED → OPEN (TO: 5638081881)
2025-01-29T07:42:24.298681Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2025-01-29T07:42:24.298701Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2025-01-29T07:42:24.298714Z 0 [Note] WSREP: Received NON-PRIMARY.
2025-01-29T07:43:52.835171Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to ade62e74 (tcp://node-1-ip:4567), attempt 30
2025-01-29T07:43:52.835598Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 30
2025-01-29T07:45:03.363807Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) connection established to ade62e74 tcp://node-1-ip:4567
2025-01-29T07:45:04.283740Z 0 [Note] WSREP: declaring ade62e74 at tcp://node-1-ip:4567 stable
2025-01-29T07:45:04.285828Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,96728195,106)
memb {
96728195,0
ade62e74,0
}
joined {
}
left {
}
partitioned {
46feb866,0
}
)
2025-01-29T07:45:04.286044Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
2025-01-29T07:45:04.286113Z 0 [Note] WSREP: Flow-control interval: [141, 141]
2025-01-29T07:45:04.286128Z 0 [Note] WSREP: Received NON-PRIMARY.
2025-01-29T07:45:13.366482Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 60
2025-01-29T07:45:57.882974Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 90
2025-01-29T07:46:41.429052Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 120
2025-01-29T07:47:25.445801Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 150
2025-01-29T07:48:09.966401Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 180
2025-01-29T07:48:54.485367Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 210
2025-01-29T07:49:36.513628Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 240
2025-01-29T07:50:20.090122Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 270
2025-01-29T07:51:04.612571Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 300
2025-01-29T07:51:49.137741Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 330
2025-01-29T07:52:33.166942Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 360
2025-01-29T07:53:17.692930Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 390
2025-01-29T07:54:02.214871Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 420
2025-01-29T07:54:44.327339Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 450
2025-01-29T07:55:29.347305Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 480
2025-01-29T07:56:12.869327Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 510
2025-01-29T07:56:56.890811Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 540
2025-01-29T07:57:40.411009Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 570
2025-01-29T07:58:24.928691Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 600
2025-01-29T07:59:09.449138Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 630
2025-01-29T07:59:53.018206Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 660
2025-01-29T08:00:38.040715Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 690
2025-01-29T08:01:22.571328Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 720
2025-01-29T08:02:07.129724Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 750
2025-01-29T08:02:51.655205Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 780
2025-01-29T08:03:34.735015Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 810
2025-01-29T08:04:17.303461Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 840
2025-01-29T08:05:00.837300Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) reconnecting to 46feb866 (tcp://node-2-ip:4567), attempt 870
2025-01-29T08:38:08.192588Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) connection established to 46feb866 tcp://node-2-ip:4567
2025-01-29T08:38:08.717385Z 0 [Note] WSREP: declaring 46feb866 at tcp://node-2-ip:4567 stable
2025-01-29T08:38:08.717473Z 0 [Note] WSREP: declaring ade62e74 at tcp://node-1-ip:4567 stable
2025-01-29T08:38:08.722481Z 0 [Note] WSREP: re-bootstrapping prim from partitioned components
2025-01-29T08:38:08.723524Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,46feb866,107)
memb {
46feb866,0
96728195,0
ade62e74,0
}
joined {
}
left {
}
partitioned {
}
)
2025-01-29T08:38:08.723583Z 0 [Note] WSREP: Save the discovered primary-component to disk
2025-01-29T08:38:08.725920Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2025-01-29T08:38:08.726006Z 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2025-01-29T08:38:08.733741Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 5de1----------------------------------------9559
2025-01-29T08:38:08.738910Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 5de1----------------------------------------9559 from 0 (cluster-node-2)
2025-01-29T08:38:08.738967Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 5de1----------------------------------------9559 from 1 (cluster-node-3)
2025-01-29T08:38:08.739015Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 5de1----------------------------------------9559 from 2 (cluster-node-1)
2025-01-29T08:38:08.739100Z 0 [Warning] WSREP: Quorum: No node with complete state:

Version      : 6
Flags        : 0x1
Protocols    : 0 / 9 / 3
State        : NON-PRIMARY
Desync count : 0
Prim state   : NON-PRIMARY
Prim UUID    : 00000000-0000-0000-0000-000000000000
Prim  seqno  : -1
First seqno  : -1
Last  seqno  : 5638083458
Prim JOINED  : 0
State UUID   : 5de1----------------------------------------9559
Group UUID   : 6a9----------------------------------------082c8b
Name         : 'cluster-node-2'
Incoming addr: 'node-2-ip:3306'

Version      : 6
Flags        : 0x2
Protocols    : 0 / 9 / 3
State        : NON-PRIMARY
Desync count : 0
Prim state   : SYNCED
Prim UUID    : 6dfca015-8043-11ef-b823-9eaa2e7e3deb
Prim  seqno  : 98
First seqno  : 5636175719
Last  seqno  : 5638081881
Prim JOINED  : 3
State UUID   : 5de1----------------------------------------9559
Group UUID   : 6a9----------------------------------------082c8b
Name         : 'cluster-node-3'
Incoming addr: 'node-3-ip:3306'

Version      : 6
Flags        : 0x2
Protocols    : 0 / 9 / 3
State        : NON-PRIMARY
Desync count : 0
Prim state   : SYNCED
Prim UUID    : 94a73d9d-de14-11ef-9440-8f77a3b29473
Prim  seqno  : 99
First seqno  : 5636524389
Last  seqno  : 5638083458
Prim JOINED  : 2
State UUID   : 5de1----------------------------------------9559
Group UUID   : 6a9----------------------------------------082c8b
Name         : 'cluster-node-1'
Incoming addr: 'node-1-ip:3306'

2025-01-29T08:38:08.742996Z 0 [Note] WSREP: Partial re-merge of primary 94a73d9d-de14-11ef-9440-8f77a3b29473 found: 1 of 2.
2025-01-29T08:38:08.743046Z 0 [Note] WSREP: Quorum results:
version = 6,
component = PRIMARY,
conf_id = 99,
members = 2/3 (primary/total),
act_id = 5638083458,
last_appl. = 5638081602,
protocols = 0/9/3 (gcs/repl/appl),
group UUID = 6a9----------------------------------------082c8b
2025-01-29T08:38:08.743107Z 0 [Note] WSREP: Flow-control interval: [173, 173]
2025-01-29T08:38:08.743146Z 0 [Note] WSREP: Shifting OPEN → PRIMARY (TO: 5638083458)
2025-01-29T08:38:08.743209Z 0 [Note] WSREP: Member 0.0 (cluster-node-2) synced with group.
2025-01-29T08:38:11.691627Z 0 [Note] WSREP: (96728195, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
2025-01-29T08:39:43.742325Z 2931038 [Note] Bad handshake
2025-01-29T08:39:43.745324Z 2931037 [Note] Aborted connection 2931037 to db: ‘unconnected’ user: ‘pmm’ host: ‘127.0.0.1’ (Got an error writing communication packets)
2025-01-29T08:39:43.745450Z 2931046 [Note] Got an error reading communication packets
2025-01-29T08:39:43.746354Z 2931051 [Note] Got an error reading communication packets
2025-01-29T08:39:43.746525Z 2931052 [Note] Got an error reading communication packets
2025-01-29T08:39:43.746609Z 2931053 [Note] Got an error reading communication packets
2025-01-29T08:39:43.746673Z 2931054 [Note] Got an error reading communication packets
2025-01-29T08:39:43.746794Z 2931055 [Note] Got an error reading communication packets
2025-01-29T08:39:43.747006Z 2931056 [Note] Got an error reading communication packets
2025-01-29T08:39:43.747302Z 2931057 [Note] Got an error reading communication packets
2025-01-29T08:39:43.748941Z 2931036 [Note] Aborted connection 2931036 to db: ‘unconnected’ user: ‘pmm’ host: ‘127.0.0.1’ (Got an error writing communication packets)
2025-01-29T08:39:43.749216Z 2931039 [Note] Aborted connection 2931039 to db: ‘unconnected’ user: ‘pmm’ host: ‘127.0.0.1’ (Got an error writing communication packets)
2025-01-29T08:39:43.749819Z 2930782 [Note] Aborted connection 2930782 to db: ‘kep’ user: ‘aycelen’ host: ‘10.100.2.37’ (Got an error writing communication packets)
2025-01-29T08:39:43.764850Z 2931081 [Note] Got an error writing communication packets
2025-01-29T08:39:43.764988Z 2931084 [Note] Got an error writing communication packets
2025-01-29T08:39:43.773632Z 2931092 [Note] Got an error writing communication packets
2025-01-29T08:39:43.779290Z 2931102 [Note] Got an error writing communication packets
2025-01-29T08:39:43.787880Z 2931111 [Note] Got an error writing communication packets
2025-01-29T08:39:43.807901Z 2931127 [Note] Got an error writing communication packets
2025-01-29T08:39:43.819679Z 2931158 [Note] Got an error reading communication packets
2025-01-29T08:39:43.820430Z 2931159 [Note] Got an error reading communication packets
2025-01-29T08:39:43.820532Z 2931160 [Note] Got an error reading communication packets
2025-01-29T08:39:43.930633Z 2931040 [Note] Aborted connection 2931040 to db: ‘unconnected’ user: ‘pmm’ host: ‘127.0.0.1’ (Got an error writing communication packets)
2025-01-29T08:39:43.936918Z 2931043 [Note] Aborted connection 2931043 to db: ‘unconnected’ user: ‘pmm’ host: ‘127.0.0.1’ (Got an error writing communication packets)
2025-01-29T08:39:43.938268Z 2931044 [Note] Aborted connection 2931044 to db: ‘unconnected’ user: ‘pmm’ host: ‘127.0.0.1’ (Got an error writing communication packets)
2025-01-29T08:39:44.257885Z 12 [Note] WSREP: New cluster view: global state: 6a9----------------------------------------082c8b:5638081881, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2025-01-29T08:39:44.257967Z 12 [Note] WSREP: Setting wsrep_ready to false
2025-01-29T08:39:44.258027Z 12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2025-01-29T08:39:44.258100Z 12 [Note] WSREP: New cluster view: global state: 6a9----------------------------------------082c8b:5638081881, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2025-01-29T08:39:44.258135Z 12 [Note] WSREP: Setting wsrep_ready to false
2025-01-29T08:39:44.258168Z 12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2025-01-29T08:39:44.258239Z 12 [Note] WSREP: New cluster view: global state: 6a9----------------------------------------082c8b:5638081881, view# -1: non-Primary, number of nodes: 2, my index: 0, protocol version 3
2025-01-29T08:39:44.258273Z 12 [Note] WSREP: Setting wsrep_ready to false
2025-01-29T08:39:44.258305Z 12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2025-01-29T08:39:44.258443Z 12 [Note] WSREP: State transfer required:
Group state: 6a9----------------------------------------082c8b:5638083458
Local state: 6a9----------------------------------------082c8b:5638081881
2025-01-29T08:39:44.258492Z 12 [Note] WSREP: REPL Protocols: 9 (4, 2)
2025-01-29T08:39:44.258534Z 12 [Note] WSREP: REPL Protocols: 9 (4, 2)
2025-01-29T08:39:44.258577Z 12 [Note] WSREP: New cluster view: global state: 6a9----------------------------------------082c8b:5638083458, view# 100: Primary, number of nodes: 3, my index: 1, protocol version 3
2025-01-29T08:39:44.258610Z 12 [Note] WSREP: Setting wsrep_ready to true
2025-01-29T08:39:44.258642Z 12 [Warning] WSREP: Gap in state sequence. Need state transfer.
2025-01-29T08:39:44.258673Z 12 [Note] WSREP: Setting wsrep_ready to false
2025-01-29T08:39:44.258711Z 12 [Note] WSREP: You have configured ‘xtrabackup-v2’ state snapshot transfer method which cannot be performed on a running server. Wsrep provider won’t be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
2025-01-29T08:39:44.258746Z 12 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 2 → 2) (Increment: 3 → 3)
2025-01-29T08:39:44.258777Z 12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2025-01-29T08:39:44.263546Z 12 [Note] WSREP: Assign initial position for certification: 5638083458, protocol version: 4
2025-01-29T08:39:44.264107Z 0 [Note] WSREP: Service thread queue flushed.
2025-01-29T08:39:44.264258Z 12 [Note] WSREP: Check if state gap can be serviced using IST
2025-01-29T08:39:44.266889Z 12 [Note] WSREP: IST receiver addr using tcp://node-3-ip:4568
2025-01-29T08:39:44.268849Z 12 [Note] WSREP: Prepared IST receiver, listening at: tcp://node-3-ip:4568
2025-01-29T08:39:44.268910Z 12 [Note] WSREP: State gap can be likely serviced using IST. SST request though present would be void.
2025-01-29T08:39:44.270707Z 0 [Note] WSREP: Member 1.0 (cluster-node-3) requested state transfer from ‘cluster-node-2’. Selected 2.0 (cluster-node-1)(SYNCED) as donor.
2025-01-29T08:39:44.270763Z 0 [Note] WSREP: Shifting PRIMARY → JOINER (TO: 5638083950)
2025-01-29T08:39:44.270934Z 12 [Note] WSREP: Requesting state transfer: success, donor: 2
2025-01-29T08:39:44.271018Z 12 [Note] WSREP: GCache history reset: 6a9----------------------------------------082c8b:5638081881 → 6a9----------------------------------------082c8b:5638083458
2025-01-29T08:39:44.392963Z 2931253 [Note] Aborted connection 2931253 to db: ‘unconnected’ user: ‘obss’ host: ‘10.100.11.26’ (Got an error reading communication packets)
2025-01-29T08:42:11.711079Z 12 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 26842852480 bytes
2025-01-29T08:42:11.711323Z 12 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 1/493 locked buffers
2025-01-29T08:42:11.720963Z 12 [Note] WSREP: Receiving IST: 1577 writesets, seqnos 5638081881-5638083458
2025-01-29T08:42:11.721414Z 0 [Warning] WSREP: 2.0 (cluster-node-1): State transfer to 1.0 (cluster-node-3) failed: -110 (Connection timed out)
2025-01-29T08:42:11.721447Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():811: Will never receive state. Need to abort.
2025-01-29T08:42:11.721492Z 0 [Note] WSREP: gcomm: terminating thread
2025-01-29T08:42:11.721540Z 0 [Note] WSREP: gcomm: joining thread
2025-01-29T08:42:11.721985Z 0 [Note] WSREP: gcomm: closing backend
2025-01-29T08:42:11.723751Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,46feb866,107)
memb {
96728195,0
}
joined {
}
left {
}
partitioned {
46feb866,0
ade62e74,0
}
)
2025-01-29T08:42:11.723886Z 0 [Note] WSREP: Current view of cluster as seen by this node
view ((empty))
2025-01-29T08:42:11.724004Z 2931241 [Note] Aborted connection 2931241 to db: ‘unconnected’ user: ‘pmm’ host: ‘127.0.0.1’ (Got an error writing communication packets)
2025-01-29T08:42:11.731132Z 0 [Note] WSREP: gcomm: closed
2025-01-29T08:42:11.731248Z 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.

Thank God it didnt work already !

This is our wsrep.cnf

[mysqld]

Path to Galera library

wsrep_provider=/usr/lib/galera3/libgalera_smm.so

Cluster connection URL contains IPs of nodes

#If no IP is found, this implies that a new cluster needs to be created,
#in order to do that you need to bootstrap this node
wsrep_cluster_address=gcomm://node-1-ip,node-2-ip,node-3-ip
#wsrep_provider_options=“gcache.size = 25G”
wsrep_provider_options=“gcache.size = 25G”

In order for Galera to work correctly binlog format should be ROW

binlog_format=ROW

MyISAM storage engine has only experimental support

default_storage_engine=InnoDB

Slave thread to use

wsrep_slave_threads= 12
wsrep_log_conflicts

This changes how InnoDB autoincrement locks are managed and is a requirement for Galera

innodb_autoinc_lock_mode=2

Node IP address

wsrep_node_address=node-3-ip

Cluster name

wsrep_cluster_name=cluster
#If wsrep_node_name is not specified, then system hostname will be used
wsrep_node_name=cluster-node-3
#pxc_strict_mode allowed values: DISABLED,PERMISSIVE,ENFORCING,MASTER
pxc_strict_mode=PERMISSIVE

SST method

wsrep_sst_method=xtrabackup-v2

wsrep_sst_donor=cluster-node-2 " We remove this part right? But this part having all nodes so are we need restart mysql services?"
#Authentication for SST method
wsrep_sst_auth=“sstuser:s3cretPass”

I think the reason our system team change the port being panicked

matthewb · February 4, 2025, 1:17pm

I see this over and over, repeated, repeated. You have network issues that need to be resolved.

reconnecting to ade62e74 (tcp://node-1-ip:4567), attempt 30

connection to peer 46feb866 with addr tcp://node-2-ip:4567 timed out, no messages 
seen in PT3S

Yes, remove this.

aycelen · February 4, 2025, 1:33pm

I reliazed it but i thought the becouse of their change the port after crash the node.

I got it but do we need restart all the nodes?
I ask so much cuz db is 7 TB so our service must be closed that time.
Or we just remove the node 3,
remove them

systemctl stop mysql
rm -rf /var/lib/mysql/grastate.dat
rm -rf /var/lib/mysql/galera.cache
rm -rf /var/lib/mysql/ib_logfile*
rm -rf /var/lib/mysql/ibdata1
systemctl start mysql``
...

and start SST?

The file called gvwstate are we removing this file too?

Vinodh_Krishnaswamy · February 4, 2025, 5:17pm

Hi,

I got it but do we need restart all the nodes?

You don’t need to restart all the nodes. The nodes node1 and node2 are working fine already. So to force SST, you just need to restart node3 after removing grastate.data file from the data directory (removing this file will force SST in the node) and node3 will join the cluster again with SST. Else if you start it normally, it may join back with IST if it can find events to replicate from the donor node. Anyhow, You have to fix port/network issues first and then allow node3 to join via IST or SST.

wsrep_sst_donor=cluster-node-2
Usually, when you mention the above, only node2 can be procured as DONOR. If node2 is not available, the IST/SST fails. To overcome this, you can add a trailing comma (,) at the end which will instruct the Galera to try other nodes as DONOR when node2 is unavailable for IST/SST. like below:

wsrep_sst_donor=cluster-node-2,

for now, you can avoid this parameter as you have network issues. Please make sure the necessary ports are opened already. See here - Secure the network - Percona XtraDB Cluster

Let us know how it goes.

aycelen · February 5, 2025, 7:35am

Hi,

Thanks for all the good explanations.
I will try this out this weekend and update here.
I hope it goes well.

aycelen · February 7, 2025, 6:55am

Hi there,

I hope you all well.
I have a last question we need to start SST this node and do we need delete any DB files on server.
Just remove the grastate.dat is okey?
The tables and other stuff should be stay?
When SST is started will remove automatically?

matthewb · February 7, 2025, 12:52pm

Yes, when an SST is required the datadir will be cleared automatically.

aycelen · February 9, 2025, 8:22am

Hello there me again,

Our SST is complete and 3 nodes are sycned.

But now we had a problem again.
performans_schema.session_variables doesnt match so we had some error and doesnt connect the db with rdp any user(even root).

So that’s the deal,

We search on it and we need to do "mysql_upgrade" for the mysql tables. We thought that it because of the version differences.

crash node 
mysql> select version();
+------------------+
| version()        |
+------------------+
| 5.7.40-43-57-log |

donor node
mysql> select version();
+------------------+
| version()        |
+------------------+
| 5.7.44-48-57-log |
+------------------+
1 row in set (0.00 sec)

The errors;

2025-02-09T07:41:23.427191Z 0 [Note] InnoDB: Crash recovery did not find the parallel doublewrite buffer at /DB/mysql/xb_doublewrite
2025-02-09T07:41:23.429391Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2025-02-09T07:41:24.197895Z 0 [Note] InnoDB: Created parallel doublewrite buffer at /DB/mysql/xb_doublewrite, size 31457280 bytes
2025-02-09T07:41:25.271953Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2025-02-09T07:41:25.272189Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2025-02-09T07:41:25.309224Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2025-02-09T07:41:25.310071Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2025-02-09T07:41:25.310098Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2025-02-09T07:41:25.313471Z 0 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.7.40-43 started; log sequence number 119885958895404
2025-02-09T07:41:25.315698Z 0 [Note] InnoDB: Loading buffer pool(s) from /DB/mysql/ib_buffer_pool
2025-02-09T07:41:25.553879Z 0 [Note] Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
2025-02-09T07:41:25.553929Z 0 [Note] Skipping generation of SSL certificates as certificate files are present in data directory.
2025-02-09T07:41:25.553960Z 0 [Warning] A deprecated TLS version TLSv1 is enabled. Please use TLSv1.2 or higher.
2025-02-09T07:41:25.553974Z 0 [Warning] A deprecated TLS version TLSv1.1 is enabled. Please use TLSv1.2 or higher.
2025-02-09T07:41:25.555016Z 0 [Warning] CA certificate ca.pem is self signed.
2025-02-09T07:41:25.555103Z 0 [Note] Skipping generation of RSA key pair as key files are present in data directory.
2025-02-09T07:41:25.555322Z 0 [Note] Server hostname (bind-address): '0.0.0.0'; port: 3306
2025-02-09T07:41:25.555380Z 0 [Note]   - '0.0.0.0' resolves to '0.0.0.0';
2025-02-09T07:41:25.555458Z 0 [Note] Server socket created on IP: '0.0.0.0'.
2025-02-09T07:41:25.578160Z 0 [Note] Failed to start slave threads for channel ''
2025-02-09T07:41:25.597662Z 0 [ERROR] Incorrect definition of table performance_schema.global_variables: expected column 'VARIABLE_VALUE' at position 1 t              o have type varchar(2048), found type varchar(4096).
2025-02-09T07:41:25.597940Z 0 [ERROR] Incorrect definition of table performance_schema.session_variables: expected column 'VARIABLE_VALUE' at position 1               to have type varchar(1024), found type varchar(4096).
2025-02-09T07:41:25.598733Z 0 [Note] Event Scheduler: Loaded 0 events
2025-02-09T07:41:25.599061Z 0 [Note] WSREP: Signalling provider to continue on SST completion.
2025-02-09T07:41:25.599094Z 0 [Note] WSREP: Initialized wsrep sidno 8
2025-02-09T07:41:25.599135Z 0 [Note] WSREP: SST received: 6a9e2cd4-c80a-11ed-8824-4b6350082c8b:5746569190
2025-02-09T07:41:25.599874Z 2 [Note] WSREP: Receiving IST: 67819 writesets, seqnos 5746569190-5746637009
2025-02-09T07:41:25.600109Z 0 [Note] WSREP: Receiving IST...  0.0% (    0/67819 events) complete.
2025-02-09T07:41:25.604686Z 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.7.40-43-57-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Percona XtraDB Cluster (GPL), Release rel43, Revision ab4d0bd, WSREP vers              ion 31.63, wsrep_31.63
2025-02-09T07:41:27.690780Z 19 [Note] Access denied for user 'UNKNOWN_MYSQL_USER'@'localhost' (using password: NO)

Then we tried to upgrade mysql that error comes to in;

mysql_upgrade: Got error: 2002: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) while connecting to the MySQL server

I think this is related to mysql service not stopping or starting. When we did this the status was “installed”. But this file was not in the path specified here.

Now questions?

Can we do mysql_upgrade while this node is in sync with the cluster because node-2 version is same as crash node.
Will there be any problem for node-2 when we run “mysql_upgrade”?

Isn’t it necessary for mysql service to be active while doing this?

Any suggestions?

matthewb · February 10, 2025, 7:29pm

Yes, you can’t run mysql_upgrade unless mysql is running. If node1 is correct with regards to the schema changes, then simply erase node2’s datadir and let it SST a full copy from node1. That full copy will include the correct schema and you won’t need to run the upgrade on node2

aycelen · February 11, 2025, 1:09pm

aycelen:

2025-02-09T07:41:25.597940Z 0 [ERROR] Incorrect definition of table performance_schema.session_variables: expected column 'VARIABLE_VALUE' at position 1               to have type varchar(1024), found type varchar(4096).

This is how can I solve the problem.

[ERROR] Incorrect definition of table performance_schema.session_variables:

service mysql stop
ps aux | grep mysql
--if it doesnt work db doesn't stop 

/etc/init.d/mysql status

--first kill the mysql services
ps aux | grep mysql

kill -9 UID --mysql ID's

-- MYSQL STOP CLEARLY THAN START THE MYSQL BELOW METHOD

"mysqld --skip-grant-tables --user=mysql --wsrep-provider='none'
ps aux | grep mysql"

-- CHECH IT AN THAN

mysql_upgrade 

--YOU MAY GET A mysql_upgrade: Got error: 2002: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) while connecting to the MySQL server error
--wait finish the mysql activating then try again

mysql_upgrade 

--after finish mysql_upgrade

--stop the mysqld wresp=none than restart the mysql normally 

select * from performance_schema.session_variables; 

--check this query

Topic		Replies	Views
SST failing for third node Percona XtraDB Cluster 8.x	4	1698	July 19, 2022
How to stop and start an XtraDB Mysql cluster of 3 nodes/ Percona XtraDB Cluster 8.x	7	1750	September 19, 2023
Issues with Xtradb cluster Percona XtraDB Cluster 5.x	8	2112	February 13, 2014
Node Shutdown after start Percona XtraDB Cluster 5.x	12	3809	August 4, 2015
Can't reconnect to cluster after reboot Percona XtraDB Cluster 5.x	15	1159	February 5, 2024