Whole cluster crashed because a table was not synced

Hi All,
We have a 7-node cluster spanning three datacenters, with 2/2/3 nodes per datacenter.
We hit a problem where the whole cluster stopped functioning because a table was still not synced 11 minutes after it was created…

The table was first created on node 1 at 2015-02-25 23:59:50, but it was not synced to the other nodes immediately, and the other nodes failed at 0:11:00 when an operation was run against the table. This left node 1 as a standalone node.

We want to understand why the table was not replicated. Is there any monitoring metric that could alert on such a case?
Is there a known bug for such a case?

Logs from the other 6 nodes:

150226 0:11:00 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 0 seqno: 4190686)
150226 0:11:00 [ERROR] Slave SQL: Error executing row event: ‘Table ‘keystone.credential’ doesn’t exist’, Error_code: 1146
150226 0:11:00 [Warning] WSREP: RBR event 4963 Write_rows apply warning: 1146, 4190686
150226 0:11:00 [Warning] WSREP: Failed to apply app buffer: seqno: 4190686, status: 1
at galera/src/replicator_smm.cpp:apply_wscoll():57
Retrying 2th time


150226 0:11:04 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 0 seqno: 4190686)
150226 0:11:04 [ERROR] Slave SQL: Error executing row event: ‘Table ‘keystone.credential’ doesn’t exist’, Error_code: 1146
150226 0:11:04 [Warning] WSREP: RBR event 4963 Write_rows apply warning: 1146, 4190686
150226 0:11:04 [ERROR] WSREP: Failed to apply trx: source: 8b8e06eb-594e-11e4-99d3-822b0d796705 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 15113818 trx_id: 4453338 seqnos (l: 4198354, g: 4190686, s: 4190685, d: 4190660, ts: 1424934617949390287)
150226 0:11:04 [ERROR] WSREP: Failed to apply trx 4190686 10 times
150226 0:11:04 [ERROR] WSREP: Node consistency compromized, aborting…
150226 0:11:04 [Note] WSREP: Closing send monitor…
150226 0:11:04 [Note] WSREP: Closed send monitor.
150226 0:11:04 [Note] WSREP: gcomm: terminating thread
150226 0:11:04 [Note] WSREP: gcomm: joining thread
150226 0:11:04 [Note] WSREP: gcomm: closing backend
150226 0:11:04 [Note] WSREP: view(view_id(NON_PRIM,3630e80f-594f-11e4-9fd8-b2a0ad1b9db3,33) memb {
adb89b17-5a25-11e4-833d-c63263a4e8ac,
} joined {
} left {
} partitioned {
3630e80f-594f-11e4-9fd8-b2a0ad1b9db3,
8b8e06eb-594e-11e4-99d3-822b0d796705,
9fd32195-594f-11e4-937f-2b0004fbf107,
cd7de3db-594f-11e4-bc4d-c70bf0d0977b,
f19c53ac-9c38-11e4-a5cb-df74780de7f1,
f814b175-594e-11e4-b14e-becb87dc9620,
})
150226 0:11:04 [Note] WSREP: view((empty))
150226 0:11:04 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
150226 0:11:04 [Note] WSREP: gcomm: closed
150226 0:11:04 [Note] WSREP: Flow-control interval: [16, 16]
150226 0:11:04 [Note] WSREP: Received NON-PRIMARY.
150226 0:11:04 [Note] WSREP: Shifting SYNCED → OPEN (TO: 4190686)
150226 0:11:04 [Note] WSREP: Received self-leave message.
150226 0:11:04 [Note] WSREP: Flow-control interval: [0, 0]
150226 0:11:04 [Note] WSREP: Received SELF-LEAVE. Closing connection.
150226 0:11:04 [Note] WSREP: Shifting OPEN → CLOSED (TO: 4190686)
150226 0:11:04 [Note] WSREP: RECV thread exiting 0: Success
150226 0:11:04 [Note] WSREP: recv_thread() joined.
150226 0:11:04 [Note] WSREP: Closing replication queue.
150226 0:11:04 [Note] WSREP: Closing slave action queue.
150226 0:11:04 [Note] WSREP: /mysql/home/products/mysql/bin/mysqld: Terminated.

Log from node 1, which shows it can’t connect to any of the other nodes after 0:11:04:

150226 0:11:04 [Note] WSREP: (8b8e06eb-594e-11e4-99d3-822b0d796705, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://10.126.52.43:4567
150226 0:11:05 [Note] WSREP: (8b8e06eb-594e-11e4-99d3-822b0d796705, ‘tcp://0.0.0.0:4567’) reconnecting to adb89b17-5a25-11e4-833d-c63263a4e8ac (tcp://10.126.52.43:4567), attempt 0
150226 0:11:05 [Note] WSREP: declaring 3630e80f-594f-11e4-9fd8-b2a0ad1b9db3 stable
150226 0:11:05 [Note] WSREP: declaring 9fd32195-594f-11e4-937f-2b0004fbf107 stable
150226 0:11:05 [Note] WSREP: declaring cd7de3db-594f-11e4-bc4d-c70bf0d0977b stable
150226 0:11:05 [Note] WSREP: declaring f19c53ac-9c38-11e4-a5cb-df74780de7f1 stable
150226 0:11:05 [Note] WSREP: declaring f814b175-594e-11e4-b14e-becb87dc9620 stable
150226 0:11:05 [Note] WSREP: Node 3630e80f-594f-11e4-9fd8-b2a0ad1b9db3 state prim
150226 0:11:05 [Note] WSREP: declaring cd7de3db-594f-11e4-bc4d-c70bf0d0977b stable
150226 0:11:05 [Note] WSREP: declaring f19c53ac-9c38-11e4-a5cb-df74780de7f1 stable
150226 0:11:05 [Note] WSREP: Node 8b8e06eb-594e-11e4-99d3-822b0d796705 state prim
150226 0:11:05 [Warning] WSREP: 8b8e06eb-594e-11e4-99d3-822b0d796705 sending install message failed: Resource temporarily unavailable

Thanks
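
On the monitoring question above, here is a minimal sketch in Python of two checks that might help. It assumes you already collect the output of `SHOW GLOBAL STATUS LIKE 'wsrep_%'` from each node into a dict, and the list of `schema.table` names from each node into a set; how you collect them is up to you. The function names `wsrep_alerts` and `missing_tables` and the threshold `max_recv_queue_avg` are hypothetical, chosen only for illustration. Note that the status variables flag a node that has dropped out of the cluster or is lagging, but they would probably not reveal a table that exists on one node only, which is why the second function compares table lists directly.

```python
# Sketch only: flag unhealthy Galera status values and schema drift.
# `status` is a dict built from SHOW GLOBAL STATUS LIKE 'wsrep_%' on one node.

def wsrep_alerts(status, expected_size, max_recv_queue_avg=1.0):
    """Return a list of alert strings; an empty list means no alert."""
    alerts = []
    if status.get("wsrep_cluster_status") != "Primary":
        alerts.append("node is not in the primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        alerts.append("node state is not Synced")
    if status.get("wsrep_ready") != "ON":
        alerts.append("wsrep_ready is not ON")
    if int(status.get("wsrep_cluster_size", 0)) < expected_size:
        alerts.append("cluster is smaller than expected")
    # Threshold is an assumption; tune it for your workload.
    if float(status.get("wsrep_local_recv_queue_avg", 0.0)) > max_recv_queue_avg:
        alerts.append("receive queue is backing up")
    return alerts


def missing_tables(tables_by_node):
    """Map each node to the sorted tables that exist elsewhere but not on it."""
    all_tables = set().union(*tables_by_node.values())
    result = {}
    for node, tables in tables_by_node.items():
        missing = sorted(all_tables - set(tables))
        if missing:
            result[node] = missing
    return result
```

Running `missing_tables` on a schedule, right after schema changes, could catch a case like `keystone.credential` existing on node 1 only before a row event for that table reaches the other nodes.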

Are you sure this table was using the InnoDB engine? The most likely cause would be either a MyISAM table or some replication filters in the config file.
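
One way to check the storage engine is to query `information_schema.tables`. The sketch below, in Python, holds such a query as a string and filters the rows it returns; it assumes you run the query with whatever MySQL client library you already use. The names `NON_INNODB_QUERY`, `SYSTEM_SCHEMAS` and `non_innodb_tables` are hypothetical, and the list of system schemas to skip is an assumption.

```python
# Sketch only: find user tables that do not use InnoDB.
# Run NON_INNODB_QUERY with your own MySQL client; rows are
# (table_schema, table_name, engine) tuples.

NON_INNODB_QUERY = """
SELECT table_schema, table_name, engine
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
"""

# Assumed list of schemas to ignore; adjust for your server version.
SYSTEM_SCHEMAS = {"mysql", "information_schema", "performance_schema", "sys"}


def non_innodb_tables(rows):
    """Return 'schema.table (engine)' strings for non-InnoDB user tables."""
    found = []
    for schema, table, engine in rows:
        if schema.lower() in SYSTEM_SCHEMAS:
            continue
        if (engine or "").lower() != "innodb":
            found.append("%s.%s (%s)" % (schema, table, engine))
    return sorted(found)
```

If `keystone.credential` shows up in the result on node 1, that would support the MyISAM explanation; if it does not, the replication filters in the config file are the next thing to look at.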

Hello,
We had a similar problem (MySQL 5.7 with PXC): MySQL shut down while applying a schema change using pt-osc:
/*

Retrying 4th time
2023-04-06T12:12:40.972099+01:00 267798 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 82102482250)
2023-04-06T12:12:40.972131+01:00 267798 [ERROR] Slave SQL: Error executing row event: ‘Table ‘nkx._TKO_new’ doesn’t exist’, Error_code: 1146
2023-04-06T12:12:40.972145+01:00 267798 [Warning] WSREP: RBR event 19 Update_rows apply warning: 1, 82102482250
2023-04-06T12:12:40.973388+01:00 267798 [ERROR] WSREP: Failed to re-apply trx: source: b5b1e30a-d26a-11ed-b77e-aebe4fd13657 version: 4 local: 1 state: REPLAYING flags: 1 conn_id: 267798 trx_id: 47812678416 seqnos (l: 211791693, g: 82102482250, s: 82102482220, d: 82102482140, ts: 5100995868975778)
2023-04-06T12:12:40.973438+01:00 267798 [ERROR] WSREP: Failed to apply trx 82102482250 4 times
2023-04-06T12:12:40.973461+01:00 267798 [ERROR] WSREP: Node consistency compromized, aborting…
2023-04-06T12:12:40.974109+01:00 267798 [Note] WSREP: turning isolation on
2023-04-06T12:12:40.974335+01:00 267798 [Note] WSREP: Closing send monitor…

*/
The issue occurred when using pt-online-schema-change (a simple MODIFY column_name varchar…) on a small InnoDB table (30K rows, 10 columns) that is heavily used (inserts by triggers from other tables).
Does anybody have an idea or recommendation for how to safely change the table structure without stopping incoming traffic/usage of the DB?

What was the pt-online-schema-change command that you ran?