whole cluster crashed due to table not synced

andymao · February 26, 2015, 4:53am

Hi All,
We have a 7-node cluster spanning three datacenters, with 2/2/3 nodes per datacenter.
We ran into a problem where the whole cluster stopped functioning because a table was not synced, with an 11-minute delay…

The table was first created on node 1 at 2015-02-25 23:59:50, but it was not synced to the other nodes immediately, and the other nodes failed at 0:11:00 when some operation touched the table. This left node 1 as a standalone node.

We want to understand why the table was not replicated. Is there any monitoring metric that could alert us in such a case?
Is there a known bug for such a case?
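
One way to catch this kind of divergence before a write hits the missing table is to compare the table list of every node on a schedule. The following is only a minimal sketch under stated assumptions: `find_missing_tables` and the node names are hypothetical, the caller is assumed to run `LIST_TABLES_SQL` on each node with its own MySQL client, and this is not a built-in wsrep metric. It would make sense to alert only on a difference that persists across consecutive checks.

```python
from typing import Dict, List, Set

# Query to run on each node; information_schema.tables is standard MySQL.
# How the query is executed (client library, credentials) is left to the caller.
LIST_TABLES_SQL = (
    "SELECT CONCAT(table_schema, '.', table_name) "
    "FROM information_schema.tables "
    "WHERE table_schema NOT IN "
    "('mysql', 'information_schema', 'performance_schema', 'sys')"
)


def find_missing_tables(tables_by_node: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return, per node, the tables that exist on some other node but not on it.

    tables_by_node maps a (hypothetical) node name to the set of
    'schema.table' names returned by LIST_TABLES_SQL on that node.
    Nodes that have every table are omitted from the result.
    """
    all_tables: Set[str] = set().union(*tables_by_node.values())
    missing: Dict[str, List[str]] = {}
    for node, tables in tables_by_node.items():
        diff = sorted(all_tables - tables)
        if diff:
            missing[node] = diff
    return missing
```

Given one set of table names per node, a non-empty result lists the nodes that lack a table present elsewhere in the cluster, which is the condition that led to error 1146 in the logs below.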

Logs from the other 6 nodes:

150226 0:11:00 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 0 seqno: 4190686)
150226 0:11:00 [ERROR] Slave SQL: Error executing row event: ‘Table ‘keystone.credential’ doesn’t exist’, Error_code: 1146
150226 0:11:00 [Warning] WSREP: RBR event 4963 Write_rows apply warning: 1146, 4190686
150226 0:11:00 [Warning] WSREP: Failed to apply app buffer: seqno: 4190686, status: 1
at galera/src/replicator_smm.cpp:apply_wscoll():57
Retrying 2th time
…
…
150226 0:11:04 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 0 seqno: 4190686)
150226 0:11:04 [ERROR] Slave SQL: Error executing row event: ‘Table ‘keystone.credential’ doesn’t exist’, Error_code: 1146
150226 0:11:04 [Warning] WSREP: RBR event 4963 Write_rows apply warning: 1146, 4190686
150226 0:11:04 [ERROR] WSREP: Failed to apply trx: source: 8b8e06eb-594e-11e4-99d3-822b0d796705 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 15113818 trx_id: 4453338 seqnos (l: 4198354, g: 4190686, s: 4190685, d: 4190660, ts: 1424934617949390287)
150226 0:11:04 [ERROR] WSREP: Failed to apply trx 4190686 10 times
150226 0:11:04 [ERROR] WSREP: Node consistency compromized, aborting…
150226 0:11:04 [Note] WSREP: Closing send monitor…
150226 0:11:04 [Note] WSREP: Closed send monitor.
150226 0:11:04 [Note] WSREP: gcomm: terminating thread
150226 0:11:04 [Note] WSREP: gcomm: joining thread
150226 0:11:04 [Note] WSREP: gcomm: closing backend
150226 0:11:04 [Note] WSREP: view(view_id(NON_PRIM,3630e80f-594f-11e4-9fd8-b2a0ad1b9db3,33) memb {
adb89b17-5a25-11e4-833d-c63263a4e8ac,
} joined {
} left {
} partitioned {
3630e80f-594f-11e4-9fd8-b2a0ad1b9db3,
8b8e06eb-594e-11e4-99d3-822b0d796705,
9fd32195-594f-11e4-937f-2b0004fbf107,
cd7de3db-594f-11e4-bc4d-c70bf0d0977b,
f19c53ac-9c38-11e4-a5cb-df74780de7f1,
f814b175-594e-11e4-b14e-becb87dc9620,
})
150226 0:11:04 [Note] WSREP: view((empty))
150226 0:11:04 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
150226 0:11:04 [Note] WSREP: gcomm: closed
150226 0:11:04 [Note] WSREP: Flow-control interval: [16, 16]
150226 0:11:04 [Note] WSREP: Received NON-PRIMARY.
150226 0:11:04 [Note] WSREP: Shifting SYNCED → OPEN (TO: 4190686)
150226 0:11:04 [Note] WSREP: Received self-leave message.
150226 0:11:04 [Note] WSREP: Flow-control interval: [0, 0]
150226 0:11:04 [Note] WSREP: Received SELF-LEAVE. Closing connection.
150226 0:11:04 [Note] WSREP: Shifting OPEN → CLOSED (TO: 4190686)
150226 0:11:04 [Note] WSREP: RECV thread exiting 0: Success
150226 0:11:04 [Note] WSREP: recv_thread() joined.
150226 0:11:04 [Note] WSREP: Closing replication queue.
150226 0:11:04 [Note] WSREP: Closing slave action queue.
150226 0:11:04 [Note] WSREP: /mysql/home/products/mysql/bin/mysqld: Terminated.

Log from node 1, which shows it cannot connect to any of the other nodes after 0:11:04:

150226 0:11:04 [Note] WSREP: (8b8e06eb-594e-11e4-99d3-822b0d796705, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://10.126.52.43:4567
150226 0:11:05 [Note] WSREP: (8b8e06eb-594e-11e4-99d3-822b0d796705, ‘tcp://0.0.0.0:4567’) reconnecting to adb89b17-5a25-11e4-833d-c63263a4e8ac (tcp://10.126.52.43:4567), attempt 0
150226 0:11:05 [Note] WSREP: declaring 3630e80f-594f-11e4-9fd8-b2a0ad1b9db3 stable
150226 0:11:05 [Note] WSREP: declaring 9fd32195-594f-11e4-937f-2b0004fbf107 stable
150226 0:11:05 [Note] WSREP: declaring cd7de3db-594f-11e4-bc4d-c70bf0d0977b stable
150226 0:11:05 [Note] WSREP: declaring f19c53ac-9c38-11e4-a5cb-df74780de7f1 stable
150226 0:11:05 [Note] WSREP: declaring f814b175-594e-11e4-b14e-becb87dc9620 stable
150226 0:11:05 [Note] WSREP: Node 3630e80f-594f-11e4-9fd8-b2a0ad1b9db3 state prim
150226 0:11:05 [Note] WSREP: declaring cd7de3db-594f-11e4-bc4d-c70bf0d0977b stable
150226 0:11:05 [Note] WSREP: declaring f19c53ac-9c38-11e4-a5cb-df74780de7f1 stable
150226 0:11:05 [Note] WSREP: Node 8b8e06eb-594e-11e4-99d3-822b0d796705 state prim
150226 0:11:05 [Warning] WSREP: 8b8e06eb-594e-11e4-99d3-822b0d796705 sending install message failed: Resource temporarily unavailable

Thanks

przemek · April 12, 2015, 7:27am

Are you sure this table was using the InnoDB engine? The most likely cause would be either a MyISAM table or some replication filters in the config file.
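
The engine check can be scripted. This is a rough sketch, not an official tool: `non_innodb_tables` is a hypothetical helper, and the rows are assumed to come from running `NON_INNODB_SQL` (a plain `information_schema` query) with whatever MySQL client is available.

```python
from typing import Iterable, List, Optional, Tuple

# Lists every user table together with its storage engine.
NON_INNODB_SQL = (
    "SELECT table_schema, table_name, engine "
    "FROM information_schema.tables "
    "WHERE table_type = 'BASE TABLE' "
    "AND table_schema NOT IN "
    "('mysql', 'information_schema', 'performance_schema', 'sys')"
)

Row = Tuple[str, str, Optional[str]]


def non_innodb_tables(rows: Iterable[Row]) -> List[str]:
    """Return 'schema.table' for every row whose engine is not InnoDB.

    rows are (table_schema, table_name, engine) tuples as returned by
    NON_INNODB_SQL; a NULL engine is reported as non-InnoDB as well.
    """
    return [
        f"{schema}.{name}"
        for schema, name, engine in rows
        if (engine or "").lower() != "innodb"
    ]
```

An empty list means every user table is InnoDB, in which case the replication filters in the config file would be the next thing to look at.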

VladimirCZ · April 6, 2023, 12:55pm

Hello,
we had a similar problem (MySQL 5.7 with PXC): MySQL shut down while applying a schema change using pt-osc:
/*
…
Retrying 4th time
2023-04-06T12:12:40.972099+01:00 267798 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 82102482250)
2023-04-06T12:12:40.972131+01:00 267798 [ERROR] Slave SQL: Error executing row event: ‘Table ‘nkx._TKO_new’ doesn’t exist’, Error_code: 1146
2023-04-06T12:12:40.972145+01:00 267798 [Warning] WSREP: RBR event 19 Update_rows apply warning: 1, 82102482250
2023-04-06T12:12:40.973388+01:00 267798 [ERROR] WSREP: Failed to re-apply trx: source: b5b1e30a-d26a-11ed-b77e-aebe4fd13657 version: 4 local: 1 state: REPLAYING flags: 1 conn_id: 267798 trx_id: 47812678416 seqnos (l: 211791693, g: 82102482250, s: 82102482220, d: 82102482140, ts: 5100995868975778)
2023-04-06T12:12:40.973438+01:00 267798 [ERROR] WSREP: Failed to apply trx 82102482250 4 times
2023-04-06T12:12:40.973461+01:00 267798 [ERROR] WSREP: Node consistency compromized, aborting…
2023-04-06T12:12:40.974109+01:00 267798 [Note] WSREP: turning isolation on
2023-04-06T12:12:40.974335+01:00 267798 [Note] WSREP: Closing send monitor…
…
*/
The issue occurred when using pt-online-schema-change (a simple MODIFY column_name varchar…) on a small InnoDB table (30K rows, 10 columns) that is heavily used (inserts by triggers from other tables).
Does anybody have an idea or recommendation for how to safely change the table structure without stopping incoming traffic/usage of the DB?

yunus_shaikh · May 11, 2023, 4:42pm

What was the pt-online-schema-change command that you ran?
