I wanted to chime in because I hit a similar error and have some extra debugging information that might help with the search. I believe we are hitting the same bug in MySQL/Percona.
As with Jorang, I also have a DC with 2 nodes (node A, node B), and a second DC with 1 node (node C). 3 nodes in total.
- No local scripts were running on the nodes. That would explain the inconsistency, but as far as I know it is not happening. And even if it were, the sync should have worked the other way around as well (multi-master writes, right?).
- All nodes are running:
Server version: 5.7.18-15-57 Percona XtraDB Cluster (GPL),
Release rel15, Revision 7693d6e,
WSREP version 29.20, wsrep_29.20
- I have checked the release notes of version 5.7.19 but I can’t see any mention of this error.
- Jorang: could you paste your version as well?
- All nodes are Ubuntu 16.04.
- Only node B remained online.
- Node A and Node C were shut down because of a node consistency problem.
- Upon restarting Node A or Node C, a full SST was done (according to the log, it couldn’t do an IST because of the unexpected shutdown).
- I got the same HA_ERR_FOUND_DUPP_KEY error, but in my case it is clearly a unique index on two columns. Jorang, is that “sequence” key something you recognize? Is it a simple unique key, or a unique key on two columns?
- Node A and Node C had the same error in /var/log/mysql/error.log (similar to Jorang’s error):
2017-10-20T08:06:43.877112Z 6 [ERROR] Slave SQL: Could not execute Write_rows event on table himalaya_tdv_renault.inspection; Duplicate entry ‘224-560744’ for key ‘un_inspection_repairorder’, Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event’s master log FIRST, end_log_pos 490, Error_code: 1062
2017-10-20T08:06:43.877135Z 6 [Warning] WSREP: RBR event 5 Write_rows apply warning: 121, 32518800
2017-10-20T08:06:43.877598Z 6 [Warning] WSREP: Failed to apply app buffer: seqno: 32518800, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
…
Retrying 3th time
…
Retrying 4th time
…
2017-10-20T08:06:43.878931Z 6 [ERROR] WSREP: Failed to apply trx 32518800 4 times
2017-10-20T08:06:43.878936Z 6 [ERROR] WSREP: Node consistency compromized, aborting…
2017-10-20T08:06:43.878943Z 6 [Note] WSREP: Closing send monitor…
2017-10-20T08:06:43.878947Z 6 [Note] WSREP: Closed send monitor.
2017-10-20T08:06:43.878953Z 6 [Note] WSREP: gcomm: terminating thread
2017-10-20T08:06:43.878969Z 6 [Note] WSREP: gcomm: joining thread
2017-10-20T08:06:43.879100Z 6 [Note] WSREP: gcomm: closing backend
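In case it helps anyone compare logs across nodes, here is a rough sketch of how the interesting bits (table, duplicate entry, key name, and the seqno Galera gave up on) could be pulled out of an error log. The regular expressions are only my assumption based on the lines pasted above, and the function name `summarize` is made up for this example. It accepts both straight quotes and the curly quotes the forum renders.

```python
import re

# Assumption: patterns are derived only from the log lines quoted in this post.
# Accept straight quotes (as in a raw log) and curly quotes (as rendered here).
Q = r"['‘’]"
DUP_RE = re.compile(
    r"Could not execute Write_rows event on table (?P<table>[\w.]+); "
    r"Duplicate entry " + Q + r"(?P<entry>.*?)" + Q
    + r" for key " + Q + r"(?P<key>.*?)" + Q
)
SEQNO_RE = re.compile(r"Failed to apply trx (?P<seqno>\d+) (?P<attempts>\d+) times")


def summarize(lines):
    """Return (duplicate-key failures, seqnos that could not be applied)."""
    dups, seqnos = [], []
    for line in lines:
        m = DUP_RE.search(line)
        if m:
            dups.append((m.group("table"), m.group("entry"), m.group("key")))
        m = SEQNO_RE.search(line)
        if m:
            seqnos.append(int(m.group("seqno")))
    return dups, seqnos


# Sample input: three of the lines from the log above, copied as shown.
sample = [
    "2017-10-20T08:06:43.877112Z 6 [ERROR] Slave SQL: Could not execute "
    "Write_rows event on table himalaya_tdv_renault.inspection; Duplicate "
    "entry ‘224-560744’ for key ‘un_inspection_repairorder’, Error_code: 1062; "
    "handler error HA_ERR_FOUND_DUPP_KEY; the event’s master log FIRST, "
    "end_log_pos 490, Error_code: 1062",
    "2017-10-20T08:06:43.877598Z 6 [Warning] WSREP: Failed to apply app "
    "buffer: seqno: 32518800, status: 1",
    "2017-10-20T08:06:43.878931Z 6 [ERROR] WSREP: Failed to apply trx "
    "32518800 4 times",
]

if __name__ == "__main__":
    dups, seqnos = summarize(sample)
    for table, entry, key in dups:
        print(f"table={table} entry={entry} key={key}")
    print("failed seqnos:", seqnos)
```

Running the same thing over /var/log/mysql/error.log on Node A and Node C would at least show whether both nodes failed on the same seqno and the same key value.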
I’m guessing the developers would like a reproducible test case, but I don’t have one at the moment and I don’t know where to begin. Any idea on how we can narrow this error down?
As with Jorang, our system is running again, but since I don’t know exactly what caused it, I’m pretty sure it can happen again.