We have two datacenters with 2 nodes each and a third datacenter with a single node.
Sometimes an insert on one node would take down every other node with a “Node consistency compromized” error, bringing down the cluster.
Looking at the log we could see the error:
[ERROR] Slave SQL: Could not execute Write_rows event on table main.documents; Duplicate entry '3729-3882600-01P2017-17040' for key 'sequence', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 455, Error_code: 1062
It looked as if nodes were accepting the same statement more than once (from different nodes), so that applying the replicated transactions raised the duplicate-key error.
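To see which tables and keys are involved across all nodes, the error lines can be pulled out of the logs with a small script. This is only a minimal sketch: the regular expression is modelled on the single log line quoted above, and the names `DUP_RE`, `parse_duplicate_error` and `count_conflicts` are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the log line quoted above. Quotes may be straight or
# typographic depending on how the log was copied, so both are accepted.
DUP_RE = re.compile(
    r"Could not execute (?P<event>\w+) event on table (?P<table>[\w.]+); "
    r"Duplicate entry ['\u2018\u2019](?P<entry>.+?)['\u2018\u2019] "
    r"for key ['\u2018\u2019](?P<key>.+?)['\u2018\u2019], "
    r"Error_code: (?P<code>\d+)"
)


def parse_duplicate_error(line):
    """Return the fields of a duplicate-key apply error, or None if no match."""
    match = DUP_RE.search(line)
    return match.groupdict() if match else None


def count_conflicts(lines):
    """Count duplicate-key errors per (table, key) pair."""
    counts = Counter()
    for line in lines:
        info = parse_duplicate_error(line)
        if info:
            counts[(info["table"], info["key"])] += 1
    return counts


sample = (
    "[ERROR] Slave SQL: Could not execute Write_rows event on table "
    "main.documents; Duplicate entry '3729-3882600-01P2017-17040' for key "
    "'sequence', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; "
    "the event's master log FIRST, end_log_pos 455, Error_code: 1062"
)
info = parse_duplicate_error(sample)
print(info["table"], info["key"], info["entry"])
```

Running `count_conflicts` over the error log of each node would show whether the failures always hit the same table and unique key.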
So we investigated, read more about WAN deployments, and made some changes, like defining segments, changing the network configuration, using transactions instead of autocommitted statements…
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_slow_start_after_idle = 0
innodb_flush_log_at_trx_commit = 2
binlog_row_image = minimal
wsrep_sync_wait = 2
wsrep_provider_options = ’
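To rule out the possibility that the net.* kernel settings above were written to a config file but never loaded, the live values can be compared against the intended ones. The following is a rough sketch that assumes a Linux host exposing the values under /proc/sys; the helper names are hypothetical.

```python
from pathlib import Path

# Intended values, copied from the kernel settings listed above.
EXPECTED = {
    "net.core.rmem_max": "16777216",
    "net.core.wmem_max": "16777216",
    "net.core.rmem_default": "16777216",
    "net.core.wmem_default": "16777216",
    "net.ipv4.tcp_rmem": "4096 87380 16777216",
    "net.ipv4.tcp_wmem": "4096 65536 16777216",
    "net.ipv4.tcp_slow_start_after_idle": "0",
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root) / key.replace(".", "/")


def read_sysctl(key, root="/proc/sys"):
    """Return the live value with whitespace normalised, or None if unreadable."""
    try:
        # Multi-value entries are tab-separated in /proc/sys.
        return " ".join(sysctl_path(key, root).read_text().split())
    except OSError:
        return None


def find_mismatches(expected, root="/proc/sys"):
    """Return {key: live_value} for every key whose live value differs."""
    mismatches = {}
    for key, want in expected.items():
        got = read_sysctl(key, root)
        if got != want:
            mismatches[key] = got
    return mismatches


if __name__ == "__main__":
    for key, got in find_mismatches(EXPECTED).items():
        print(f"{key}: expected {EXPECTED[key]!r}, found {got!r}")
```

No output means every listed key already has the intended value on that node.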
Despite that, we keep getting this kind of error…