Hello all, I am having an issue where some or all of the nodes in my cluster are losing connection to each other. This used to happen infrequently, say once a week or less, but over the past week or so it has become commonplace. Today it has happened 3 times. While trying to debug this I have ruled out basic network connectivity issues as the culprit. My gut tells me it has to do with our write-heavy load and these servers going out of sync with each other, or something like that. Unfortunately I lack the experience to properly assess this and pinpoint the issue. Any help would be greatly appreciated.
The cluster consists of 5 nodes spread across 2 data centers with a GRE tunnel between them; latency over the link is usually in the 5-15 ms range. There are 3 nodes at site A and 2 at site B. We restrict writes to a single node at A and serve reads from only the 3 nodes at A, with B existing as a DR site. There is an HAProxy instance in front of the cluster to handle the load balancing.
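To make the layout concrete, here is how I understand the quorum arithmetic for a 3 + 2 split. This is only a minimal sketch: it assumes every node carries the default weight of 1 and that the link between the sites fails cleanly, and the function names are made up for illustration rather than anything from Galera itself.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of the votes."""
    return 2 * reachable_votes > total_votes


def surviving_sites(sites: dict) -> dict:
    """For a clean split between sites, report which side keeps quorum.

    Assumes each node has weight 1 (hypothetical simplification).
    """
    total = sum(sites.values())
    return {name: has_quorum(count, total) for name, count in sites.items()}


if __name__ == "__main__":
    # 3 nodes at site A, 2 nodes at site B, as described above.
    print(surviving_sites({"A": 3, "B": 2}))  # {'A': True, 'B': False}
```

If that reasoning holds, a tunnel outage alone should leave site A in the primary component and only site B non-primary, so all nodes dropping at once would point at something else.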
When first spinning up the nodes at the second site I did have some issues due to the added latency, but tweaking some settings in my.cnf seemed to alleviate them. I am unsure exactly what is best to post here in the way of logs or listings to help debug this. All nodes are running CentOS 7 with Percona XtraDB Cluster 5.7.18-29.20.1.el7.x86_64.