My setup is a 7-node Percona XtraDB Cluster 5.7 geo-cluster: 5 nodes are AWS EC2 instances and 2 are vSphere VMs in our on-prem environment. The on-prem nodes can reach the cloud nodes through a Site-to-Site tunnel between our on-prem firewalls and the AWS VPC (the subnets are routed, not NATted).
The cluster is the backend of our internal DNS solution (PowerDNS) and provides a native multi-master configuration for the DNS servers as well. Due to the nature of our infrastructure, DNS records rarely change, so there is little replication traffic. Because of this, our on-prem firewall periodically sends TCP resets to the on-prem nodes for their connections to the AWS nodes on port 4567, which causes daily node evictions and re-joins.
Is it possible to configure some sort of keep-alive functionality to prevent the firewalls from resetting the TCP connections?
Keepalive heartbeat packets are sent to every node, every 1 second by default. Is this not enough to keep your firewall happy?
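If you want to verify or tune that interval, the EVS keepalive period (and the related timeouts that govern eviction) are exposed through `wsrep_provider_options`. A minimal my.cnf sketch; the values shown here are the Galera defaults, adjust to taste:

```ini
[mysqld]
# evs.keepalive_period: how often keepalive probes are sent on port 4567.
# evs.suspect_timeout / evs.inactive_timeout: how long a silent node is
# tolerated before it is suspected and then evicted.
wsrep_provider_options="evs.keepalive_period=PT1S;evs.suspect_timeout=PT5S;evs.inactive_timeout=PT15S"
```

Note that these use ISO 8601 duration syntax (PT1S = 1 second), and changing them requires a node restart to take effect via my.cnf.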
Another option is to simply write something to a dummy table every second. Use something like pt-heartbeat, which UPDATEs one record every second; that creates constant traffic. If your nodes are still being evicted after that, you should check your firewall's configuration.
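A sketch of what such a pt-heartbeat invocation could look like; the host, user, and database names below are placeholders, not values from this thread:

```shell
# Run pt-heartbeat in update mode: it writes one row to the heartbeat
# table every --interval seconds, generating steady replication traffic.
# --create-table creates the heartbeat table on first run if it is missing.
# Host, user, and database are hypothetical placeholders.
pt-heartbeat \
  --update \
  --create-table \
  --database percona \
  --table heartbeat \
  --interval 1 \
  --host node1.example.com \
  --user ptheartbeat --ask-pass \
  --daemonize
```

You would run one updater per writable node (or just one, since Galera replicates the writes cluster-wide); the point here is only to keep the replication links busy, not to measure lag.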
Thanks for your suggestions. Are the keepalive heartbeat packets sent over TCP port 4567? If so, I think I'm on the wrong track to solve my issue.
4567 is reserved for Galera Cluster Replication traffic. Multicast replication uses both TCP and UDP transport on this port.
Yes, all Galera traffic goes over 4567.
Thanks for pointing me to the right documentation. Are the keepalive messages logged? Do I need to start the mysql daemon with a higher verbosity option?
In the meantime, I will go ahead with firewall troubleshooting.
The messages are part of the core Galera protocol and are not logged, AFAIK. What is logged is when the heartbeats are missed, and you’ve seen that already.
I was able to confirm the keepalive packets by running a tcpdump capture on the servers, something like this:
sudo tcpdump -n -i <server_interface> 'dst (<node1_ip> or <node2_ip> or <node3_ip> or <node4_ip> or <node5_ip> or <node6_ip>) and dst port 4567'
TCP resets are logged about once or twice a day for every node IP on the firewall, but I'm still unable to find the root cause.
That said, I'll go ahead and close this thread.
Thank you all!