Percona XtraDB write-set replication traffic keep alive

Hi there,

My setup is a 7-node Percona XtraDB 5.7 geo-cluster: 5 nodes are AWS EC2 instances and 2 are vSphere VMs in our on-prem environment. The on-prem nodes can reach the cloud nodes through a Site-to-Site tunnel between our on-prem firewalls and the AWS VPC (the subnets are routed, not NATted).
The cluster is the backend of our internal DNS solution (PowerDNS) and also provides a native multi-master configuration for the DNS servers. Due to the nature of our infrastructure, DNS records rarely change, so there is very little replication traffic. Because of this, our on-prem firewall periodically sends TCP resets to the on-prem nodes for their connections to the AWS nodes on port 4567, which causes daily node evictions and re-joins.
Is it possible to configure some sort of keep-alive functionality to prevent the firewalls from resetting the TCP connections?

Thanks

https://galeracluster.com/library/documentation/galera-parameters.html#evs-keepalive-period

Keepalive heartbeat packets are sent to every node, every 1s by default. Is that not enough to keep your firewall happy?
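If you do want to tune that period, it can be set through `wsrep_provider_options` in my.cnf. A minimal sketch (the `PT…` values are ISO-8601 durations; the specific numbers here are illustrative, not recommendations):

```ini
[mysqld]
# Send EVS keepalive heartbeats every second (the default value),
# and tolerate longer gaps before suspecting a node has gone away.
wsrep_provider_options="evs.keepalive_period=PT1S;evs.suspect_timeout=PT30S"
```

Changing these requires a node restart, so test on one node first.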

Another option is to simply write something to a dummy table every 1s. Use something like pt-heartbeat, which UPDATEs 1 record every 1s. That will create constant traffic. If your nodes are still being evicted after that, then you should check the config of your firewall.
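For the dummy-write approach, a pt-heartbeat invocation might look like the sketch below. The database name, user, and password are placeholders you would substitute; run it against one node and let Galera replicate the UPDATEs to the rest:

```shell
# Create the heartbeat table if it doesn't exist, then UPDATE one row
# every second, detached in the background.
# Placeholders: <db>, <user>, <password>.
pt-heartbeat --update --create-table -D <db> \
  --interval 1 --daemonize \
  -h localhost -u <user> -p <password>
```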

Hi @matthewb

thanks for your suggestions: are the keepalive heartbeat packets sent over TCP port 4567? If so, I think I'm on the wrong track for solving my issue.

Thanks again!

  • 4567 is reserved for Galera Cluster Replication traffic. Multicast replication uses both TCP and UDP transport on this port.

Yes, all Galera traffic goes over 4567.

Thanks for pointing me to the right documentation: are the keepalive messages logged? Do I need to start the mysql daemon with a higher verbosity option?
In the meantime, I will go ahead with firewall troubleshooting.

Thanks again!

The messages are part of the core Galera protocol and are not logged, AFAIK. What is logged is when the heartbeats are missed, and you’ve seen that already.

I was able to confirm the keepalive packets by running a tcpdump capture on the servers, something like this:

sudo tcpdump -n -i <server_interface> 'dst (<node1_ip> or <node2_ip> or <node3_ip> or <node4_ip> or <node5_ip> or <node6_ip>) and dst port 4567'

TCP resets are logged about once or twice a day for every node IP on the firewall, but I'm still unable to find the root cause.
That said, I'll go ahead and close this thread.
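For anyone chasing down which side actually emits the resets, a capture filtered on the RST flag can help; a sketch with the interface name as a placeholder:

```shell
# Show only TCP segments with the RST flag set on the Galera port,
# so the source IP can be compared against the firewall's own logs.
sudo tcpdump -n -i <server_interface> 'tcp port 4567 and tcp[tcpflags] & tcp-rst != 0'
```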

Thank you all guys!
