Disconnected Node and KeepAlives

meyerder · January 31, 2022, 3:30pm

To All:

I know a couple of bugs exist in PXC 8. Some result in race conditions that will freeze the cluster for 600 seconds until it crashes. As such, a co-worker and I worked out the following two items for our setup, which sits behind AVI/NetScalers (not HAProxy). I know this is a bit extreme as a way of reducing the end-user impact, but I could not figure out any other means of doing it.

My setup: all reads/writes go to one system. The others act as backups in the event that the primary has an issue.

Script 1 - Runs on the AVI/NetScaler to check each server's condition. It will remove a node from rotation if it becomes a Donor or otherwise leaves a healthy state.

keepalive_procedure.txt (1.2 KB)
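For readers who cannot open the attachment, here is a minimal sketch of the kind of decision such a keepalive check makes. It is not the contents of keepalive_procedure.txt. It assumes the check has already read the node's wsrep status variables (for example from `SHOW GLOBAL STATUS LIKE 'wsrep_%'`) into a dictionary; the names `REQUIRED_STATUS`, `node_is_healthy`, and `unhealthy_reasons` are made up for illustration.

```python
# Hypothetical sketch, NOT the attached keepalive_procedure.txt.
# Input: a dict of wsrep status variables already fetched from the node.

# Values a node is assumed to need before the load balancer sends it traffic.
REQUIRED_STATUS = {
    "wsrep_cluster_status": "Primary",      # node belongs to the primary component
    "wsrep_connected": "ON",                # node is connected to the cluster
    "wsrep_ready": "ON",                    # node accepts queries
    "wsrep_local_state_comment": "Synced",  # a donor reports "Donor/Desynced"
}


def unhealthy_reasons(status):
    """List every required variable whose value differs from the expected one."""
    return [
        f"{name}={status.get(name)!r} (expected {expected!r})"
        for name, expected in REQUIRED_STATUS.items()
        if status.get(name) != expected
    ]


def node_is_healthy(status):
    """Return True only when no required variable is missing or wrong."""
    return not unhealthy_reasons(status)


if __name__ == "__main__":
    donor = dict(REQUIRED_STATUS, wsrep_local_state_comment="Donor/Desynced")
    print(node_is_healthy(donor), unhealthy_reasons(donor))
```

A health monitor built on this would mark the node down whenever `node_is_healthy` returns False, which covers both the Donor case and a Disconnected node.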

Script 2
High level: does everything the keepalive check does and also performs an insert. If the insert takes more than 20 seconds and the node is the primary system, it issues a pkill on that system. This accounts for the stalled condition and cuts the outage caused by the race condition from 600 seconds to less than a minute.

local_script.txt (13.2 KB)
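The timed-insert idea can be sketched roughly as follows. This is an illustration under assumptions, not the attached local_script.txt: `probe` stands in for whatever function performs the test insert, the 20-second limit comes from the description above, and `run_probe` and `decide_action` are hypothetical names. The actual kill step is left out on purpose, since the post does not say what the pkill targets.

```python
# Hypothetical sketch, NOT the attached local_script.txt.
import concurrent.futures
import time

INSERT_TIMEOUT_S = 20.0  # threshold taken from the description above


def run_probe(probe, timeout_s=INSERT_TIMEOUT_S):
    """Run probe() in a worker thread; report 'ok', 'stalled', or 'error'."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe)
    try:
        future.result(timeout=timeout_s)
        return "ok"
    except concurrent.futures.TimeoutError:
        return "stalled"   # the insert did not finish within the limit
    except Exception:
        return "error"     # the insert failed outright, which is not a stall
    finally:
        # Do not block on a hung probe; the worker thread is left to finish.
        executor.shutdown(wait=False)


def decide_action(is_primary, probe_result):
    """Only a stalled insert on the primary system triggers the kill."""
    if is_primary and probe_result == "stalled":
        return "kill"
    return "none"


if __name__ == "__main__":
    # Simulate a stall with a short sleep and a shorter timeout.
    result = run_probe(lambda: time.sleep(0.5), timeout_s=0.1)
    print(result, decide_action(True, result))
```

Keeping the decision separate from the probe makes it easy to log a stall on a backup node without acting on it.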

There is one condition that has happened several times and that I have yet to figure out how to address. Every once in a while, an insert procedure initiates a quorum vote when it hits a duplicate key (error 1062). The primary system goes into Disconnected and the VIP/GSLB moves traffic to another system. The issue is that any persistent connections are still attached to the system that is now Disconnected, and they receive wsrep errors until the connections are terminated. I do not yet know how to deal with that condition.

Thank you

Michael_Coburn · January 31, 2022, 7:01pm

Hi @meyerder

Could you share the error logs from the node being evicted along with a node that remained in the cluster?

This is not expected behaviour on a DUPLICATE KEY violation: these should be resolved within InnoDB and never rise to the wsrep/Galera level, unless you actually do have data differences between your nodes. In that case you'd want to validate consistency, either by forcing SST from a single node or by executing pt-table-checksum to identify the data differences.
