To All:
I know a couple of bugs exist in PXC 8. Some result in race conditions that freeze the cluster for 600 seconds until it crashes. Because of this, a co-worker and I worked out the following two items for our setup, which sits behind AVI/Netscalers (we are not using HAProxy). I know this is a bit extreme as a way of reducing end-user impact, but I could not figure out any other means of doing it.
My setup: all reads/writes go to one system. The others act as backups in the event that the primary has an issue.
Script 1 - Runs on the AVI/Netscaler to check each server’s condition. It removes a node from rotation if the node becomes a Donor or is otherwise not in a healthy state.
keepalive_procedure.txt (1.2 KB)
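For anyone reading without the attachment, here is a rough sketch of the kind of decision a check like this makes. It is not the attached procedure. It assumes the check has already collected the node's wsrep status variables (for example from `SHOW GLOBAL STATUS LIKE 'wsrep_%'`) into a dictionary; the function name `node_is_healthy` and the exact set of variables tested are my own choices for illustration.

```python
# Sketch only: not the attached keepalive procedure.
SYNCED = "4"  # wsrep_local_state for a Synced node; "2" is Donor/Desynced


def node_is_healthy(status):
    """Return True if the node should stay in the load balancer pool.

    `status` maps wsrep status variable names to their string values,
    e.g. the rows returned by SHOW GLOBAL STATUS LIKE 'wsrep_%'.
    """
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state") == SYNCED
        and status.get("wsrep_ready") == "ON"
        and status.get("wsrep_connected") == "ON"
    )


# Hypothetical sample values for a synced node and for the same node as a donor.
synced = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state": "4",
    "wsrep_ready": "ON",
    "wsrep_connected": "ON",
}
donor = dict(synced, wsrep_local_state="2")

print(node_is_healthy(synced), node_is_healthy(donor))
```

Treating anything other than Synced as unhealthy is what takes a Donor out of rotation; a missing variable also counts as unhealthy, so a node that cannot be queried properly is removed as well.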
Script 2
High level - Does everything the keepalive check does and also runs a test insert. If the insert takes more than 20 seconds and the node is the primary system, it issues a pkill on that system. This accounts for the stalled condition and reduces the outage caused by the race condition from 600 seconds to less than a minute.
local_script.txt (13.2 KB)
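To illustrate the idea of the insert watchdog without the attachment, here is a minimal sketch. It is not the attached script. The names `insert_stalled`, `watchdog_once` and `kill_database` are hypothetical, `run_insert` stands in for whatever performs the test insert, and the kill command `pkill -9 mysqld` is an assumption on my part (the signal and process name may differ on your system).

```python
# Sketch only: not the attached local script.
import concurrent.futures
import subprocess

INSERT_TIMEOUT_SECONDS = 20.0  # threshold described in the post


def insert_stalled(run_insert, timeout=INSERT_TIMEOUT_SECONDS):
    """Run the test insert in a worker thread; True if it is still running at timeout.

    An insert that fails quickly with an error is not counted as a stall.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(run_insert)
    _, not_done = concurrent.futures.wait([future], timeout=timeout)
    executor.shutdown(wait=False)  # do not block on a hung insert
    return bool(not_done)


def kill_database(command=("pkill", "-9", "mysqld")):
    """Assumed kill command; adjust the signal and process name as needed."""
    subprocess.run(list(command), check=False)


def watchdog_once(run_insert, is_primary, timeout=INSERT_TIMEOUT_SECONDS,
                  kill=kill_database):
    """Kill the database only when the insert stalls on the primary node."""
    if insert_stalled(run_insert, timeout) and is_primary:
        kill()
        return True
    return False
```

Passing `kill` as a parameter makes it possible to dry-run the watchdog with a harmless callback before pointing it at a live node.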
There is one condition that has happened several times and that I have yet to figure out how to address. Every so often, an insert procedure hits a duplicate key (error 1062) and triggers a quorum vote. The primary system goes into Disconnected and the VIP/GSLB moves traffic to another system. The issue is that any persistent connections are still attached to the system that is now Disconnected, and they receive wsrep errors until the connections are terminated. I do not yet know how to deal with that condition.
Thank you