Detection/Alert/Failover

I recently ran into an issue that I am struggling to overcome, and I have not yet found anything that gets me where I need to be.

Situation:
The other day, in a three-node cluster, one system hit a drive-full condition (I know, better monitoring; I had it, but there was a typo in the alerting). Once that happened, it kicked all of the nodes into non-Primary mode, even though I was in an HA setup and the keepalive I had was calling the procedure below. The systems are geographically diverse and this setup has worked very well for me for a long time. Because the one system pushed the other two into non-Primary mode, the entire application went down. This was a soft failure, and I need it to behave like a hard failure. How can I account for a "soft" failure very quickly?

Ideally I think I would like to kill the process on the node that caused this, but I am not quite sure how. My plan was to write a script that connects to the load balancer, grabs the hostname, checks the status values, compares them against its own values, and, if the hostname returned by the load balancer matches its own, kills the process; otherwise it does nothing. I was worried about a potential cascade effect, though, and I think I would only be able to run it on two of the three systems.

**Note: I am not worried about data corruption, as the other two systems have the data and I keep good backups as well.

DELIMITER $$

USE `GSSKeepAlive`$$

DROP PROCEDURE IF EXISTS `sp_GSSKeepAlive`$$

CREATE DEFINER=`xxxxxxxx`@`%` PROCEDURE `sp_GSSKeepAlive`(a VARCHAR(15))
BEGIN
    DECLARE my_wsrep_local_state_comment VARCHAR(25);
    DECLARE my_wsrep_cluster_status VARCHAR(25);
    DECLARE my_read_only VARCHAR(25);
    DECLARE my_wsrep_ready VARCHAR(25);

    SELECT VARIABLE_VALUE INTO my_wsrep_local_state_comment FROM performance_schema.global_status WHERE VARIABLE_NAME = 'wsrep_local_state_comment';
    SELECT VARIABLE_VALUE INTO my_wsrep_cluster_status FROM performance_schema.global_status WHERE VARIABLE_NAME = 'wsrep_cluster_status';
    SELECT VARIABLE_VALUE INTO my_read_only FROM performance_schema.global_variables WHERE VARIABLE_NAME = 'READ_ONLY';
    SELECT VARIABLE_VALUE INTO my_wsrep_ready FROM performance_schema.global_status WHERE VARIABLE_NAME = 'wsrep_ready';

  
    -- Return a row only when this node is fully usable; an empty result
    -- means the check fails and the load balancer treats the node as unhealthy.
    IF my_wsrep_local_state_comment = 'Synced'
        AND my_wsrep_cluster_status = 'Primary'
        AND my_read_only = 'OFF'
        AND my_wsrep_ready = 'ON'
    THEN
        SELECT my_wsrep_local_state_comment, my_wsrep_cluster_status, my_read_only, my_wsrep_ready;
    END IF;
END$$

DELIMITER ;

Hello @meyerder, why have a script connect to the load balancer? Typically the load balancer checks each backend by connecting to a monitoring port and issuing an HTTP request. That request executes a script which logs in to mysql locally and checks the cluster status.

This script, scripts/clustercheck.sh on the 8.0 branch of the percona/percona-xtradb-cluster GitHub repo, is one we officially distribute. You can create a systemd unit file that listens on a port and executes this script, and your load balancer will either A) keep the backend in the pool, or B) remove it (hard fail).
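For example, a socket-activated pair along these lines would work (this is only a sketch, not our packaged units; the mysqlchk names, port 9200, and the /usr/bin/clustercheck path are assumptions you would adapt to your install):

# /etc/systemd/system/mysqlchk.socket
[Unit]
Description=Listener for the PXC clustercheck health probe

[Socket]
ListenStream=9200
Accept=yes

[Install]
WantedBy=sockets.target

# /etc/systemd/system/mysqlchk@.service  (template: one instance per probe connection)
[Unit]
Description=Run clustercheck for a single health-probe connection

[Service]
ExecStart=/usr/bin/clustercheck
StandardInput=socket
StandardOutput=socket

Enable it with "systemctl enable --now mysqlchk.socket" and point the load balancer's health check at that port.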

Yeah, I modified the script some…

Let me be clear about "why have a script connect to the load balancer": I do not. The load balancer calls each server and runs the procedure above. Depending on the output returned to the load balancer, it decides whether the node is healthy and what to do (terminate, redirect, or otherwise). I am basically doing what you suggest, just without a web server.
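In other words, the per-node check the load balancer runs amounts to something like this (a sketch only; the monitoring credentials are placeholders, and "row returned = healthy" is simply how the procedure above behaves):

#!/bin/bash
# Sketch of a per-backend health probe. MON_USER/MON_PASS are placeholder
# monitoring credentials and NODE is the backend currently being probed.
NODE="$1"
MON_USER="monitor"
MON_PASS="secret"

result=$(mysql -N -h "$NODE" -u "$MON_USER" -p"$MON_PASS" \
         -e "CALL GSSKeepAlive.sp_GSSKeepAlive('check');" 2>/dev/null)

if [ -n "$result" ]; then
    # The procedure returned a row: Synced, Primary, read_only=OFF, wsrep_ready=ON.
    echo "OK"
    exit 0
else
    # No row (or no connection at all): the load balancer should pull this backend.
    echo "FAIL"
    exit 1
fi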

Now, with that said, even the script you suggested seems to have the same underlying issue that I experienced. Let us assume the cluster check is running on each of the servers and giving the load balancer an OK/fail response. In my case this script would have done exactly the same thing that happened to me (unless I am missing something).

My underlying issue is that all three of my servers went into non-Primary mode at almost exactly the same time because one server had a full drive. Ideally I would "suspect" this should turn just that server non-Primary, but the other servers essentially said, "well, we are still talking to it, but something isn't quite right" (the servers went to a Synced, non-Primary state).

When this happens, as far as I can tell the quick and dirty solution is to kill the server process on the system with the full drive (it then stops responding, the other two servers form the quorum, and the cluster continues to work).
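For example, one blunt local guard would be a cron job along these lines (purely an illustration of "kill it before it drags the cluster down", not something from the docs; the datadir path and the 95% threshold are assumptions I would have to tune):

#!/bin/bash
# Hypothetical local watchdog: if the filesystem holding the datadir is
# nearly full, stop mysqld on THIS node so the other two nodes keep quorum.
# DATADIR and THRESHOLD are placeholders, not values from this thread.
DATADIR="/var/lib/mysql"
THRESHOLD=95

used=$(df -P "$DATADIR" | awk 'NR==2 {gsub("%","",$5); print $5}')

if [ "$used" -ge "$THRESHOLD" ]; then
    logger "datadir ${used}% full - stopping mysqld to protect cluster quorum"
    pkill -f /usr/sbin/mysqld
fi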

That is why I am attempting to figure out how to account for this type of soft failure (or am I missing something in the docs that I am just not getting?).

Thanks

**Side Note:
My co-worker leveraged/modified the script you posted. It now does the following, which MAY account for some of my items; the key point is that it only does the pkill if the hostname reported by the database matches the local hostname (a condensed sketch follows the outline below). My only fear is that, in an event like this, if for some reason the script cannot resolve the load-balanced name, I am not quite sure what might happen.

Description - High Level
    The script runs every minute on every node. It performs the following checks and also does a write:
        Connects to the load balancer name and checks:
            wsrep_local_state_comment  -- must be Synced
            wsrep_cluster_status       -- must be Primary
            read_only                  -- must be OFF
            wsrep_ready                -- must be ON
        Reads the hostname of the node that is serving the database
        Reads the local hostname
        Runs the following insert:
            insert into GSSKeepAlive.GSSKeepAlive (node) values (@@hostname);
        If the insert does not complete within 50 seconds:
            If the hostname the script is running on matches the hostname reported by the database:
                pkill -f /usr/sbin/mysqld
                send text and email notification (notification list TBD)
            If the hostname the script is running on does not match the database hostname:
                Exit
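Roughly, the flow looks like this (a condensed sketch only; the load balancer name and monitoring credentials are placeholders, and the wsrep status checks from the outline are omitted for brevity):

#!/bin/bash
# Sketch of the keepalive flow described above; runs from cron once a minute
# on every node. LB_HOST, MON_USER and MON_PASS are placeholders.
LB_HOST="cluster-lb.example.com"
MON_USER="monitor"
MON_PASS="secret"

local_host=$(hostname)

# Which backend is the load-balanced name actually pointing at right now?
db_host=$(mysql -N -h "$LB_HOST" -u "$MON_USER" -p"$MON_PASS" \
          -e "SELECT @@hostname;" 2>/dev/null)

if [ -z "$db_host" ]; then
    # Could not reach or resolve the load-balanced name: do nothing rather
    # than risk killing a node we cannot even identify as the culprit.
    exit 0
fi

# The canary write through the load balancer, with a 50-second budget.
if ! timeout 50 mysql -h "$LB_HOST" -u "$MON_USER" -p"$MON_PASS" \
        -e "INSERT INTO GSSKeepAlive.GSSKeepAlive (node) VALUES (@@hostname);" 2>/dev/null
then
    # The write hung or failed. Only act if the node behind the load balancer
    # is this very node; otherwise exit and let that node's copy of the script act.
    if [ "$db_host" = "$local_host" ]; then
        pkill -f /usr/sbin/mysqld
        # text/email notification would go here (list TBD)
    fi
fi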

This is a known issue in Galera/PXC: the server with the full disk cannot commit the transaction, yet it still responds to heartbeat messages. That is why the other two nodes continue to see that node as "online" and "healthy". Eventually you hit timeouts and the cluster goes non-Primary.

I found several bugs on this in our JIRA, but they indicate that PXC 8 contains the fix. If you are on the latest 8.0 and are still experiencing this issue, I encourage you to open a new bug at https://jira.percona.com/ to let our developers know.

@matthewb I was looking around and did not find those bugs… I may not be searching for the right terms. Can you give me one or two of them?