Too many connections brings down cluster

steven_Lin · August 2, 2019, 1:13pm

I also experienced the problem “Too many connections”. I don’t know why…but I still need to handle the problem.
I found that It if i killed all connections,the dead node will be back.so I wrote a script to do this.

vi /etc/my.cnf
add the flowing
extra_max_connections = 8
extra_port = 33333
wirte a script like this :10_monitor_too_many_connection.pl

===========================================================================================
#!/usr/bin/perl
use File::Basename;
use Cwd ‘abs_path’;
use warnings;
$filename = basename($0);
$filepath = abs_path($0);
$filepath =~s//$filename//;
#print “$filepath\n”;

$datetime=date '+%Y%m%d %H:%M:%S'; chomp $datetime;
$datetime2=date '+%Y%m%d'; chomp $datetime2;
$logfile=“$filepath/logs/too_many_conn_kill_mysql_process_$datetime2.log”;
$port=33333;
$time_limit=30; #超過30秒刪除

my $rtn=mysql -uroot --connect_timeout=5 -e "show status like 'Threads_connected'" 2>&1;
if($rtn=~/Too many connections/){
my @process=mysql -uroot --connect_timeout=5 -P $port -e "show processlist";
foreach $ps (@process){

if($ps=~/(\d+)\s+(.)\s+(\S+):(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(.)/){
$pid=$1;
$account=$2;
$ip=$3;
$port=$4;
$db=$5;
$status=$6;
$time=$7;
$other=$8;
if($account ne ‘root’ && $account ne ‘system user’ && $account ne ‘proxysql_monitor’ && $time >$time_limit ){

print “[ $datetime ] kill pid=$pid account=$account ip=$ip port=$port db=$db status=$status time=$time $other\n”;
system(“echo ‘[ $datetime ] kill pid=$pid account=$account ip=$ip port=$port db=$db status=$status time=$time $other’ >> $logfile”);
$rtn=mysql -uroot --connect_timeout=5 -P $port -e "kill $pid";

}#if

}#foreach
}else{
print “$rtn”;
}

create a crontab job for the script
*/1 * * * * /home/mysql/bin/10_monitor_too_many_connection.pl >/dev/null 2>&1

I don’t know if my script can resolve your problem. but you can try.

Thanks

steven_Lin · August 2, 2019, 1:16pm

I worte a script to handle “Too many connections” problem

vi /etc/my.cnf
add
extra_max_connections = 8
extra_port = 33333

Mulen · August 3, 2024, 2:08pm

Hey @rjensen . Is this fixed in latest versions of Percona 8? Have you manage to deal with this problem?

rjensen · August 5, 2024, 4:57am

I haven’t followed Percona for years. MySQL itself developed a very solid cluster, so we transitioned to that approach. Haven’t looked back.

Mulen · August 26, 2024, 9:24pm

Are you talking about Mysql Group Replication? What Mysql cluster do you use?

Thanks for the reply!

danielops · August 29, 2024, 1:34pm

We’re considering switching away from PXC to group replication as PXC proved to be way too sensitive to network latency and configuration tweaks.

This happened to us recently, in PXC 8.0.35 on kubernetes, connections gradually accumulating and suffocating the cluster completely. We suspect it’s because we set timeouts too high (peer_timeout, suspect_timeout and inactive_timeout) in an attempt to avoid another PXC issue, with frequent desyncing and syncing back of nodes.

It’s like if we set these values too low we see more desyncs. If we set them too high we risk the cluster completely suffocating itself. Feels like walking a tightrope over hell.

We’re considering MySQL InnoDB Cluster or Percona’s own group replication implementation for a somewhat less sensitive solution, but this thread sticks out like a sore thumb because there was never any response to rjensen’s post.

I’m wonder if all of the optimizations and patches Percona makes on top of MySQL’s solutions cause more problems than they solve for some of us.

Thanks

rjensen · August 29, 2024, 7:14pm

Yes, we transitioned to MySQL Group Replication years ago, and it has been essentially problem-free for us. We run dozens of three-node clusters spanning three M1000e chassis over 10GB copper, and we experience essentially zero latency in replication.

We use the provided router. The one problem that has emerged that is not yet fixed is that the router has a memory leak, so a heavily loaded cluster instance will crash the router now and then. However, under Kubernetes, that router pod immediately reboots, so there’s maybe two seconds of downtime. The solution is to run two or three router pods, which MySQL supports, so that if a particular router pod goes down, traffic immediately uses an up router. It’s irritating the the “wonderful folks at Oracle” can’t seem to ever manage to get to the bottom of this memory leak and fix it. But the work-around does work.

I wouldn’t recommend that approach under Docker, because Docker swarm has MASSIVE problems of its own, and we weren’t getting reliable container restarts under Docker swarm (but that’s a vast new topic of discussion not really fit for this forum). However, under Kubernetes, we have finally achieved EXTREMELY reliable DB clustering with MySQL group replication. Indeed, we can yank the power entirely out from under an entire chassis, and each MySQL cluster will instantly fail the primary over to the next available node. Once the chassis comes back up, MySQL will immediately rejoin each missing MySQL node back into the group. We have heavily tested this.

Our company serves dozens of colleges and universities across the USA and Canada, so this approach is heavily tested in production spanning years, and I can highly recommend it. Kubernetes and MySQL group replication cluster WILL get you an extremely reliable, self-healing DB cluster that for us has withstood various events that would have destroyed other approaches we’ve tried over the years. Highly recommended!

Mulen · August 29, 2024, 8:12pm

@danielops @rjensen I thought running MySQL on Kubernetes is not the optimal solution because you introduce another layer of abstraction and bring I/O latencies and processes running on top of other processes, instead of running MySQL baremetal or directly on a VM. Is there a chance Kubernetes is responsible for the problems that you have? Have you tried to run the PXC natively on VMs or Baremetal?

@rjensen I thought about using MySQL Router but wherever i look is not the optimal proxy solution and indeed has its quirks and problems. We will use ProxySQL. Is a fantastic proxy solution so far as i have tested. Does MySQL Group Replication needs MySQL router to work or i can use it with other proxy solutions like ProxySQL?

rjensen · August 29, 2024, 8:50pm

@Mulen Kubernetes is the closest thing to running on bare metal you can get, and is FAR superior to running any “VM” hypervisor! We’ve tried bare metal, VMs (numerous different hypervisors), Docker Swarm, and now Kubernetes over the years, and Kubernetes is by FAR the superior approach to HA clustering. The long and short is that the Percona problems span all host possibilities; we’ve tried them all. By contrast, MySQL cluster “just works” regardless of host approach.

We use containerization now for our overall clustering, because we get self-healing “workers” web-servers, and DB nodes. And Kubernetes is the superior solution to containerization in our experience.

We have not tried ProxySQL for MySQL cluster ourselves, so I can’t speak to that approach. MySQL Router has no “quirks” or “problems” we’re aware of (after years of using it) other than the memory leak. But Kubernetes makes even that one issue a non-issue for us.

I repeat that MySQL cluster on Kubernetes has proved to be THE solution for us (in production across dozens of production clusters) spanning years. This approach is extremely well-tested under HEAVY loads for us, and it is essentially bulletproof.

People can try whatever they want and hold out hope for Percona as long as they want, obviously. But I’m simply responding to the questions here like, “Did you ever get this working?” etc. And the answer to that sort of question is: “Yes, we have ‘it’ working wonderfully spanning years with MySQL cluster and Kubernetes.” We’ve tried many other approaches without reliability. So, I can only respond that the “it” that works for us has nothing to do with Percona.

Hopefully our (LONG!) experiences will help people “cut to the chase” and save countless person-hours of frustration and wee-hours cluster-rebuilds!

danielops · August 30, 2024, 1:21pm

Thank you @rjensen for very valuable information!

@Mulen we didn’t use PXC outside of kube, but I wouldn’t expect that to be an issue. If anything it actually helps in certain situations like rjensen described.

One example of a difference Percona make in their own implementations is how they replace MySQL Router with HAProxy for a default proxy which I think is based on the benchmarks they did according to which Router is the slowest. Isn’t there a trade off though? I mean, there is probably a reason why Oracle sticks to MySQL Router, and I wonder if there’s more to this story.

I used to trust Percona’s changes as truly being expert optimizations, and ever since our weekend crash and running into this thread I’m sort of more cautious about believing that, that’s all.

Mulen · August 30, 2024, 6:13pm

Thank you @rjensen and @danielops for your answers. I will definitely evaluate the MySQL Cluster. I have not used Kubernetes extensively. I know what it does and how does it. I have a fear with the host volumes when we use kubernetes. I do not seem to understand how you self heal a MySQL node if it fails in Kubernetes. Where are the data? Do you use long-horn? Let’s say a MySQL node fails, Kubernetes deploys another MySQL Node and does an import from a mysqldump and then joins the cluster? I do not know what is the best practice to approach this scenario.

Thank you again.

rjensen · August 30, 2024, 6:46pm

@Mulen The “self-healing” of Kubernetes pods is a function of specifying for Kubernetes how many pods of a particular type in a particular cluster it should ensure are up. You can specify host locations as well. So, in the case of a MySQL cluster, you can, for example, specify that Kubernetes must keep three DB nodes up and that each one should be located on a different specified host. That way, you can (as we do) ensure that you have each node located on a disparate chassis (in our case), which means that we can lose up to one entire chassis and have MySQL Cluster still running without interruption.

If the specified host goes down, Kubernetes continues to try to bring back up the missing MySQL node. Of course it fails as long as the host is down, because you specify that that pod/node MUST run on that host. But as soon as the host is up again, Kubernetes brings back up the MySQL pod, which “heals” the Kubernetes cluster. At that point, MySQL Cluster itself takes over the “healing” of the MySQL cluster.

As soon as MySQL Cluster recognizes that the missing node has rejoined the group, MySQL Cluster uses its “redo logs” (much like Oracle RAC) to resync the transactions with the recovered node. The node is not fully “rejoined” until its dataset is in sync with the rest of the cluster, at which point it again becomes a full member.

You can have more than three nodes in a MySQL cluster, so that you could, for example, lose two nodes out of five without going into split-brain, read-only mode. That’s overkill for our use-case, so we don’t bother. But in our use-case, if two nodes are down, then our cluster does go into split-brain, read-only mode. And then you have to go through a manual process to recover the MySQL cluster. It’s not odious, but it is manual. But we only had such problems when we ran on bare-metal, VMs, or Docker Swarm. Since moving to Kubernetes, we haven’t had to do that. Now we live in sighs of relief instead of curses and wasted hours!

Your points about the difference in performance between Router vs HAProxy are irrelevant to our use-cases. Perhaps in some use-cases some fine-grained performance difference might make “the difference,” but I’d have to dig into the details of how and why some particular metric would be the difference-maker, but it doesn’t matter for us. We experience fantastic performance out of our cluster! We do run only SSDs and cluster over 10GB copper. But certainly the Router isn’t any bottleneck.

Other degree audit systems that compete with ours either don’t even try to run live degree audits, or their performance is something like one audit per 30-seconds, or an audit per minute. By stark contrast, our degree audit instances run upwards of 30 audits per second. So, we’re getting better than “adequate” performance out of our MySQL cluster. We don’t feel the motivation to “dig deeper” into comparisons between Router and HAProxy. Again, some other use-case might be different. But we find Router’s performance to be just fine. And we thereby have a vanilla MySQL cluster that does nothing exotic.

There is definitely a learning curve to this approach! But the end result is by FAR the most solid and highest-performance clustering approach (including for the DB layer) of the MANY approaches we have tried over the past two decades. Oracle RAC is the gold standard, if you can afford it. We have brought up PeopleSoft from bare metal on Oracle RAC, and it is impressive. But “for the rest of us” that can’t pass along such exorbitant costs to our customers, the approach we are now using has comparable reliability to Oracle RAC in an open-source context. And that’s impressive in its own right!

IMO, there just isn’t a motivation for the Percona fork anymore. We’ve been there, done that, got the blood/sweat-stained t-shirt, and have no interest in ever going back.

Mulen · September 1, 2024, 12:06pm

Thank you so much for the this extensive reply.

I will evaluate MySQL Group Replication then, learn its proper way to setup, deploy and maintain the cluster and finally go down that road instead of using PXC or anything Galera related.

Again really appreciate for your answers on that topic.

Mulen · September 3, 2024, 9:04am

@rjensen Hi again.

One last question. Shall i use Percona with Group Replication or MySQL Community Edition with Group Replication? What would be better without causing us any issues?

rjensen · September 3, 2024, 2:20pm

We haven’t used Percona in years, so I can’t give you any insight into which would be “better” for you. I can only say that MySQL Cluster on Kubernetes has been the cat’s meow for us.

Topic		Replies	Views
One node gets all queries stuck and collapse due to too many connections Percona XtraDB Cluster 5.x	8	3361	December 17, 2017
MySQL connection issue Percona XtraDB Cluster 8.x	3	63	April 15, 2025
Too many deadlocks Percona XtraDB Cluster 5.x	1	621	April 9, 2018
Help for max_connections Percona XtraDB Cluster 5.x	3	881	December 14, 2018
Urgent Problems with Out Percona XtraDB Cluster Percona XtraDB Cluster 5.x	11	1508	August 13, 2018

Too many connections brings down cluster

}#foreach }else{ print “$rtn”; }

Related topics

}#foreach
}else{
print “$rtn”;
}