So, this thread has existed for over three years! In all that time, the Percona engineers have basically responded: “We don’t have enough information to help you.” But the subject of this thread is a KNOWN problem, many people have run into it, and there appears to be NO solution.
For our part, we have started using ProxySQL in an attempt to GUARANTEE that all transactions land on the same cluster node, and that “helped,” but the problem still occurs now and then. Just last night, in the wee hours, our production three-node cluster locked up, would not handle any more queries, and was completely unresponsive to our app. The fact that this can suddenly happen AT ALL is, in itself, a deal-killer. We’ve already invested far too much time (years!) trying to figure out HOW the cluster can possibly get itself into this state, and I find it hard to believe that the Percona engineers have NEVER encountered this issue in their own testing.
For them not to have discovered this issue themselves (and FIXED it years ago), one of two things must be the case, and I don’t know which: either they are testing on a simple, non-production, “insignificant” environment that does not model real-world usage AT ALL, or they are not motivated to fix the problem because the only way people pay for “support” is when they are desperate and need on-the-phone, help-me-right-now communication with the Percona team.
If this seems provocative, I intend it to be. It is OUTRAGEOUS that a problem of this magnitude can still exist after this many years, and the Percona team apparently doesn’t take seriously how devastating it is to have a production environment suddenly lock up and affect customers while we frantically “bootstrap” the cluster back into existence, stamping our feet in frustration through hours of resyncing. There is NO excuse for the fact that this is a KNOWN problem and the Percona team has not DEVOTED itself to replicating the issue and then fixing it! I am POSITIVE that this issue can be replicated, and Percona should be devoting themselves to doing exactly that! Yet YEARS go by, and Percona still does not seriously acknowledge that this even IS a deal-breaking issue!
The fact is that Percona’s cluster is NOT “ready for prime time.” The master/master approach (which is the whole reason you’d bother with the hassles of a cluster in the first place) is simply NOT reliable, and we’ve spent years “patching it up” with the likes of ProxySQL and our own custom scripts. ALL we’ve been able to accomplish is to put off the moment when the cluster WILL crash.
The nature of the “crash” itself seems to be that, quite suddenly, the nodes cannot sync, because they no longer have the connections available to reach one another! So this “out of connections” error is not just a “symptom.” It points to something fundamentally broken in the flow of traffic between the nodes, such that (always very suddenly) they can no longer communicate among themselves. And in this state you cannot simply restart MySQL on one node at a time to “clear” the connections. Once the cluster is in this state, even a MySQL restart only leaves the restarted node hanging there, unable to resync with the other nodes. The connections needed to do so are GONE.
What is needed (apart from Percona’s team tracking down how the problem can occur in the first place) is a setting to ensure that some proportion of available connections is ALWAYS reserved for inter-node communication, so that syncing can ALWAYS occur, no matter what. No matter how “badly” an application might be written, there is NO EXCUSE for the cluster getting ITSELF into a state where it cannot even communicate among its own nodes!
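As far as I know there is no Galera-level knob that does exactly that, but Percona Server does ship extra_port and extra_max_connections, which reserve a separate listener and a handful of connections for emergency DBA access once max_connections is exhausted. It does nothing for the replication channel itself, but it at least leaves you a way into a wedged node. A minimal my.cnf sketch (the port number here is arbitrary, pick your own):

[mysqld]
# Reserve a separate listener plus a few extra connections so a DBA can
# still log in and inspect a node after max_connections is exhausted.
# NOTE: this guards admin access only, not Galera's own replication
# traffic, which runs over its separate group-communication channel.
extra_port            = 33333
extra_max_connections = 10

That is a band-aid, not a fix, which is exactly my point.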
In addition, Percona should be logging its core state, including any internal variables that could indicate it is “in trouble,” whatever that might mean. That way, with Nagios/Icinga or some other monitoring service, you could at least detect that you had better intervene and restart nodes BEFORE regaining control of the cluster means a full bootstrap!
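Some of that state is already exposed through the wsrep_% status variables, and the best stopgap we know of is to have the monitoring system poll a handful of them and page on anything abnormal; roughly something like:

-- Healthy output: cluster_status = Primary, state = Synced, ready and
-- connected = ON, a receive queue near zero, flow control paused near 0.0.
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_status', 'wsrep_local_state_comment', 'wsrep_ready',
   'wsrep_connected', 'wsrep_local_recv_queue', 'wsrep_flow_control_paused');

But that is monitoring bolted on from the outside; it tells you THAT a node is unhealthy, not WHY the cluster talked itself into that corner.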
What does “in trouble” mean? I don’t know, but Percona’s engineers SHOULD! Again, this thread is three years old, and Percona is still not taking the issue seriously. If MY application had a problem this severe, MY team would be working night and day until we had tracked down how it could EVER happen, and it would have been fixed long before THREE YEARS had passed!
So, for anybody encountering this problem (and finding this thread in the hope of a solution), I’ll tell you what our “solution” is about to be: we give up on Percona. We’re angry, and this very thread made us much angrier! We’ll be moving to PostgreSQL and meanwhile investigating MySQL’s own cluster (which a couple of years ago had not seemed ready for prime time; perhaps it’s better now). But, seriously, Percona is NOT a production-ready “MySQL cluster,” and the Percona team obviously cannot be bothered to track this FUNDAMENTAL problem down on their own and fix it. The response repeated throughout this thread, “We don’t have enough information,” is unbelievably LAME!
To the Percona team: most of the people on this thread have simply moved on (as we’re about to do). You’re not even asking the right questions here. It falls to YOU to replicate this problem and FIX it. Meanwhile, users like us, who were pleased enough with the promise of Percona’s cluster, have devoted COUNTLESS hours to “patching up” the underlying issue, and we’re done with it. Your dismissive attitude on this thread is maddening, and we will no longer keep papering over a problem that you refuse to acknowledge, track down, and FIX after more than three years.

If you were serious about having a production-ready cluster, you’d be contacting people like us, hat in hand (rather than wanting to CHARGE for the information YOU need), asking for a screen-sharing session in which you could review our setup in GREAT detail and finally HAVE the information you keep saying you need. We’d be happy to walk you through our setup, and I believe you’d be impressed. But in this thread you clearly indicate that you can’t be bothered, and that is, flatly, ridiculous!