PXC 8.0 on EKS with operator: writes hang when one node leaves the cluster

Hi, I have already opened issue PXC-4524 for what seems to be a serious bug that we found and reproduced in PXC 8.0.
All the steps to reproduce it are written in the issue.
I am also posting here to get help and feedback about it.

Thanks

There are no log files attached. It’s hard to diagnose the problem without seeing the error logs from the PXC nodes.

Hi @matthewb, I have all the logs you need. On issue PXC-4524 I only inserted the show processlist output because I can’t find a way to upload files and the logs exceed the comment size limit, but I can post them here.
show_global_status.txt (16.7 KB)
show_innodb_status.txt (37.2 KB)
show_process_list.txt (4.8 KB)

Hello @Antonio_Falzarano, we really need the logs. The 3 items you uploaded do not help diagnose the issue. Please remove log contents for dates prior to this issue, compress them, then upload to the ticket or here. Don’t put them as a comment.
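
For anyone preparing logs this way, here is a minimal sketch of the trim-and-compress step in Python. It assumes each log entry starts with a UTC timestamp in the format seen in MySQL 8.0 error logs (for example 2024-10-09T08:41:05.090211Z) and that lines without a timestamp are continuations of the previous entry. The function names `trim_log` and `trim_and_compress` are hypothetical helpers, not part of any Percona tool.

```python
import gzip
from datetime import datetime


def trim_log(lines, cutoff):
    """Yield only the log lines whose entry is stamped at or after `cutoff`."""
    keep = False
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            # Assumed timestamp format: 2024-10-09T08:41:05.090211Z
            ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
            keep = ts >= cutoff
        except ValueError:
            # No timestamp: treat as a continuation of the previous entry.
            pass
        if keep:
            yield line


def trim_and_compress(src_path, dst_path, cutoff):
    """Write the trimmed log from src_path to a gzip file at dst_path."""
    with open(src_path, "r", errors="replace") as src, \
            gzip.open(dst_path, "wt") as dst:
        dst.writelines(trim_log(src, cutoff))
```

A call such as `trim_and_compress("mysqld-error.log", "mysqld-error.log.gz", datetime(2024, 10, 9))` would keep only entries from that date onward; the file names here are placeholders.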

Hi @matthewb, I am putting the logs here because I can’t find a way to upload them on Atlassian.
Another thing: the forum doesn’t allow compressed file extensions, so I renamed the archive and added .txt.

db6-crash.tgz.txt (53.3 KB)

I see some issues:

Too many connections

You need to increase max_connections or reduce the max number of frontend connections.

2024-10-09T08:41:05.090211Z 9 [ERROR] [MY-010584] [Repl] Replica SQL: Error ‘Table ‘sbtest3’ already exists’ on query. Default database: ‘sbtest’. Query: ‘CREATE TABLE sbtest3( id INTEGER NOT NULL AUTO_INCREMENT, k INTEGER DEFAULT ‘0’ NOT NULL, c CHAR(120) DEFAULT ‘’ NOT NULL, pad CHAR(60) DEFAULT ‘’ NOT NULL, PRIMARY KEY (id) ) /*! ENGINE = innodb */’, Error_code: MY-001050

This caused a cluster-wide revote of membership and may have caused some nodes to drop out and re-SST. Try using CREATE TABLE IF NOT EXISTS. The error says ‘Replica SQL’; are you using async replication somewhere? If so, it looks like replication was connected to another node and then to this node, which replayed the binlog contents. If you are using async replication, only one member of the PXC cluster should be handling it.
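
To illustrate the difference IF NOT EXISTS makes, here is a small runnable sketch. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs without a cluster; it is not PXC and says nothing about replication behaviour. The simplified two-column table is an assumption for brevity.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for demonstration.
conn = sqlite3.connect(":memory:")
ddl = "CREATE TABLE sbtest3 (id INTEGER PRIMARY KEY, k INTEGER NOT NULL DEFAULT 0)"
conn.execute(ddl)

# Repeating the plain CREATE TABLE fails because the table already exists.
try:
    conn.execute(ddl)
    plain_error = None
except sqlite3.OperationalError as exc:
    plain_error = str(exc)

# The IF NOT EXISTS form is a no-op when the table is already there.
conn.execute(ddl.replace("CREATE TABLE", "CREATE TABLE IF NOT EXISTS"))

tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(plain_error, tables)
```

In MySQL the same clause turns the duplicate-table error into a warning, so a replayed CREATE TABLE statement would not stop the applier with error 1050.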

Hi @matthewb, the too many connections error is a consequence of the bug: the writer PXC node hangs on the DDL queries, and they pile up until they reach the max connections limit.

About the replica: yes, we have one too, but you can ignore it because I have also reproduced the bug without it. I gave you the test case logs with the replica attached because that was the first case I found, but I also have logs without it. In fact, the configuration to reproduce the bug that I shared with you has no replica at all.

Yes, please provide all of the exact steps to reproduce the issue without async replication. Include in your steps setting up EKS, installing the operator, launching the cluster, pod status, etc. We need to reproduce it exactly as you did, so don’t leave out any details. Please put all these steps into the JIRA ticket, as JIRA and these forums are not linked.