Lost connection after member synchronized with group

Description:

We are using the Percona Operator for MySQL based on Percona XtraDB Cluster. When pxc pods get evicted and rescheduled onto a different node in the k8s cluster, we see a 2-5 minute window in which queries fail with "Lost connection to server during query". This happens after the server has successfully joined the group, so the readiness probes succeed and traffic starts getting routed to the pod:

Server successfully recovers and joins the group (log lines are newest first):

2024-07-22T01:52:14.647834Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-07-22T01:52:14.647828Z 2 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2024-07-22T01:52:14.647822Z 2 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2024-07-22T01:52:14.647805Z 2 [Note] [MY-000000] [Galera] Server pld-mysql-percona-pxc-2 synced with group
2024-07-22T01:52:14.647785Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 1124498)
2024-07-22T01:52:14.647778Z 0 [Note] [MY-000000] [Galera] Processing event queue:...100.0% (1/1 events) complete.
2024-07-22T01:52:14.647759Z 0 [Note] [MY-000000] [Galera] Member 0.0 (pld-mysql-percona-pxc-2) synced with group.
2024-07-22T01:52:14.646562Z 0 [Note] [MY-000000] [Galera] Processing event queue:... -nan% (0/0 events) complete.
2024-07-22T01:52:14.646532Z 0 [Note] [MY-000000] [Galera] Shifting JOINER -> JOINED (TO: 1124498)
2024-07-22T01:52:14.646523Z 0 [Note] [MY-000000] [Galera] SST leaving flow control
2024-07-22T01:52:14.646498Z 0 [Note] [MY-000000] [Galera] 0.0 (pld-mysql-percona-pxc-2): State transfer from 2.0 (pld-mysql-percona-pxc-1) complete.
2024-07-22T01:52:14.645141Z 2 [Note] [MY-000000] [Galera] Min available from gcache for CC from sst: 949090
2024-07-22T01:52:14.645132Z 2 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from sst: 1124460
2024-07-22T01:52:14.645111Z 2 [Note] [MY-000000] [Galera] Recording CC from sst: 1124498
2024-07-22T01:52:14.643046Z 2 [Note] [MY-000000] [Galera] IST received: 5cdf3b79-2e2d-11ef-bf5a-0fa3b79d49e7:1124498
2024-07-22T01:52:14.640955Z 2 [Note] [MY-000000] [Galera] Draining apply monitors after IST up to 1124498
2024-07-22T01:52:14.637127Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-07-22T01:52:14.637119Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-07-22T01:52:14.637111Z 2 [Note] [MY-000000] [WSREP] Server status change initialized -> joined
...

Client trying to query the DB:

22/07/2024 01:52:57.318 File "/usr/local/lib/python3.11/site-packages/MySQLdb/cursors.py", line 319, in _query
db.query(q)
File "/usr/local/lib/python3.11/site-packages/MySQLdb/connections.py", line 254, in query
_mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (2013, 'Lost connection to server during query')
[SQL: SET SESSION MAX_EXECUTION_TIME = 305000;]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

A few minutes later we no longer observe these errors, and queries execute successfully on all pods, including the new one.
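
To put the two excerpts above on one timeline, here is a small sketch that computes the gap between the "Synchronized with group, ready for connections" log line and the client error. It assumes both timestamps are in the same timezone, which I have not verified for the client log; the function name is just for illustration.

```python
from datetime import datetime

# Timestamp layouts as they appear in the mysqld log and in the client log.
SERVER_FMT = "%Y-%m-%dT%H:%M:%S.%fZ"
CLIENT_FMT = "%d/%m/%Y %H:%M:%S.%f"


def seconds_between(server_ts: str, client_ts: str) -> float:
    """Seconds from the server log timestamp to the client log timestamp.

    Assumption: both logs use the same timezone.
    """
    synced = datetime.strptime(server_ts, SERVER_FMT)
    failed = datetime.strptime(client_ts, CLIENT_FMT)
    return (failed - synced).total_seconds()


# "Synchronized with group, ready for connections" vs. the client traceback.
gap = seconds_between("2024-07-22T01:52:14.647828Z", "22/07/2024 01:52:57.318")
print(f"{gap:.3f}")  # prints 42.670
```

Under that assumption, the failing query was issued roughly 43 seconds after the node reported itself synced.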

Steps to Reproduce:

Install operator
Create manifests for a 3 node pxc cluster
Evict a pod and destroy its node
Query the DB when the new pod has joined the group
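
For the last step, a probe loop along these lines can be used to see exactly when queries start and stop failing. This is only a sketch: `probe` and `run_query` are names made up for illustration, and `run_query` stands for whatever callable opens a connection and runs a trivial statement in your setup.

```python
import time
from typing import Callable, List, Tuple


def probe(
    run_query: Callable[[], object],
    attempts: int,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, str]]:
    """Call run_query repeatedly and record the outcome of each attempt.

    Returns a list of (attempt_index, outcome) pairs, where outcome is
    "ok" or "error: <message>". Exceptions are recorded, not re-raised,
    so the loop keeps going through the failure window.
    """
    results: List[Tuple[int, str]] = []
    for i in range(attempts):
        try:
            run_query()
            results.append((i, "ok"))
        except Exception as exc:  # e.g. MySQLdb.OperationalError 2013
            results.append((i, f"error: {exc}"))
        sleep(interval_s)
    return results
```

With MySQLdb, `run_query` could be a function that opens a fresh connection to the pod or service under test and executes `SELECT 1`. Opening a new connection on every attempt matters here, since a pooled connection may still point at an older pod.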

Version:

pxc-operator 1.14.0

Expected Result:

Queries should return successfully. Requests shouldn’t be routed to a pod until it is actually ready to serve them.

Actual Result:

Queries fail since the new pod is not really ready to accept connections

Hi @Itiel_Olenick, do you connect to DB via HaProxy?

Mostly yes, but we also observed this behaviour when connecting directly to the pxc k8s service.

@Slava_Sarzhan Any ideas as to where I can start debugging this and/or find a solution?