Hi
I have a 5-node PXC cluster: 2 nodes in DC1 and 3 nodes in DC2. I use GTM to manage the active-passive cluster setup.
Issue: Initially the app was pointed to Node1 in DC1, which timed out as seen from the other nodes. However, Node1 itself still showed wsrep_cluster_size 5 and wsrep_cluster_status Primary. Since the Percona service was not down, the app connections did not fail over to Node2 in DC1; instead the app failed with the error below:
mysql saveOrUpdateAuditTrailEntry: failed with Communications link failure
The last packet successfully received from the server was 2,100,003 milliseconds ago. The last packet sent successfully to the server was 2,100,003 milliseconds ago
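For context, here is a minimal sketch of the kind of node check a GTM monitor could run instead of only testing that the service is up. It assumes the monitor can read the output of SHOW GLOBAL STATUS LIKE 'wsrep_%' as (name, value) rows; the function names are hypothetical. Note that Node1 here still reported Primary with cluster size 5, so a status-only check like this may not have caught the problem by itself, and a probe query with a short timeout would probably be needed as well.

```python
# Hypothetical helper names; the wsrep_* status variables are the ones
# exposed by Galera / PXC in SHOW GLOBAL STATUS.

def rows_to_status(rows):
    """Turn (variable_name, value) rows into a dict for easy lookup."""
    return {name: value for name, value in rows}


def node_accepts_writes(status):
    """Return True only if every wsrep indicator says the node is usable."""
    return (
        status.get("wsrep_cluster_status") == "Primary"    # part of the primary component
        and status.get("wsrep_connected") == "ON"          # connected to the cluster
        and status.get("wsrep_ready") == "ON"              # able to accept queries
        and status.get("wsrep_local_state_comment") == "Synced"
    )


# Example rows as they might come back from a healthy node.
healthy = rows_to_status([
    ("wsrep_cluster_status", "Primary"),
    ("wsrep_connected", "ON"),
    ("wsrep_ready", "ON"),
    ("wsrep_local_state_comment", "Synced"),
    ("wsrep_cluster_size", "5"),
])

# Same node while it is acting as an SST donor.
donor = dict(healthy, wsrep_local_state_comment="Donor/Desynced")
```

With a check like this, the monitor would mark a node down when it is non-Primary, not ready, or desynced, even though the mysqld process is still running.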
As part of solving this:
1. I stopped Node1, expecting the apps to be routed to Node2, but only reads were working, not writes, and the app was experiencing slowness.
2. I then restarted Node2, but the service stayed down.
3. I could not route the application to DC2 because other applications (not MySQL) were not ready for the failover.
4. To resolve this, I stopped the services on all DC2 machines and bootstrapped Node2 in DC1 while all the other nodes in the cluster were down.
5. Node2 came up, applications were able to read from and write to the DB, and the application had no outage.
6. At this point the cluster was running with one node. I later joined Node1 in DC1 with a clean SST.
7. Now that the 2 nodes in DC1 were up, I joined Node1 (DC2) with wsrep_cluster_address="gcomm://Node2(DC1)". It has been almost 3 hours and the SST streaming is still not complete.
8. After step 7 completes, I need to join Node2 and Node3 (DC2) to the cluster.
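As background to the DC1-only bootstrap described above, this is a rough sketch of the Galera quorum rule as I understand it: a group of nodes stays Primary only if it holds strictly more than half of the last known primary component. It assumes every node has the default weight of 1, and the function name is hypothetical.

```python
def has_quorum(reachable_nodes, last_primary_size):
    """True if the reachable nodes hold a strict majority.

    Assumes equal node weights. Nodes that leave gracefully shrink the
    membership, so last_primary_size is the size before an unclean loss.
    """
    return 2 * reachable_nodes > last_primary_size


# 5-node cluster split along the DC boundary (2 in DC1, 3 in DC2).
dc1_alone = has_quorum(2, 5)   # DC1 side cannot stay Primary
dc2_alone = has_quorum(3, 5)   # DC2 side keeps the majority
```

With this layout, DC1 cannot keep the cluster Primary without DC2 after an unclean split, which is why a bootstrap is required to run from DC1 alone.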
With the above steps, when a single node is unresponsive, the entire cluster is affected and recovery takes a long time. Could you please let me know the optimal way to recover in my case?