LockOwner after partitioning the node where the Operator is running (Kubernetes)

I have the following problem. I have installed the operator and the cluster on Kubernetes. When I simulate network partitioning and isolate the node where the operator is running, the operator pod goes into a Terminating state without releasing the lock. In the logs of the new replica I find the message: “Found existing lock”, “LockOwner”: “”. Can anyone help me, please?

Please provide the logs and all commands used to create this scenario so that we can reproduce it.

I have a Kubernetes cluster composed of 4 nodes (3 redundant master nodes and 1 worker node). Their IP addresses are:

  • node1 192.168.1.180;
  • node2 192.168.1.102;
  • node3 192.168.1.183;
  • node4 192.168.1.142 (worker).

I followed this guide to install PXC: Install Percona XtraDB Cluster on Kubernetes.

NAME                                               READY   STATUS        RESTARTS         AGE   IP           NODE      
cluster1-haproxy-0                                 2/2     Running       0                39m   10.42.1.4    server4   
cluster1-haproxy-1                                 2/2     Running       0                32m   10.42.0.14   server1   
cluster1-haproxy-2                                 2/2     Running       0                31m   10.42.3.6    server3   
cluster1-pxc-0                                     3/3     Running       0                39m   10.42.3.5    server3   
cluster1-pxc-1                                     3/3     Running       0                33m   10.42.2.6    server2   
cluster1-pxc-2                                     3/3     Running       0                29m   10.42.1.6    server4   
percona-xtradb-cluster-operator-566848cf48-4tgg4   1/1     Running       0                43m   10.42.2.4    server2
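
For reference, the listing above comes from a wide pod listing; something like the following should reproduce it (replace <namespace> with the namespace where the operator and cluster were deployed):

# show pods together with their IPs and the nodes they run on
kubectl get pods -o wide -n <namespace>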

Once everything is in Running status, and since the operator is running on node2, I executed the following commands from node2 in order to simulate a network partition (I isolated the node where the operator is running):

sudo iptables -A OUTPUT -j DROP -d ${NODE_1}
sudo iptables -A INPUT -j DROP -s ${NODE_1}
sudo iptables -A OUTPUT -j DROP -d ${NODE_3}
sudo iptables -A INPUT -j DROP -s ${NODE_3}
sudo iptables -A OUTPUT -j DROP -d ${NODE_4}
sudo iptables -A INPUT -j DROP -s ${NODE_4}
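
For completeness, here is a sketch of the variable definitions (assumed to match the node IPs listed above) and of how the same rules can be deleted later to undo the partition:

# assumed variable definitions, matching the node IPs listed earlier
NODE_1=192.168.1.180
NODE_3=192.168.1.183
NODE_4=192.168.1.142

# to heal the partition afterwards, delete the matching rules (-D instead of -A)
sudo iptables -D OUTPUT -j DROP -d ${NODE_1}
sudo iptables -D INPUT -j DROP -s ${NODE_1}
sudo iptables -D OUTPUT -j DROP -d ${NODE_3}
sudo iptables -D INPUT -j DROP -s ${NODE_3}
sudo iptables -D OUTPUT -j DROP -d ${NODE_4}
sudo iptables -D INPUT -j DROP -s ${NODE_4}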

Now the scenario is:

NAME                                               READY   STATUS        RESTARTS         AGE   IP           NODE      
cluster1-haproxy-0                                 2/2     Running       0                39m   10.42.1.4    server4   
cluster1-haproxy-1                                 2/2     Running       0                32m   10.42.0.14   server1   
cluster1-haproxy-2                                 2/2     Running       0                31m   10.42.3.6    server3   
cluster1-pxc-0                                     3/3     Running       0                39m   10.42.3.5    server3   
cluster1-pxc-1                                     3/3     Terminating   0                33m   10.42.2.6    server2   
cluster1-pxc-2                                     3/3     Running       0                29m   10.42.1.6    server4   
percona-xtradb-cluster-operator-566848cf48-4tgg4   1/1     Terminating   0                43m   10.42.2.4    server2   
percona-xtradb-cluster-operator-566848cf48-mc4w8   1/1     CrashLoopBackOff 6 (10s ago)   19m   10.42.0.15   server1
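
As a side note, at this point the isolated node is typically reported as NotReady by the control plane, which is why the pods on server2 end up stuck in Terminating; this can be checked with a node listing, for example:

# check node conditions from one of the non-isolated master nodes
kubectl get nodes -o wide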

Here is the problem. The new replica of the operator (percona-xtradb-cluster-operator-566848cf48-mc4w8) shows the following logs; the lock owner is the previous pod, percona-xtradb-cluster-operator-566848cf48-4tgg4:

{"level":"info","ts":1650868726.2155206,"logger":"cmd","msg":"Runs on","platform":"kubernetes","version":"v1.22.7+k3s1"}
{"level":"info","ts":1650868726.2157087,"logger":"cmd","msg":"Git commit: 038082365e4e94cfdda40a20ce1b53fc098e5efb Git branch: release-1-10-0 Build time: 2021-11-17T16:46:03Z"}
{"level":"info","ts":1650868726.2157269,"logger":"cmd","msg":"Go Version: go1.17.3"}
{"level":"info","ts":1650868726.2157404,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1650868726.2157545,"logger":"cmd","msg":"operator-sdk Version: v0.19.4"}
{"level":"info","ts":1650868726.21606,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1650868727.059792,"logger":"leader","msg":"Found existing lock","LockOwner":"percona-xtradb-cluster-operator-566848cf48-4tgg4"}
{"level":"info","ts":1650868727.233043,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868728.4153585,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868730.8620806,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868735.4514487,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868744.2399726,"logger":"leader","msg":"Not the leader. Waiting."}

Is there a way to get the lock released, so that everything continues to work even in the event of this kind of network partition?
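
For reference, here is a sketch of how I understand the lock and how it could be inspected or force-released, assuming the operator uses the operator-sdk “leader for life” mechanism (a ConfigMap owned by the leader pod). The ConfigMap name and namespace below are placeholders, not confirmed values:

# list the ConfigMaps in the operator's namespace to find the leader-election lock
kubectl get configmaps -n <namespace>

# the lock ConfigMap is owned by the old leader pod; force-removing that pod from the API
# should let Kubernetes garbage-collect the ConfigMap so the new replica can take over
kubectl delete pod percona-xtradb-cluster-operator-566848cf48-4tgg4 -n <namespace> --force --grace-period=0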

By running those iptables commands, you are also blocking Kubernetes traffic, which is what is causing this lock issue. This is not an issue with our Operator, but with your K8S setup. I suggest you reach out on the various K8S forums regarding this. A quick Google search for “kubernetes Trying to become the leader” reveals many posts on this issue. Good luck!
