LockOwner after partitioning the node where the Operator is running (Kubernetes)

I have the following problem. I have installed the operator and the cluster on Kubernetes. When I simulate network partitioning and isolate the node where the operator is running, the operator pod goes into a Terminating state without releasing the lock. In the logs of the new replica I find the message: “Found existing lock”, “LockOwner”: “”. Can anyone help me, please?

Please provide the logs and all commands used to create this scenario so that we can reproduce it.

I have a Kubernetes cluster composed of 4 nodes (3 redundant master nodes and 1 worker node). Their IP addresses are:

  • node1 192.168.1.180;
  • node2 192.168.1.102;
  • node3 192.168.1.183;
  • node4 192.168.1.142 (worker).

I followed this guide to install PXC: Install Percona XtraDB Cluster on Kubernetes.

NAME                                               READY   STATUS        RESTARTS         AGE   IP           NODE      
cluster1-haproxy-0                                 2/2     Running       0                39m   10.42.1.4    server4   
cluster1-haproxy-1                                 2/2     Running       0                32m   10.42.0.14   server1   
cluster1-haproxy-2                                 2/2     Running       0                31m   10.42.3.6    server3   
cluster1-pxc-0                                     3/3     Running       0                39m   10.42.3.5    server3   
cluster1-pxc-1                                     3/3     Running       0                33m   10.42.2.6    server2   
cluster1-pxc-2                                     3/3     Running       0                29m   10.42.1.6    server4   
percona-xtradb-cluster-operator-566848cf48-4tgg4   1/1     Running       0                43m   10.42.2.4    server2
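
For reference, the listing above comes from a wide pod listing; something like the following should reproduce it (replace <namespace> with the namespace where the operator and cluster were deployed):

# show pods together with their IPs and the nodes they run on
kubectl get pods -o wide -n <namespace>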

Once everything is in Running status, and since the operator is running on node2, I executed the following commands from node2 in order to simulate a network partition (I isolated the node where the operator is running):

sudo iptables -A OUTPUT -j DROP -d ${NODE_1}
sudo iptables -A INPUT -j DROP -s ${NODE_1}
sudo iptables -A OUTPUT -j DROP -d ${NODE_3}
sudo iptables -A INPUT -j DROP -s ${NODE_3}
sudo iptables -A OUTPUT -j DROP -d ${NODE_4}
sudo iptables -A INPUT -j DROP -s ${NODE_4}
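
For completeness, here is a sketch of the variable definitions (assumed to match the node IPs listed above) and of how the same rules can be deleted later to undo the partition:

# assumed variable definitions, matching the node IPs listed earlier
NODE_1=192.168.1.180
NODE_3=192.168.1.183
NODE_4=192.168.1.142

# to heal the partition afterwards, delete the matching rules (-D instead of -A)
sudo iptables -D OUTPUT -j DROP -d ${NODE_1}
sudo iptables -D INPUT -j DROP -s ${NODE_1}
sudo iptables -D OUTPUT -j DROP -d ${NODE_3}
sudo iptables -D INPUT -j DROP -s ${NODE_3}
sudo iptables -D OUTPUT -j DROP -d ${NODE_4}
sudo iptables -D INPUT -j DROP -s ${NODE_4}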

Now the scenario is:

NAME                                               READY   STATUS        RESTARTS         AGE   IP           NODE      
cluster1-haproxy-0                                 2/2     Running       0                39m   10.42.1.4    server4   
cluster1-haproxy-1                                 2/2     Running       0                32m   10.42.0.14   server1   
cluster1-haproxy-2                                 2/2     Running       0                31m   10.42.3.6    server3   
cluster1-pxc-0                                     3/3     Running       0                39m   10.42.3.5    server3   
cluster1-pxc-1                                     3/3     Terminating   0                33m   10.42.2.6    server2   
cluster1-pxc-2                                     3/3     Running       0                29m   10.42.1.6    server4   
percona-xtradb-cluster-operator-566848cf48-4tgg4   1/1     Terminating   0                43m   10.42.2.4    server2   
percona-xtradb-cluster-operator-566848cf48-mc4w8   1/1     CrashLoopBackOff 6 (10s ago)   19m   10.42.0.15   server1
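
As a side note, at this point the isolated node is typically reported as NotReady by the control plane, which is why the pods on server2 end up stuck in Terminating; this can be checked with a node listing, for example:

# check node conditions from one of the non-isolated master nodes
kubectl get nodes -o wide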

Here is the problem. The new replica of the operator (percona-xtradb-cluster-operator-566848cf48-mc4w8) shows the following logs; the lock owner is the previous pod, percona-xtradb-cluster-operator-566848cf48-4tgg4:

{"level":"info","ts":1650868726.2155206,"logger":"cmd","msg":"Runs on","platform":"kubernetes","version":"v1.22.7+k3s1"}
{"level":"info","ts":1650868726.2157087,"logger":"cmd","msg":"Git commit: 038082365e4e94cfdda40a20ce1b53fc098e5efb Git branch: release-1-10-0 Build time: 2021-11-17T16:46:03Z"}
{"level":"info","ts":1650868726.2157269,"logger":"cmd","msg":"Go Version: go1.17.3"}
{"level":"info","ts":1650868726.2157404,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1650868726.2157545,"logger":"cmd","msg":"operator-sdk Version: v0.19.4"}
{"level":"info","ts":1650868726.21606,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1650868727.059792,"logger":"leader","msg":"Found existing lock","LockOwner":"percona-xtradb-cluster-operator-566848cf48-4tgg4"}
{"level":"info","ts":1650868727.233043,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868728.4153585,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868730.8620806,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868735.4514487,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868744.2399726,"logger":"leader","msg":"Not the leader. Waiting."}

Is there a way to get the lock released, so that everything continues to work even in the event of this kind of network partition?
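
For reference, here is a sketch of how I understand the lock and how it could be inspected or force-released, assuming the operator uses the operator-sdk “leader for life” mechanism (a ConfigMap owned by the leader pod). The ConfigMap name and namespace below are placeholders, not confirmed values:

# list the ConfigMaps in the operator's namespace to find the leader-election lock
kubectl get configmaps -n <namespace>

# the lock ConfigMap is owned by the old leader pod; force-removing that pod from the API
# should let Kubernetes garbage-collect the ConfigMap so the new replica can take over
kubectl delete pod percona-xtradb-cluster-operator-566848cf48-4tgg4 -n <namespace> --force --grace-period=0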

By running those iptables commands, you are also blocking Kubernetes traffic, which is what is causing this lock issue. This is not an issue with our Operator, but with your K8S setup. I suggest you reach out on the various K8S forums regarding this. A quick Google search for “kubernetes Trying to become the leader” reveals many posts on this issue. Good luck!
