I have the following problem. I have installed the operator and a cluster on Kubernetes. When I simulate a network partition that isolates the node where the operator is running, the operator pod goes into a Terminating state without releasing the lock. In the logs of the new replica I find the message: “Found existing lock”, “LockOwner”: “”. Can anyone help me, please?
Please provide the logs and all commands used to create this scenario so that we can reproduce it.
I have a Kubernetes cluster composed of 4 nodes (3 redundant master nodes and 1 worker node). Their IP addresses are:
- node1 192.168.1.180;
- node2 192.168.1.102;
- node3 192.168.1.183;
- node4 192.168.1.142 (worker).
I followed this guide to install PXC (Install Percona XtraDB Cluster on Kubernetes).
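In short, the steps from that guide are roughly the following (paths and tag as given in the 1.10.0 quickstart; adjust for other versions):

# clone the operator repository (tag matching the release-1-10-0 build shown in the logs below)
git clone -b v1.10.0 https://github.com/percona/percona-xtradb-cluster-operator
cd percona-xtradb-cluster-operator
# install the CRDs, RBAC objects and the operator deployment
kubectl apply -f deploy/bundle.yaml
# create the PerconaXtraDBCluster custom resource (cluster1)
kubectl apply -f deploy/cr.yaml

Once everything was deployed, the pods looked like this (kubectl get pods -o wide):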
NAME                                               READY   STATUS    RESTARTS   AGE   IP           NODE
cluster1-haproxy-0                                 2/2     Running   0          39m   10.42.1.4    server4
cluster1-haproxy-1                                 2/2     Running   0          32m   10.42.0.14   server1
cluster1-haproxy-2                                 2/2     Running   0          31m   10.42.3.6    server3
cluster1-pxc-0                                     3/3     Running   0          39m   10.42.3.5    server3
cluster1-pxc-1                                     3/3     Running   0          33m   10.42.2.6    server2
cluster1-pxc-2                                     3/3     Running   0          29m   10.42.1.6    server4
percona-xtradb-cluster-operator-566848cf48-4tgg4   1/1     Running   0          43m   10.42.2.4    server2
Once everything was in Running status, and since the operator was running on node2, I executed the following commands from node2 in order to simulate a network partition (I isolated the node where the operator is running):
sudo iptables -A OUTPUT -j DROP -d ${NODE_1}
sudo iptables -A INPUT -j DROP -s ${NODE_1}
sudo iptables -A OUTPUT -j DROP -d ${NODE_3}
sudo iptables -A INPUT -j DROP -s ${NODE_3}
sudo iptables -A OUTPUT -j DROP -d ${NODE_4}
sudo iptables -A INPUT -j DROP -s ${NODE_4}
Now the scenario is:
NAME                                               READY   STATUS             RESTARTS      AGE   IP           NODE
cluster1-haproxy-0                                 2/2     Running            0             39m   10.42.1.4    server4
cluster1-haproxy-1                                 2/2     Running            0             32m   10.42.0.14   server1
cluster1-haproxy-2                                 2/2     Running            0             31m   10.42.3.6    server3
cluster1-pxc-0                                     3/3     Running            0             39m   10.42.3.5    server3
cluster1-pxc-1                                     3/3     Terminating        0             33m   10.42.2.6    server2
cluster1-pxc-2                                     3/3     Running            0             29m   10.42.1.6    server4
percona-xtradb-cluster-operator-566848cf48-4tgg4   1/1     Terminating        0             43m   10.42.2.4    server2
percona-xtradb-cluster-operator-566848cf48-mc4w8   1/1     CrashLoopBackOff   6 (10s ago)   19m   10.42.0.15   server1
Here is the problem. The new replica of the operator (percona-xtradb-cluster-operator-566848cf48-mc4w8) shows the following logs; the lock owner is the previous pod, percona-xtradb-cluster-operator-566848cf48-4tgg4:
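(The excerpt below comes from something like the following; I am assuming kubectl is pointed at the namespace the operator was installed into.)

# logs of the new operator replica
kubectl logs percona-xtradb-cluster-operator-566848cf48-mc4w8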
{"level":"info","ts":1650868726.2155206,"logger":"cmd","msg":"Runs on","platform":"kubernetes","version":"v1.22.7+k3s1"}
{"level":"info","ts":1650868726.2157087,"logger":"cmd","msg":"Git commit: 038082365e4e94cfdda40a20ce1b53fc098e5efb Git branch: release-1-10-0 Build time: 2021-11-17T16:46:03Z"}
{"level":"info","ts":1650868726.2157269,"logger":"cmd","msg":"Go Version: go1.17.3"}
{"level":"info","ts":1650868726.2157404,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1650868726.2157545,"logger":"cmd","msg":"operator-sdk Version: v0.19.4"}
{"level":"info","ts":1650868726.21606,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1650868727.059792,"logger":"leader","msg":"Found existing lock","LockOwner":"percona-xtradb-cluster-operator-566848cf48-4tgg4"}
{"level":"info","ts":1650868727.233043,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868728.4153585,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868730.8620806,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868735.4514487,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1650868744.2399726,"logger":"leader","msg":"Not the leader. Waiting."}
Is there a way to get the lock released, so that everything keeps working even in the event of such a network partition?
By running those iptables commands you are also blocking Kubernetes control-plane traffic, and that is what causes this lock issue. This is not an issue with our Operator, but an issue with your Kubernetes setup. I suggest you reach out on the various Kubernetes forums regarding this. A quick Google search for “kubernetes Trying to become the leader” reveals many posts on the topic. Good luck!
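To illustrate the point about blocked Kubernetes traffic, a quick check along these lines should show it (a rough sketch; 6443 is the default kube-apiserver port on k3s, and the node and pod names follow your listing above):

# from a non-isolated node: server2 goes NotReady, so its pods are only marked for
# deletion and stay in Terminating until the node becomes reachable again
kubectl get nodes
# from node2: the API server on the other control-plane nodes cannot be reached while
# the DROP rules are active (this times out), so the old operator pod is never fully
# removed and its lock is never garbage-collected
curl -k https://192.168.1.180:6443/healthz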