PVC tainted with node IP; pod stuck in Pending after node restart/crash

The issue is with the PVCs: they are automatically tainted with the IP of the node on which the corresponding pod was scheduled. After a pod crash/restart, especially when using a few spot instances for additional capacity, the pod gets stuck in a Pending state because of a volume node affinity conflict. Currently, deleting the pod fixes the issue, but ideally this recovery should be automatic.

The larger point is that using spot instances could save 70% of the cost with better firepower, since the pod/PVC get rescheduled automatically anyway. Meanwhile, some portions of the cluster keep running on on-demand nodes.

Latest version of the Percona Operator.
AWS EKS 1.27
A mix of on-demand and spot instances: mongos and a few replica set members on on-demand, the rest on spot.

I was able to solve this by setting the annotation:

```yaml
annotations:
  volume.kubernetes.io/selected-node: None
```

I hope this won't break something else.
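For a PVC that is already stuck Pending, the scheduler's node selection can also be cleared in place instead of deleting the pod. A sketch (`my-pvc` is a placeholder name; run against the PVC's namespace):

```shell
# Remove the scheduler's node selection from the PVC so it can be
# re-bound when the pod is rescheduled onto a different node.
kubectl patch pvc my-pvc --type merge \
  -p '{"metadata":{"annotations":{"volume.kubernetes.io/selected-node":null}}}'
```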

By the way, is this a good way to go? I am also sometimes getting a Multi-Attach error for the volume; I don't know why, since it's the same pod. It only happens when the instance dies and gets reprovisioned.

Cost- and database-performance-wise this could be a very good idea; the only requirement is that the DB should never go down, and part of it will be on on-demand instances anyway.

Hey @ahad_khan ,

thanks for submitting the question.
Interesting case.

  1. When you say that you use PVCs - do you use regular AWS EBS volumes or some other technology?
  2. Are you running across multiple Availability Zones? It might happen that AWS runs out of specific spot nodes in one AZ. Then you face a situation that the node is in AZ-1, but the volume was created in AZ-2.

Hello @Sergey_Pronin,

thanks for the response.

  1. as for the PVCs, yes, we just deploy the volumes with the default EBS volumes, without many changes to the default values.yaml, using Helm.
  2. yes, you are right, this does seem the most probable explanation. Currently there are actually three issues:
    a. one is the node-affinity issue: when the AZ/IP affinity is attached to the PVC/pod, it won't schedule and gets stuck. I believe this could be solved by setting None for the respective tags in the values.yaml file.
    b. yes, it seems a PVC in one AZ can't attach when the node is scheduled in another AZ. Either we automatically delete and recreate the PVC, or we make sure the spot instance comes up in the same AZ.
    c. as for the Multi-Attach issue, after further digging we realized it might be happening because we were forcefully terminating instances from the EC2 console for testing, in which case EKS doesn't get time to detach the PVC. That seems the most probable cause; any other ideas are of course appreciated.

hey @ahad_khan ,

For the multi-AZ problem, please have a look at topology-aware volumes: Troubleshooting topology aware volumes in EKS | AWS re:Post
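For context, topology awareness for EBS usually comes down to the StorageClass binding mode: with `WaitForFirstConsumer`, volume creation is delayed until a pod is scheduled, so the EBS volume is provisioned in that node's AZ. A sketch (the class name is a placeholder; `ebs.csi.aws.com` is the EBS CSI driver provisioner):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-topology-aware   # placeholder name
provisioner: ebs.csi.aws.com
# Delay volume creation until a pod using the PVC is scheduled,
# so the EBS volume lands in the same AZ as the chosen node.
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```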

As for forceful termination - yeah, this might be the issue. Since you use spot instances - do you also use a spot-termination handler (like GitHub - aws/aws-node-termination-handler: Gracefully handle EC2 instance shutdown within Kubernetes)? With it you will have time to drain the node before it is terminated.
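A minimal install sketch for the termination handler via the official eks-charts Helm repository (the release name and namespace here are choices, not requirements):

```shell
# Add the AWS eks-charts repo and install the node termination handler
# into kube-system, so nodes get drained on spot interruption notices.
helm repo add eks https://aws.github.io/eks-charts
helm install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system
```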


Hello @Sergey_Pronin,
thanks for the prompt response.

  1. we have thoroughly gone through topology-aware volumes, but it doesn't look like this would solve our issue. As we understand it, topology-aware volumes ensure that the first time the pod is scheduled, the volume is created only in the AZ where the node is. In our case, when the node is rescheduled into another AZ, our understanding is that this won't delete the PVC and recreate it in the corresponding AZ. Please advise if this is correct!
  2. yes, we have tested this and concluded that when the new spot node is rescheduled into the same AZ, EKS gracefully handles replacing the pod and reattaching the same PVC.

As for solutions, we feel there are only three at the moment.

  1. we make sure the node is created in the same AZ by forcing that in the node-group constraints, though this isn't a great idea for obvious fault-tolerance reasons.
  2. we write another pod/Lambda that keeps checking for node-affinity conflicts and deletes the volume when one occurs, so that EKS can recreate it in the new AZ where the EC2 instance was rescheduled.
  3. somehow, from the Percona Operator side, we make sure that when a conflict like this occurs, the operator deletes the volume and recreates it in the correct AZ.
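Option 1 above can be expressed as a node-group constraint. A sketch using an eksctl config fragment (cluster, node-group names, region, zone, and instance types are all placeholders):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # placeholder
  region: us-east-1       # placeholder
managedNodeGroups:
  - name: mongodb-spot    # placeholder
    spot: true
    instanceTypes: ["m5.large", "m5a.large"]
    # Pin the node group to a single AZ so replacement spot nodes
    # come up where the existing EBS volumes live.
    availabilityZones: ["us-east-1a"]
```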

Hello @Sergey_Pronin,

Here we are back with an update.

  1. we were successfully able to run Percona MongoDB on spot instances for the last few months.
  2. we keep our primary (for writes) on on-demand and the read replicas on spot instances.
  3. for the PVC issue, we pinned the read replicas to particular zones, a few replicas in one zone and a few in another, and so on. Even if a replica is stopped because the spot instance goes away, it comes back once an instance comes up again in that zone, and we still have multiple zones for fault tolerance.
  4. with this we are able to save 60 percent of the cost on the read replicas.
  5. all in all, everything works smoothly.
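For anyone copying this setup, the zone pinning described in point 3 might look like the following fragment of the PerconaServerMongoDB custom resource. This is a sketch: field support varies by operator version, so verify the `nodeSelector` placement against the CR reference for your release; the zone value is a placeholder.

```yaml
# Fragment of a PerconaServerMongoDB custom resource (verify field
# names against your operator version's CR reference).
spec:
  replsets:
    - name: rs0
      size: 3
      # Pin this replica set's pods to one zone so replacement spot
      # nodes come back where the EBS volumes already exist.
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a   # placeholder zone
```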

Hello @ahad_khan ,

this is great to hear that everything worked! It is really cool that you use spot nodes in prod for databases. Valid use case, glad that you figured out the proper setup.