PVC tainted with IP of node, pod stuck in Pending after node restart/crash

The issue is with the PVCs: they are automatically tainted with the IP of the node on which the corresponding pod was scheduled. After a pod crash/restart, especially when using a few spot instances for additional capacity, the pod gets stuck in a Pending state because of a volume affinity conflict. Currently deleting the pod fixes the issue, but ideally this should be automatic.

The larger point is that using spot instances could save 70% in cost with better firepower, as the pod/PVC get rescheduled automatically anyway. Meanwhile, some portions of the cluster are running on on-demand nodes.

Latest version of the Percona operator.
AWS EKS 1.27.
Mix of on-demand and spot instances: mongos and a few replica sets on on-demand, the rest on spot.

I was able to solve this by setting:
volume.kubernetes.io/selected-node: None

hope this won’t break some other thing.

By the way, is this a good way to go? I sometimes get a Multi-Attach error for the volume, and I don’t know why, since it’s the same pod. It only shows up when the instance dies and gets reprovisioned.

Cost- and performance-wise this could be a very good idea, as long as the DB never goes down, since some of it will be on on-demand instances anyway.

Hey @ahad_khan ,

thanks for submitting the question.
Interesting case.

  1. When you say that you use PVCs - do you use regular AWS EBS volumes or some other technology?
  2. Are you running across multiple Availability Zones? It might happen that AWS runs out of specific spot nodes in one AZ. Then you face a situation that the node is in AZ-1, but the volume was created in AZ-2.
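
To make the AZ mismatch above concrete, here is a minimal Python sketch of how a volume's required node affinity could be checked against a node's labels. It assumes the PersistentVolume carries a `spec.nodeAffinity.required` block keyed on a zone label, as EBS-backed volumes typically do. The label key, the zone names, and the helper `node_satisfies_pv` are illustrative and not taken from the thread, and only the `In` operator on `matchExpressions` is handled.

```python
from typing import Any, Dict

# Assumed label key; volumes created by the EBS CSI driver may use
# "topology.ebs.csi.aws.com/zone" instead.
ZONE_KEY = "topology.kubernetes.io/zone"


def node_satisfies_pv(pv: Dict[str, Any], node_labels: Dict[str, str]) -> bool:
    """Return True if the node's labels satisfy the PV's required node affinity.

    Selector terms are ORed; expressions inside one term are ANDed.
    Only the "In" operator is handled in this sketch.
    """
    required = pv.get("spec", {}).get("nodeAffinity", {}).get("required")
    if not required:
        return True  # no affinity recorded: the volume is not tied to a zone
    for term in required.get("nodeSelectorTerms", []):
        expressions = term.get("matchExpressions", [])
        # An empty term matches nothing, so require at least one expression.
        if expressions and all(
            expr.get("operator") == "In"
            and node_labels.get(expr["key"]) in expr.get("values", [])
            for expr in expressions
        ):
            return True
    return False


# Hypothetical volume created in one zone, and two candidate nodes.
pv = {
    "spec": {
        "nodeAffinity": {
            "required": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": ZONE_KEY, "operator": "In", "values": ["us-east-1a"]}
                        ]
                    }
                ]
            }
        }
    }
}
same_zone_node = {ZONE_KEY: "us-east-1a"}
other_zone_node = {ZONE_KEY: "us-east-1b"}

print(node_satisfies_pv(pv, same_zone_node))
print(node_satisfies_pv(pv, other_zone_node))
```

A replacement spot node that lands in a different zone fails this check, which is the situation the scheduler reports as a volume affinity conflict.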

Hello @Sergey_Pronin,

thanks for the response.

  1. As for the PVCs, yes, we just deploy with the default EBS volumes via Helm, without many changes to the default values.yaml.
  2. Yes, you are right, this does seem the most probable explanation. There are actually three issues at the moment:
    a. The first is the node affinity issue: when the AZ/IP affinity is attached to the PVC/pod, the pod won’t schedule and gets stuck. I believe this could be solved by using NONE in the values.yaml file for the respective tags.
    b. Yes, it seems a PVC in one AZ can’t attach when the node is scheduled in another AZ. Either we automatically delete the PVC and recreate it, or we make sure the spot instance is in the same AZ.
    c. The third is the Multi-Attach issue. After further digging, we realized this might be happening because we were forcefully terminating instances from the EC2 console for testing, in which case EKS doesn’t get time to detach the PVC. That seems the most probable cause. Any other ideas are of course appreciated.

hey @ahad_khan ,

For the multi-AZ problem, please have a look at topology-aware volumes: Troubleshooting topology aware volumes in EKS | AWS re:Post

As for forceful termination: yeah, this might be the issue. Since you use spot instances, do you also use a spot termination handler (like GitHub - aws/aws-node-termination-handler: Gracefully handle EC2 instance shutdown within Kubernetes)? With it you will have time to drain the node before it is terminated.

Hello @Sergey_Pronin,
thanks for the prompt response.

  1. We have gone through topology-aware volumes thoroughly, but it feels like this wouldn’t solve our issue. As per our understanding, topology-aware volumes make sure that the first time the pod is scheduled, the volume is created only in the AZ where the node is created. In our case, when the node is rescheduled into another AZ, our understanding is that this won’t delete the PVC and recreate it in the corresponding AZ. Please advise if this is correct!
  2. Yes, we have tested this and concluded that when the new spot node is rescheduled into the same AZ, EKS gracefully handles pod replacement and connecting to the same PVC.

As for solutions, we feel there are only three at the moment.

  1. We force the node to be created in the same AZ by specifying that in the node group constraints, though this isn’t a good idea for obvious reasons of fault tolerance.
  2. We write another pod/Lambda that keeps checking for a node affinity conflict and deletes the volume when that happens, so that EKS can recreate it in the new AZ where the EC2 instance is rescheduled.
  3. Somehow, from the Percona operator, we make sure that if there is a conflict like this, the operator deletes the volume and recreates it in the correct AZ.
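
The detection half of option 2 could look roughly like the Python sketch below. It is only an illustration under stated assumptions: it works on pod objects already fetched as plain dicts, it assumes the scheduler reports the problem in the `PodScheduled` condition with a message containing "volume node affinity conflict", and the pod and claim names are hypothetical. Deleting anything is deliberately left out.

```python
from typing import Any, Dict, List

# Assumed wording of the scheduler message for this failure.
CONFLICT_MARKER = "volume node affinity conflict"


def has_volume_affinity_conflict(pod: Dict[str, Any]) -> bool:
    """True if a Pending pod is unschedulable because of a volume affinity conflict."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return False
    for cond in status.get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and CONFLICT_MARKER in cond.get("message", "")
        ):
            return True
    return False


def claims_of(pod: Dict[str, Any]) -> List[str]:
    """Names of the PVCs mounted by the pod."""
    return [
        vol["persistentVolumeClaim"]["claimName"]
        for vol in pod.get("spec", {}).get("volumes", [])
        if "persistentVolumeClaim" in vol
    ]


# Hypothetical pod stuck after its spot node came back in another AZ.
stuck_pod = {
    "metadata": {"name": "example-rs0-1"},
    "spec": {
        "volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": "data-example-rs0-1"}}
        ]
    },
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 3 node(s) had volume node affinity conflict.",
            }
        ],
    },
}

if has_volume_affinity_conflict(stuck_pod):
    print("conflict on claims:", claims_of(stuck_pod))
```

Keep in mind that deleting the PVC discards that member's data, so the replica would have to resync from the rest of the replica set after the volume is recreated.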

Hello @Sergey_Pronin,

Here we are back with the update

  1. We have been running Percona MongoDB successfully on spot instances for the last few months.
  2. We have our primary for writes on on-demand instances and the read replicas on spot instances.
  3. For the PVC issue, we pinned read replicas to particular zones: a few replicas in one zone, a few in another, and so on. Even when a replica stops because its spot instance is reclaimed, it comes back once an instance comes back up in that zone, and we still have multiple zones for fault tolerance.
  4. With this we are able to save 60 percent of the cost on read replicas.
  5. All in all, everything works smoothly.
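
For readers who want to try the zone pinning from point 3, the sketch below builds a standard Kubernetes `nodeAffinity` block per group of replicas in Python. The thread does not show the actual configuration, so the group names, the zone names, and the helper `zone_affinity` are all hypothetical; where the resulting block goes in the operator's custom resource or Helm values is also left as an assumption to verify against the operator documentation.

```python
from typing import Any, Dict, List

# Well-known node label for the availability zone.
ZONE_KEY = "topology.kubernetes.io/zone"


def zone_affinity(zones: List[str]) -> Dict[str, Any]:
    """Build a required nodeAffinity that only allows nodes in the given zones."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": ZONE_KEY, "operator": "In", "values": list(zones)}
                        ]
                    }
                ]
            }
        }
    }


# Hypothetical layout: each group of read replicas is pinned to a single zone,
# so its pods and EBS volumes always end up in the same AZ.
replica_groups = {
    "readers-a": ["us-east-1a"],
    "readers-b": ["us-east-1b"],
}
affinities = {name: zone_affinity(zones) for name, zones in replica_groups.items()}

for name, affinity in affinities.items():
    print(name, affinity)
```

Pinning each group to one zone trades some scheduling freedom for the guarantee that a replacement spot node can always reattach the existing volume.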

Hello @ahad_khan ,

It is great to hear that everything worked! It is really cool that you use spot nodes in prod for databases. Valid use case, glad that you figured out the proper setup.