The issue is with the PVCs: they are automatically pinned to the node on which the corresponding pod was scheduled, so in case of a pod crash/restart, especially when using a few spot instances for additional capacity, the pod gets stuck in a Pending state because of a volume node affinity conflict. Currently, deleting the pod fixes the issue, but ideally this should be automatic.
The larger point is that using spot instances could save ~70% of the cost while giving us more firepower, since the pods/PVCs get scheduled automatically anyway. Meanwhile, some portions of the cluster are running on on-demand nodes.
Latest version of the Percona Operator.
AWS EKS 1.27.
A mix of on-demand and spot instances: mongos and a few replica set members on on-demand, the rest on spot.
Was able to solve this by putting:
annotations:
  volume.kubernetes.io/selected-node: None
Hope this won't break some other thing.
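For reference, a minimal sketch of where that annotation sits on the affected claim; the claim name, storage class, and size below are placeholders, not the operator's actual defaults:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongod-data-my-cluster-rs0-0         # placeholder for the claim the operator created
  annotations:
    volume.kubernetes.io/selected-node: None # the workaround annotation from above
spec:                                        # the existing spec stays unchanged
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3                      # placeholder
  resources:
    requests:
      storage: 50Gi                          # placeholder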
BTW, is this a good way to go? I am sometimes getting a Multi-Attach error for the volume; I don't know why that is happening as it's the same pod. It only throws this error when the instance dies and gets reprovisioned.
Cost- and database-performance-wise this could be a very good idea, as long as the DB never goes down, since some of it will be on on-demand instances anyway.
Hey @ahad_khan ,
thanks for submitting the question.
Interesting case.
- When you say that you use PVCs - do you use regular AWS EBS volumes or some other technology?
- Are you running across multiple Availability Zones? It might happen that AWS runs out of specific spot nodes in one AZ. Then you face a situation where the node is in AZ-1, but the volume was created in AZ-2.
Hey @ahad_khan,
For the multi-AZ problem, please have a look at topology-aware volumes: Troubleshooting topology aware volumes in EKS | AWS re:Post
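Roughly, a topology-aware setup on EKS boils down to a StorageClass with late binding; a sketch along these lines (the name, type, and zones are just examples):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-topology-aware                 # example name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer    # bind the volume only after the pod is scheduled
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.ebs.csi.aws.com/zone
        values:
          - us-east-1a
          - us-east-1b
With WaitForFirstConsumer the volume is created in the zone where the pod actually lands, instead of being provisioned upfront in an arbitrary zone.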
As for forceful termination: yeah, this might be the issue. Since you use spot instances, do you also use a spot termination handler (like GitHub - aws/aws-node-termination-handler: Gracefully handle EC2 instance shutdown within Kubernetes)? With it you will have time to drain the node before it is terminated.
Hello @Sergey_Pronin,
thanks for the prompt response.
- We have thoroughly gone through topology-aware volumes, but it feels like this wouldn't solve our issue. As per our understanding, topology-aware volumes make sure that, the first time the pod is scheduled, the volume is created only in the AZ where the node is. In our case, when the node is rescheduled into another AZ, our understanding is that this won't delete and recreate the PVC in the corresponding AZ. Please advise if this is correct!
- Yes, we have tested this and concluded that if the new spot node is rescheduled into the same AZ, then EKS gracefully handles pod replacement and reconnects it to the same PVC.
As for solutions, we feel there are only three at the moment.
- We force the node to be created in the same AZ by specifying that in the node group constraints (see the sketch after this list), though this isn't a good idea for obvious fault-tolerance reasons.
- We write another pod/Lambda that keeps checking for node affinity conflicts and deletes the volume when one occurs, so that EKS can recreate it in the new AZ where the EC2 instance is rescheduled.
- Somehow, from the Percona Operator side, we make sure that if there is a conflict like this, the operator deletes and recreates the volume in the correct AZ.
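For the first option, a rough sketch of what pinning a managed spot node group to a single AZ could look like with eksctl; the cluster name, region, instance types, and labels are placeholders for our actual setup:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster                      # placeholder
  region: us-east-1                     # placeholder
managedNodeGroups:
  - name: mongodb-spot-az-a
    availabilityZones: ["us-east-1a"]   # pin the group to one AZ so replacement nodes land next to the PV
    spot: true
    instanceTypes: ["m5.xlarge", "m5a.xlarge"]   # placeholders
    desiredCapacity: 2
    labels:
      workload: mongodb-replica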
Hello @Sergey_Pronin,
Here we are, back with an update:
- We have been successfully running Percona MongoDB on spot instances for the last few months.
- We have our primary (for writes) on on-demand and the read replicas on spot instances.
- For the PVC issue, we forced the read replicas to particular zones (a few replicas in one zone, a few in another, and so on; see the sketch after this list), so even if a replica is stopped due to a spot interruption, it comes back as soon as an instance comes back up in that zone, and we still have multiple zones for fault tolerance.
- With this we are able to save 60 percent of the cost on read replicas.
- All in all, everything works smoothly.
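For anyone finding this later, a rough sketch of how per-replica-set scheduling constraints might be expressed in the PerconaServerMongoDB custom resource; the node labels and taints are assumptions about how the spot node groups are tagged, field placement may vary by operator version, and keeping the write primary on on-demand nodes needs its own handling that isn't shown here:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: my-cluster                      # placeholder
spec:
  replsets:
    - name: rs0
      size: 3
      affinity:
        antiAffinityTopologyKey: topology.kubernetes.io/zone   # spread members across zones
      nodeSelector:
        node-lifecycle: spot            # assumption: spot node groups carry this label
      tolerations:
        - key: node-lifecycle           # assumption: spot nodes are tainted like this
          operator: Equal
          value: spot
          effect: NoSchedule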
Hello @ahad_khan ,
It is great to hear that everything worked! It is really cool that you use spot nodes in prod for databases. A valid use case; glad that you figured out the proper setup.
Hey @ahad_khan ,
I was playing with some multi-AZ deployments and it reminded me of the case you raised.
The way it works in GKE and AWS is that the CSI driver (csi.storage.gke.io or ebs.csi.aws.com) creates volumes with nodeAffinity for zonal clusters. For example:
nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions:
          - key: topology.gke.io/zone
            operator: In
            values:
              - us-central1-b
That way, when the pod is getting ready to be scheduled, the API checks the PersistentVolume affinity as well, and you would see a message like this:
6 node(s) had volume node affinity conflict
That way the pod waits for a node in the zone the PV is bound to. So having a Pod scheduled in another AZ is just not possible with modern CSI drivers.
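On EKS the EBS CSI driver does the same thing with an AWS topology key; a rough equivalent of the affinity block above (the zone value is just an example):
nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions:
          - key: topology.ebs.csi.aws.com/zone
            operator: In
            values:
              - us-east-1a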