I’m using the AWS EKS service. The etcd parameter max-request-bytes is set to 1.5 MB and is not editable (even support cannot change it).
I started the Percona Operator on a cluster that was not yet ready for it. The PerconaServerMongoDB resource was created and kept unsuccessfully trying to start pods. Eventually I fixed the issues with the cluster and the pods started.
Recently I tried to update the operator but got the error
etcd request too large. It appears all the attempts to start the cluster are logged in the perconaservermongodbs object’s .status.conditions:
$ kubectl get perconaservermongodbs project-dev-app1-pmongo -o yaml | grep lastTransitionTime | wc -l
The resource is full of lines like these:
- lastTransitionTime: "2020-01-22T08:43:23Z"
- lastTransitionTime: "2020-01-22T08:43:26Z"
- lastTransitionTime: "2020-01-22T08:43:29Z"
The size of the object is slightly more than 1.5 MB and I cannot edit it. I found no way of removing these lines from the object…
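To confirm how close the object is to the etcd limit, one way is to serialize it and measure the size directly (a sketch; the resource and object names are from my setup, and `/tmp/psmdb.json` is just a scratch path):

```shell
# Serialize the object to a file
kubectl get perconaservermongodbs project-dev-app1-pmongo -o json > /tmp/psmdb.json

# Total serialized size in bytes; etcd rejects writes above max-request-bytes
wc -c < /tmp/psmdb.json

# Number of accumulated status conditions
grep -c lastTransitionTime /tmp/psmdb.json
```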
Does anyone have the same issue? Is there a way to edit the CRD object without deleting it?
PS: kubectl edit perconaservermongodbs project-dev-app1-pmongo does not work. It reports that the object was edited, but nothing changes.
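One possible explanation for the silent edit (an assumption on my part): if the CRD has the status subresource enabled, changes to .status made through the main resource are silently dropped by the API server, which would match “reports edited but nothing changes”. With a recent kubectl (v1.24+ added the --subresource flag; whether your cluster version supports it is an assumption) one could try clearing the conditions through the status subresource:

```shell
# Attempt to empty the accumulated conditions via the status subresource.
# Assumptions: kubectl v1.24+ and the CRD exposes the status subresource;
# on older kubectl versions this flag does not exist.
kubectl patch perconaservermongodbs project-dev-app1-pmongo \
  --subresource=status --type=merge \
  -p '{"status":{"conditions":[]}}'
```

Note the operator may repopulate conditions afterwards, so this would only buy time rather than fix the root cause.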
For those who might have the same issue:
AWS support agreed to increase the etcd limit. After they applied the change, new lines began to appear:
- lastTransitionTime: "2020-03-02T05:32:21Z"
- lastTransitionTime: "2020-03-02T05:32:22Z"
so the object will hit the new limit in a while…
Note: I use Terraform to maintain the infrastructure. I have a Helm module that creates the Percona Operator in Kubernetes. There are 5 MongoDB clusters created from the same module.
I re-created the faulty cluster and even re-created the PVC of the MongoDB data folder; basically, I deleted everything related to the faulty CRD instance and created it from scratch, but the issue was not fixed. BTW, the cluster state was OK the whole time: after the recreation, once the replica set got synced, the state became OK again, with no errors in any logs (operator, coordinator, or mongodb pods, or Kubernetes events). I tried to debug but found nothing… so I deleted the faulty cluster one more time (only one of the clusters has this issue). The next day, using the same Terraform module, I created the cluster again and the issue was gone. I have no idea what this was or why it got fixed…
If someone faces the same issue and finds the root cause, it would be useful to mention it here.
I have an assumption but did not have a chance to check it: the operator container version might be a faulty one. Maybe the
if status.Status != currentRSstatus.Status check in a controller had different logic in that version… this is just my guess…
Thanks for updating this post with your solution; it is much appreciated.
I will share your post with the Kubernetes team so they are aware and in case they would like to add anything.