Hello, we are having some issues with a psmdb replica set. Since we updated the operator and psmdb to 1.14, this has happened a couple of times on 2 different Kubernetes clusters.
At random, a replica set node reports a connection error, then we see "msg":"SSL peer certificate validation failed","attr":{"reason":"self signed certificate"}, and all 3 nodes go down; the pods keep restarting until we kill all 3 pods and the replica set is restarted node by node.
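For reference, one way to check whether the certificates mongod is complaining about have expired or been regenerated (a sketch only, assuming the operator's default <cluster-name>-ssl secret naming, i.e. psmdb-db-ssl / psmdb-db-ssl-internal in the percona namespace):

# Inspect the CA and server certificates the operator generated (secret names assumed)
kubectl get secret psmdb-db-ssl -n percona -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -subject -issuer -enddate
kubectl get secret psmdb-db-ssl -n percona -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -issuer -enddate
# The internal secret used between cluster members, if present
kubectl get secret psmdb-db-ssl-internal -n percona -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -issuer -enddate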
These are the first errors logged by ReplicaSetMonitor-TaskExecutor when it happened.
The operator states the cluster was changed and throws this error:
Thank you for your help
Hi @Slavisa_Milojkovic,
Does the issue happen on its own or after a change to the custom resource? And could you please share the full operator logs so we can investigate?
It happened 2 times on its own; no changes were made to psmdb. After the helm update it worked fine for some days, and then this happened out of the blue.
These are the operator logs from when I restarted the replica set rs0 node pods.
I attached the latest operator log:
psmdb-operator-b85979d4f-4tt4j.log (37.4 KB)
If you can share the operator logs from when the problem happened, it'd be much more helpful.
Also, could you please share your cr.yaml (kubectl get psmdb psmdb-db -o yaml) and helm values.yaml?
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"psmdb.percona.com/v1","kind":"PerconaServerMongoDB"}
    meta.helm.sh/release-name: psmdb-db
    meta.helm.sh/release-namespace: percona
  creationTimestamp: "2023-02-03T11:06:26Z"
  finalizers:
  - delete-psmdb-pods-in-order
  generation: 4
  labels:
    app.kubernetes.io/instance: psmdb-db
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: psmdb-db
    app.kubernetes.io/version: 1.14.0
    helm.sh/chart: psmdb-db-1.14.0
  name: psmdb-db
  namespace: percona
  resourceVersion: "45983449"
  uid: 5114271a-317c-4bc9-8abb-ca08ba45750f
spec:
  backup:
    enabled: true
    image: percona/percona-backup-mongodb:2.0.4
    pitr:
      compressionLevel: 6
      compressionType: gzip
      enabled: true
      oplogSpanMin: 10
    serviceAccountName: percona-server-mongodb-operator
    storages:
      minio:
        s3:
          bucket: k8s-pd-prod-percona
          credentialsSecret: psmdb-backup-minio
          endpointUrl: https://os-us-west-1.webprovise.io:9000/
          prefix: prod_
          region: us-west-1
        type: s3
    tasks:
    - compressionType: gzip
      enabled: true
      keep: 10
      name: daily-minio
      schedule: 0 0 * * *
      storageName: minio
  crVersion: 1.14.0
  image: percona/percona-server-mongodb:6.0.4-3
  imagePullPolicy: Always
  multiCluster:
    enabled: false
  pause: false
  pmm:
    enabled: false
    image: percona/pmm-client:2.35.0
    serverHost: monitoring-service
  replsets:
  - affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    arbiter:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      enabled: false
      size: 1
    expose:
      enabled: false
      exposeType: ClusterIP
    name: rs0
    nonvoting:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      enabled: false
      podDisruptionBudget:
        maxUnavailable: 1
      resources:
        limits:
          cpu: 300m
          memory: 0.5G
        requests:
          cpu: 300m
          memory: 0.5G
      size: 3
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 3Gi
    podDisruptionBudget:
      maxUnavailable: 1
    priorityClassName: system-node-critical
    resources:
      limits:
        cpu: 1000m
        memory: 1G
      requests:
        cpu: 300m
        memory: 0.5G
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 3Gi
  secrets:
    encryptionKey: psmdb-mongodb-encryption-key
    users: psmdb-users
  sharding:
    configsvrReplSet:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      expose:
        enabled: false
        exposeType: ClusterIP
      podDisruptionBudget:
        maxUnavailable: 1
      priorityClassName: system-node-critical
      resources:
        limits:
          cpu: 300m
          memory: 0.5G
        requests:
          cpu: 300m
          memory: 0.5G
      size: 3
      volumeSpec:
        persistentVolumeClaim:
          accessModes:
          - ReadWriteMany
          resources:
            requests:
              storage: 3Gi
          storageClassName: nfs-client
    enabled: true
    mongos:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      expose:
        exposeType: ClusterIP
      podDisruptionBudget:
        maxUnavailable: 1
      priorityClassName: system-node-critical
      resources:
        limits:
          cpu: 300m
          memory: 0.5G
        requests:
          cpu: 300m
          memory: 0.5G
      size: 3
  unmanaged: false
  updateStrategy: SmartUpdate
  upgradeOptions:
    apply: disabled
    schedule: 0 2 * * *
    setFCV: false
    versionServiceEndpoint: https://check.percona.com
status:
  conditions:
  - lastTransitionTime: "2023-04-10T01:58:30Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:04:00Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:04:00Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:09:59Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:09:59Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:15:31Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:15:31Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T02:38:31Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T02:38:31Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T06:26:53Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T06:26:53Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T06:43:35Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T06:43:35Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T06:49:17Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T06:49:17Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T07:13:16Z"
    reason: MongosReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T07:13:16Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T07:13:26Z"
    message: 'rs0: ready'
    reason: RSReady
    status: "True"
    type: ready
  - lastTransitionTime: "2023-04-10T07:13:26Z"
    status: "True"
    type: initializing
  - lastTransitionTime: "2023-04-10T07:13:32Z"
    status: "True"
    type: ready
  host: psmdb-db-mongos.percona.svc.cluster.local
  mongoImage: percona/percona-server-mongodb:6.0.4-3
  mongoVersion: 6.0.4-3
  mongos:
    ready: 3
    size: 3
    status: ready
  observedGeneration: 4
  ready: 9
  replsets:
    cfg:
      initialized: true
      ready: 3
      size: 3
      status: ready
    rs0:
      added_as_shard: true
      initialized: true
      ready: 3
      size: 3
      status: ready
  size: 9
  state: ready
I’ve updated the first post with operator logs from when the issue started.
I only see symptoms in the logs (rs pods are not ready) but nothing that points to the cause. Could you please provide:
- Your Kubernetes platform and version
- Full logs from each replicaset pod from the time the problem happened
- Full logs from operator pod from the time the problem happened
Also, do you have anything else running on the Kubernetes cluster? Did you see any other problems? Is there any chance the Kubernetes cluster itself had issues (e.g. network problems)?
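For reference, commands along these lines should collect those logs (deployment, pod, and container names are assumed from the defaults seen in this thread):

# Operator logs
kubectl logs deploy/psmdb-operator -n percona > psmdb-operator.log
# mongod logs from each rs0 member, including the previously crashed container if any
for i in 0 1 2; do
  kubectl logs psmdb-db-rs0-$i -n percona -c mongod > psmdb-db-rs0-$i.log
  kubectl logs psmdb-db-rs0-$i -n percona -c mongod --previous > psmdb-db-rs0-$i-previous.log || true
done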
Here is the values file. We are using sealed secrets for the credentials and the encryption key; could that be the problem?
psmdb-values.txt (10.3 KB)
We are using sealed secrets for the credentials and the encryption key; could that be the problem?
I doubt it. At least it doesn’t explain pods becoming unready randomly.
I see you’re using NFS for persistent volumes. Is there any chance you lose connectivity to your NFS server, even for a short period?
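A quick way to look for that around the time of an incident (just a sketch; adjust the namespace and time window as needed):

# Recent events mentioning mount/volume/NFS problems in the percona namespace
kubectl get events -n percona --sort-by=.lastTimestamp | grep -i -E 'nfs|mount|volume|timeout'
# PVCs in the namespace, including the nfs-client-backed config server claims
kubectl get pvc -n percona -o wide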
Could be possible. Most of the errors are connection issues. I'll monitor it further and get back to you.
Hello, I still have this issue. I have narrowed it down, and it seems the operator can't recover the cluster once it starts the smart update. Can this update be disabled, and how does it detect that a StatefulSet is not up to date?
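For reference, a minimal sketch of switching away from SmartUpdate while debugging, assuming the cluster name and namespace from this thread and that a plain rolling update is acceptable (a later helm upgrade may set the field back):

# Stop the operator from orchestrating restarts itself; RollingUpdate and OnDelete are the other supported values
kubectl patch psmdb psmdb-db -n percona --type=merge -p '{"spec":{"updateStrategy":"RollingUpdate"}}'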
So when this happens, 2 psmdb-cfg pods are restarted and the third one is not (its age is different); also, all 3 mongos pods are showing a readiness error.
When I delete the third cfg pod, the operator continues restarting the rs0 pods one by one and then the mongos pods one by one. It seems to me that it starts restarting the cfg pods and gets stuck on the third one; once I delete that pod and it is recreated, the operator continues restarting the rest of the psmdb pods. Please check the attached log and help me fix this issue if possible. Thank you.
psmdb-operator-6f9bf9585c-4xnpc.log (54.3 KB)
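For anyone following along, a sketch of the manual workaround described above, with the pod name assumed to follow the default <cluster>-cfg-<ordinal> pattern:

# Find the cfg pod that was not restarted (its AGE differs from the other two)
kubectl get pods -n percona -o wide | grep cfg
# Delete it so the StatefulSet recreates it and the operator can continue the rolling restart
kubectl delete pod psmdb-db-cfg-2 -n percona   # replace psmdb-db-cfg-2 with the pod that was left behind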