Questions about updateStrategy

Description:

We have these settings on our prod cluster:

updateStrategy: SmartUpdate
upgradeOptions:
  versionServiceEndpoint: https://check.percona.com
  apply: disabled
  schedule: "0 4 * * *"
enableCRValidationWebhook: false

Occasionally, SmartUpdate is triggered for no apparent reason.

Steps to Reproduce:

It happens at random.

Version:

8.0.32-24.2

Logs:

2024-05-28T04:01:55+02:00 {"level":"info","ts":1716861715.9441552,"logger":"perconaxtradbcluster","caller":"pxc/version.go:61","msg":"add new job","cluster":"pxc-db","namespace":"percona","schedule":"0 4 * * *"}
2024-05-28T04:01:55+02:00 {"level":"info","ts":1716861715.9442644,"logger":"perconaxtradbcluster","caller":"pxc/version.go:103","msg":"add new job","cluster":"pxc-db","namespace":"percona","name":"ensure-version/percona/pxc-db","schedule":"0 4 * * *"}
2024-05-29T11:46:17+02:00 [mysql] 2024/05/29 09:46:17 packets.go:37: unexpected EOF
2024-05-29T21:17:36+02:00 2024-05-29T19:17:36.192Z	ERROR	Reconciler error	{"controller": "perconaxtradbcluster-controller", "object": {"name":"pxc-db","namespace":"percona"}, "namespace": "percona", "name": "pxc-db", "reconcileID": "ada5f6e3-69b1-4784-8452-f92f80836669", "error": "reconcile users: manage sys users: is old password discarded: select User_attributes field: dial tcp: lookup pxc-db-pxc-unready.percona on 10.43.0.10:53: no such host", "errorVerbose": "dial tcp: lookup pxc-db-pxc-unready.percona on 10.43.0.10:53: no such host\nselect User_attributes field\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/pxc/users.(*Manager).IsOldPassDiscarded\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/pxc/users/users.go:172\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).isOldPasswordDiscarded\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/users.go:1017\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).handleClustercheckUser\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/users.go:584\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).updateUsers\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/users.go:169\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).reconcileUsers\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/users.go:107\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:295\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nis old password 
discarded\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).isOldPasswordDiscarded\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/users.go:1019\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).handleClustercheckUser\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/users.go:584\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).updateUsers\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/users.go:169\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).reconcileUsers\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/users.go:107\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:295\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nmanage sys users\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).reconcileUsers\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/users.go:109\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:295\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nreconcile 
users\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"}
2024-05-29T21:17:36+02:00 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-05-29T21:17:36+02:00 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:326
2024-05-29T21:17:36+02:00 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-05-29T21:17:36+02:00 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273
2024-05-29T21:17:36+02:00 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-05-29T21:17:36+02:00 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234
2024-05-31T16:41:52+02:00 {"level":"info","ts":1717166512.725894,"logger":"perconaxtradbcluster","caller":"pxc/upgrade.go:267","msg":"statefulSet was changed, run smart update","cluster":"pxc-db","namespace":"percona"}
2024-05-31T16:41:52+02:00 {"level":"info","ts":1717166512.7617571,"logger":"perconaxtradbcluster","caller":"pxc/upgrade.go:295","msg":"primary pod","cluster":"pxc-db","namespace":"percona","pod name":"pxc-db-pxc-0.pxc-db-pxc.percona"}
2024-05-31T16:41:52+02:00 {"level":"info","ts":1717166512.761869,"logger":"perconaxtradbcluster","caller":"pxc/upgrade.go:312","msg":"apply changes to secondary pod","cluster":"pxc-db","namespace":"percona","pod name":"pxc-db-pxc-2"}
2024-05-31T16:42:52+02:00 {"level":"info","ts":1717166572.9808805,"logger":"perconaxtradbcluster","caller":"pxc/upgrade.go:590","msg":"pod is running","cluster":"pxc-db","namespace":"percona","pod name":"pxc-db-pxc-2"}
2024-05-31T16:42:52+02:00 {"level":"info","ts":1717166572.9911234,"logger":"perconaxtradbcluster","caller":"pxc/upgrade.go:312","msg":"apply changes to secondary pod","cluster":"pxc-db","namespace":"percona","pod name":"pxc-db-pxc-1"}
2024-05-31T16:44:13+02:00 {"level":"info","ts":1717166653.1513507,"logger":"perconaxtradbcluster","caller":"pxc/upgrade.go:590","msg":"pod is running","cluster":"pxc-db","namespace":"percona","pod name":"pxc-db-pxc-1"}
2024-05-31T16:44:13+02:00 {"level":"info","ts":1717166653.1639895,"logger":"perconaxtradbcluster","caller":"pxc/upgrade.go:319","msg":"apply changes to primary pod","cluster":"pxc-db","namespace":"percona","pod name":"pxc-db-pxc-0"}
2024-05-31T16:45:23+02:00 {"level":"info","ts":1717166723.3564456,"logger":"perconaxtradbcluster","caller":"pxc/upgrade.go:590","msg":"pod is running","cluster":"pxc-db","namespace":"percona","pod name":"pxc-db-pxc-0"}
2024-05-31T16:45:23+02:00 {"level":"info","ts":1717166723.365791,"logger":"perconaxtradbcluster","caller":"pxc/upgrade.go:324","msg":"smart update finished","cluster":"pxc-db","namespace":"percona"}

Expected Result:

MySQL should have no downtime

Actual Result:

PXC pods are restarted one by one

Additional Information:

Our Laravel PHP app gets the error “SQLSTATE[HY000]: General error: 2006 MySQL server has gone away”.

I think it happens when the network is lagging or having issues.
I know we can change updateStrategy to manual, but I understand that’s not recommended. I’ve also seen online that max_allowed_packet can be the issue, but we have it set to 256M.

I was wondering: is there any setting I can add so that PXC waits for all connections and queries to finish before a pod is restarted? Or is the only option to disable SmartUpdate?
Thanks

Hi @Slavisa_Milojkovic, are you sure you had a restart even though the CR and secrets were not changed? Or maybe someone changed the StatefulSet directly?

No, we don’t make any changes on prod before testing on stage, and only a couple of us have prod access, so no one changed the StatefulSet. We also never change it manually, only by deploying an updated Helm chart. PXC had 2+ months of uptime before this restart.
Is it possible that cert-manager rotated the TLS certs? I didn’t check.
But I suspect this:

2024-05-29T11:46:17+02:00 [mysql] 2024/05/29 09:46:17 packets.go:37: unexpected EOF

I noticed the network was flaky, and I had noticed that before too. It also happens with the psmdb operator, but not with the postgres-operator. We run them all on different clusters.

Can we somehow extend the timeout or retry logic the operator uses when checking the StatefulSet, so we can ride out network issues? (The network was available but lagging; maybe there was a temporary VPN or router problem.)

Or make the active (master) pxc pod wait for active queries/connections to finish before it is restarted?

Or make the active (master) pxc pod wait for active queries/connections to finish before it is restarted?

We do not have such functionality now. The SmartUpdate strategy just detects the primary pod and restarts it last.
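
For illustration, you can see which pod the operator picked as primary during a smart update by grepping its logs for the same “primary pod” message shown above. The operator deployment name below is an assumption; adjust it to your install:

# Assumes the operator runs as a Deployment named percona-xtradb-cluster-operator in the "percona" namespace.
kubectl -n percona logs deploy/percona-xtradb-cluster-operator | grep 'primary pod'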

We will try to simulate the network issue to reproduce this problem.

From your end, please check the diff of the StatefulSet revisions; maybe you can find something:
kubectl rollout history statefulset <sts_name> --revision=<last_revision_number>
kubectl rollout history statefulset <sts_name> --revision=<previous_revision_number>
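
For example, a minimal sketch of diffing two revisions (11 and 12 are placeholder revision numbers; substitute the ones your cluster reports):

# Dump the pod template of each revision and diff them.
kubectl -n percona rollout history statefulset pxc-db-pxc --revision=11 > rev-11.txt
kubectl -n percona rollout history statefulset pxc-db-pxc --revision=12 > rev-12.txt
diff rev-11.txt rev-12.txt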

@Slava_Sarzhan When I compare the StatefulSet revisions, there are differences in the ssl-hash and ssl-internal-hash values (configuration-hash and all other fields are always the same):

  Annotations:  percona.com/configuration-hash: XXXXXXXXXXXXXXXXXXXXXXXXXXX
                percona.com/ssl-hash: XXXXXXXXXXXXXXXXXX
                percona.com/ssl-internal-hash: XXXXXXXXXXXXXXXXXXXXX
statefulset.apps/pxc-db-pxc 
REVISION  CHANGE-CAUSE
1         <none>
2         <none>
3         <none>
4         <none>
5         <none>
6         <none>
7         <none>
8         <none>
9         <none>
10        <none>
11        <none>
12        <none>

Do you use cert-manager? It could be the certificate renewal process. Could you please check your certificates?
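
For example (assuming cert-manager’s CRDs are installed and the Certificate/Secret names follow the operator’s <cluster>-ssl convention):

# List cert-manager Certificates in the cluster namespace.
kubectl -n percona get certificates.cert-manager.io

# Check the validity window cert-manager tracks for the main cert.
kubectl -n percona get certificate pxc-db-ssl -o jsonpath='{.status.notBefore} {.status.notAfter} {.status.renewalTime}{"\n"}'

# Cross-check against the actual certificate stored in the TLS secret.
kubectl -n percona get secret pxc-db-ssl -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates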

Yes, I see both PXC certs were renewed at the same time the restart occurred.

  notAfter: '2024-08-29T14:41:50Z'
  notBefore: '2024-05-31T14:41:50Z'
  renewalTime: '2024-07-30T14:41:50Z'
  revision: 10

I don’t suppose we can manually edit the renewal period, since the PXC issuer is controlled by the operator? It would be a nice feature to be able to configure the SSL cert renewal period; I would probably extend it to one year.
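
For reference, in cert-manager generally the renewal timing comes from spec.duration and spec.renewBefore on the Certificate resource; I assume a manual patch like the sketch below would just be reconciled away, since the operator owns the object:

# spec.duration / spec.renewBefore are standard cert-manager fields; the
# operator manages this Certificate, so it may revert this on the next
# reconcile. Treat it as an experiment, not a supported knob.
kubectl -n percona patch certificate pxc-db-ssl --type merge -p '{"spec":{"duration":"8760h","renewBefore":"720h"}}'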
Thanks for the help.