PXC cluster CrashLoopBackOff

Hi! My PXC 5.7 cluster is crash looping with the following error:

2021-11-22T08:57:27.796235Z 0 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.7.34-37 started; log sequence number 2264189188
2021-11-22T08:57:27.796281Z 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2021-11-22T08:57:27.797037Z 0 [Note] Plugin 'FEDERATED' is disabled.
2021-11-22T08:57:27.809704Z 0 [Note] InnoDB: Starting recovery for XA transactions...
2021-11-22T08:57:27.809727Z 0 [Note] InnoDB: Transaction 12760 in prepared state after recovery
2021-11-22T08:57:27.809731Z 0 [Note] InnoDB: Transaction contains changes to 1 rows
2021-11-22T08:57:27.809736Z 0 [Note] InnoDB: 1 transactions in prepared state after recovery
2021-11-22T08:57:27.809739Z 0 [Note] Found 1 prepared transaction(s) in InnoDB
2021-11-22T08:57:27.809753Z 0 [Warning] WSREP: Discovered discontinuity in recovered wsrep transaction XIDs. Truncating the recovery list to 0 entries
2021-11-22T08:57:27.809757Z 0 [Note] WSREP: Last wsrep seqno to be recovered 2656
2021-11-22T08:57:27.809852Z 0 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
2021-11-22T08:57:27.809862Z 0 [ERROR] Aborting
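
For reference, this is the switch the [ERROR] line is talking about; the line below is only my reading of the message, not something I have dared to run inside the operator-managed pod:

mysqld --tc-heuristic-recover=ROLLBACK   # or COMMIT; the server resolves the prepared XA transaction and then exits, so it must be restarted afterwards without the flag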

I'm not sure what to do, since I'm using the operator installed with the pxc-operator chart and an instance installed with the pxc-db chart.

So, what should I do? Why doesn't the operator handle this case automatically? Why is only 1 replica out of 3 crash looping while the other two are OK? And why are all the HAProxy pods in front of the PXC instances unready (so there is no HA at the moment)?


Hello @Antoine,
I would manually destroy that pod and let the operator recreate it so that it forces a fresh SST from one of the other nodes. Yes, I think the operator should handle this. Can you please open a bug report at https://jira.percona.com with all the config files and other info?
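
Roughly something like this, assuming the cluster is called cluster1 and pxc-2 is the failing member (check the real names with kubectl get pods,pvc). If deleting the pod alone does not help, the datadir PVC most likely has to go as well, since it is what keeps the broken data directory around:

kubectl delete pvc datadir-cluster1-pxc-2   # stays in Terminating until the pod releases it (pvc-protection finalizer)
kubectl delete pod cluster1-pxc-2           # the StatefulSet recreates the pod with a fresh PVC and the node rejoins via SST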


Hello @matthewb, thanks for your reply!
I tried deleting the pod, but it is still crash looping with the same error.

I will open a bug report, yes, thanks. In the meantime I'm still happy to take any help. :slight_smile:


I see that the default value of innodb_flush_log_at_trx_commit is 0. Could this be the problem? I have a very long transaction each day (it runs for about an hour), and I think the instance crashed during it.
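
If this turns out to be the cause, I guess I could override it from the chart values. Something like the snippet below is what I have in mind, assuming the 1.9.x chart passes the CR's pxc.configuration field through as a plain my.cnf fragment (I have not verified that yet):

pxc:
  configuration: |
    [mysqld]
    # flush and fsync the InnoDB redo log at every commit instead of roughly once per second
    innodb_flush_log_at_trx_commit = 1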

The other thing that worries me is that all the HAProxy pods in front of the 3 instances are in CrashLoopBackOff. Why? I would expect them to be fine, since 2 of the 3 PXC instances are ready.


Hey @Antoine,

Thanks for raising this. Could you please share your values.yaml from Helm? I would love to try to reproduce this.


Hi @Sergey_Pronin,

The values for the operator are:

resources: {}

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: operator
              app.kubernetes.io/instance: pxc-operator
              app.kubernetes.io/name: pxc-operator
              app.kubernetes.io/part-of: pxc-operator
          topologyKey: kubernetes.io/hostname

The values for the pxc instance are:

finalizers:
  - delete-pxc-pods-in-order
  # Can delete proxysql PVCs, they're recreatable.
  - delete-proxysql-pvc
  # Don't delete database PVCs.
  # - delete-pxc-pvc

upgradeOptions:
  apply: 5.7-recommended

pxc:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 4Gi
  affinity:
    advanced:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: pxc
                  app.kubernetes.io/instance: pxc-db
                  app.kubernetes.io/managed-by: percona-xtradb-cluster-operator
                  app.kubernetes.io/name: percona-xtradb-cluster
                  app.kubernetes.io/part-of: percona-xtradb-cluster
              topologyKey: kubernetes.io/hostname
    antiAffinityTopologyKey: kubernetes.io/hostname
  persistence:
    enabled: true
    storageClass: vsphere-delete
    accessMode: ReadWriteOnce
    size: 100Gi
  disableTLS: true
  clusterSecretName: pxc-custom-secret

haproxy:
  resources:
    requests:
      cpu: "1"
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 1Gi
  affinity:
    advanced:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: haproxy
                  app.kubernetes.io/instance: pxc-db
                  app.kubernetes.io/managed-by: percona-xtradb-cluster-operator
                  app.kubernetes.io/name: percona-xtradb-cluster
                  app.kubernetes.io/part-of: percona-xtradb-cluster
              topologyKey: kubernetes.io/hostname
    antiAffinityTopologyKey: kubernetes.io/hostname

logcollector:
  resources:
    requests:
      cpu: "10m"
      memory: 16Mi
    limits:
      cpu: "100m"
      memory: "128Mi"

backup:
  enabled: true
  storages:
    mys3:
      type: s3
      s3:
        credentialsSecret: pxc-backups-custom-secret
        bucket: mybucket
        region: myregion
        endpointUrl: https://minio-endpoint
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "2"
          memory: 4Gi
  schedule:
    - name: s3-daily-backup
      schedule: "0 0 * * *"
      keep: 10
      storageName: mys3

I manually created pxc-custom-secret, which contains all the users/passwords, and pxc-backups-custom-secret, which holds the S3 credentials.
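
For what it is worth, I created it roughly like this; the key names are the ones I found in the operator's system-users secret, so please double-check them against your own deployment, and the passwords are placeholders:

kubectl create secret generic pxc-custom-secret \
  --from-literal=root=RootPass123 \
  --from-literal=xtrabackup=BackupPass123 \
  --from-literal=monitor=MonitorPass123 \
  --from-literal=clustercheck=ClustercheckPass123 \
  --from-literal=proxyadmin=ProxyAdminPass123 \
  --from-literal=operator=OperatorPass123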

I use chart version 1.9.1 for both the operator and the pxc instance.


Hello @Antoine, have you solved the problem? I'm having this issue right now.