Backup failure, missing sst_info

Hey folks!

First of all, what an amazing tool the PXC Operator is! Good job!!

I’m quite new to this land of Galera, XtraDB Cluster, XtraDB backups, and so on, so chances are that I’m doing something wrong. I was hoping I could get some guidance.

Summary

While testing backups, I’m running into an issue where sst_info seems to be missing in /tmp.

Context

Environment

  • GKE cluster running Kubernetes 1.22
  • Operator version 1.11.0
  • Backup image 1.11.0-pxc8.0-backup
  • Backup storage: GCS in S3 interoperability mode, with a Secret created from a valid access key and secret key (a sketch of that Secret follows this list)
  • PXC image 8.0.19-10.1
  • 3 PXC replicas and 3 HAProxy replicas
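
For completeness, this is roughly what the credentials Secret looks like (a sketch with placeholder values; as far as I understand, the operator expects the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY keys for s3-type storage, here holding the GCS HMAC interoperability keys):

apiVersion: v1
kind: Secret
metadata:
  name: mysql-backup-creds
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: REDACTED
  AWS_SECRET_ACCESS_KEY: REDACTED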

Kubernetes Manifests

# [...]
storages:
  bucket:
    s3:
      bucket: REDACTED
      credentialsSecret: mysql-backup-creds
      endpointUrl: https://storage.googleapis.com/
      region: us-central1
    type: s3

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  name: my-test-backup
spec:
  pxcCluster: my-db
  storageName: bucket

Troubleshooting

This is the cluster view since the backup started (and it still looks like this, as the backup is stuck):

mysql> SELECT * FROM performance_schema.pxc_cluster_view;
+-------------+--------------------------------------+--------+-------------+---------+
| HOST_NAME   | UUID                                 | STATUS | LOCAL_INDEX | SEGMENT |
+-------------+--------------------------------------+--------+-------------+---------+
| my-db-pxc-2 | 06f4724b-49af-11ed-8292-a7b0a16e15a3 | SYNCED |           0 |       0 |
| garb        | 47ca8f34-49af-11ed-a18f-c649d2df1f12 | JOINER |           1 |       0 |
| my-db-pxc-0 | d5645f2f-49ae-11ed-92e9-4a30749a91f7 | SYNCED |           2 |       0 |
| my-db-pxc-1 | ee351090-49ae-11ed-8c85-463be0b37fd1 | SYNCED |           3 |       0 |
+-------------+--------------------------------------+--------+-------------+---------+
4 rows in set (0,12 sec)

The backup pod created by the backup job is stuck at:

2022-10-11 21:54:44.857  INFO: [SST script] + socat -u TCP-LISTEN:4444,reuseaddr,retry=30 stdio

Scrolling up in the backup pod logs:

2022-10-11 21:54:44.520  INFO: [SST script] + xbstream -C /tmp -c sst_info

2022-10-11 21:54:44.515  INFO: 0.0 (my-db-pxc-2): State transfer to 1.0 (garb) complete.
2022-10-11 21:54:44.521  INFO: Member 0.0 (my-db-pxc-2) synced with group.
2022-10-11 21:54:44.513  INFO: [SST script] Socat was started

2022-10-11 21:54:44.525  INFO: [SST script] xbstream: Can't get stat of 'sst_info' (OS errno 2 - No such file or directory)

2022-10-11 21:54:44.852  INFO: [SST script] 221011 21:54:44 xbcloud: Successfully connected.

2022-10-11 21:54:44.852  INFO: [SST script] 221011 21:54:44 xbcloud: Upload failed.

Exec’ing into the backup pod and executing:

xbstream -C /tmp -c sst_info

I can reproduce the issue above:

sh-4.4$ xbstream -C /tmp -c sst_info
xbstream: Can't get stat of 'sst_info' (OS errno 2 - No such file or directory)

Navigating the /tmp folder, I can see some MySQL-related files: the test database I created being backed up, binlogs, xtrabackup files, my.cnf backups, etc.

Is there anything I’m doing wrong? Or where should I look for more troubleshooting information?

Thanks again!!


Hey folks!

I found the issue! And it turns out that the missing sst_info was the symptom and not the root cause :sweat_smile:

So, we’re using Linkerd as our service mesh, and we have a policy rule that does not inject Linkerd into pods created by a Job. The backup pod listens on port 4444 and, as I mentioned above, it was stuck at:

socat -u TCP-LISTEN:4444,reuseaddr,retry=30 stdio

So I suspected it was waiting for packets that never arrived, and that Linkerd was sitting in the middle of the connection.

I added port 4444 to the list of ports Linkerd skips for outbound connections, and it worked! A rough sketch of what I mean is below.
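
In case it helps anyone else, this is roughly what a per-pod version of that skip could look like. It’s only a sketch: the exact placement depends on how you configure Linkerd, and it assumes the operator’s pxc.annotations field passes annotations through to the PXC pods (the donor pod is the one making the outbound connection to port 4444 when streaming the SST to the backup pod):

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: my-db
spec:
  pxc:
    annotations:
      config.linkerd.io/skip-outbound-ports: "4444"
# [...]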

So I think we can close this, though it might be interesting to others!
