Hi, we had a cluster crash 8 days ago which I thought was resolved after running kubectl -n pxc exec cluster1-pxc-2 -c pxc -- sh -c 'kill -s USR1 1', but something is still not quite right.
It’s a 3-node cluster: on cluster1-pxc-0 and cluster1-pxc-1 the /var/lib/mysql folder is 66% full, but on cluster1-pxc-2 it is 100% full. Looking in the /var/lib/mysql folder of cluster1-pxc-2, there are many binlog, cluster1-pxc-2-relay-bin and GRA_ files. The other two nodes only have 7 binlog files, but the third node has 18.
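For reference, this is roughly how I looked at the node from outside the pod (just plain shell via kubectl exec, nothing operator-specific):

kubectl -n pxc exec cluster1-pxc-2 -c pxc -- df -h /var/lib/mysql
kubectl -n pxc exec cluster1-pxc-2 -c pxc -- sh -c 'ls -lh /var/lib/mysql | grep -E "binlog|relay-bin|GRA_"'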
Earlier today I found the cluster refusing new connections with “too many connections” errors.
Connecting locally as root on cluster1-pxc-0 and running show processlist; showed 138 connections, most of them trying to insert data into a table and stuck in the state “wsrep: replicating and certifying write set”.
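This is roughly what I ran to compare the connection count against the limit (the root password comes from the cluster secret; cluster1-secrets and its root key are the defaults as far as I know, adjust if yours differ):

ROOT_PW=$(kubectl -n pxc get secret cluster1-secrets -o jsonpath='{.data.root}' | base64 --decode)
kubectl -n pxc exec cluster1-pxc-0 -c pxc -- mysql -uroot -p"$ROOT_PW" \
  -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL VARIABLES LIKE 'max_connections';"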
I then realized that cluster1-pxc-2 had the disk full.
I tried increasing spec.pxc.volumeSpec.persistentVolumeClaim.resources.requests.storage in deploy/cr.yaml from 10G to 12G and applied it with kubectl, but the PVCs for the cluster are still set to 10G.
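What I did, roughly (the PVC names assume the operator's default datadir-<pod> naming in my cluster):

kubectl -n pxc apply -f deploy/cr.yaml
kubectl -n pxc get pvc   # datadir-cluster1-pxc-0/1/2 still reported the old 10G capacity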
I terminated the cluster1-pxc-2 pod, and when it restarted 148 MB were free in /var/lib/mysql, so right now the cluster is up, but the free space is slowly going down.
Version:
Percona Operator for MySQL based on Percona XtraDB Cluster v1.16.1
Logs:
> kubectl get pxc -n pxc
NAME ENDPOINT STATUS PXC PROXYSQL HAPROXY AGE
cluster1 10.22.208.29 ready 3 3 337d
> kubectl get pods -n pxc
NAME READY STATUS RESTARTS AGE
cluster1-haproxy-0 2/2 Running 24 (7d21h ago) 29d
cluster1-haproxy-1 2/2 Running 0 7d20h
cluster1-haproxy-2 2/2 Running 24 (7d21h ago) 29d
cluster1-pxc-0 1/1 Running 1 (7d6h ago) 7d20h
cluster1-pxc-1 1/1 Running 2 (7d6h ago) 29d
cluster1-pxc-2 1/1 Running 1 (29m ago) 29m
percona-xtradb-cluster-operator-6d75d68c9d-s7v8s 1/1 Running 0 7d20h
xb-cron-cluster1-fs-pvc-202521211016-cv29g-g7zcm 0/1 Completed 0 7d6h
xb-cron-cluster1-fs-pvc-202521311016-cv29g-2dtfb 0/1 Completed 0 6d10h
xb-cron-cluster1-fs-pvc-202521411016-cv29g-xk92t 0/1 Completed 0 5d10h
xb-cron-cluster1-fs-pvc-202521511016-cv29g-qhqch 0/1 Completed 0 4d10h
xb-cron-cluster1-fs-pvc-202521611016-cv29g-b4xt7 0/1 Completed 0 3d10h
xb-cron-cluster1-fs-pvc-202521711016-cv29g-hq8t2 0/1 Completed 0 2d10h
xb-cron-cluster1-fs-pvc-202521811016-cv29g-4xtpl 0/1 Completed 0 34h
xb-cron-cluster1-fs-pvc-202521911016-cv29g-lvqlq 0/1 Completed 0 10h
> kubectl get events -n pxc
LAST SEEN TYPE REASON OBJECT MESSAGE
40m Normal Killing pod/cluster1-pxc-2 Stopping container pxc
30m Warning Unhealthy pod/cluster1-pxc-2 Readiness probe failed: + [[ '' == \P\r\i\m\a\r\y ]]...
30m Normal Scheduled pod/cluster1-pxc-2 Successfully assigned pxc/cluster1-pxc-2 to k8s10444-workers-b279j-6xp7w-xpk2f
30m Normal Pulling pod/cluster1-pxc-2 Pulling image "percona/percona-xtradb-cluster-operator:1.16.1"
30m Normal Pulled pod/cluster1-pxc-2 Successfully pulled image "percona/percona-xtradb-cluster-operator:1.16.1" in 566ms (566ms including waiting)
30m Normal Created pod/cluster1-pxc-2 Created container pxc-init
30m Normal Started pod/cluster1-pxc-2 Started container pxc-init
29m Normal Pulling pod/cluster1-pxc-2 Pulling image "docker-remote.binrepo.example.com/percona/percona-xtradb-cluster:8.0.39-30.1"
30m Normal Pulled pod/cluster1-pxc-2 Successfully pulled image "docker-remote.binrepo.example.com/percona/percona-xtradb-cluster:8.0.39-30.1" in 1.664s (1.664s including waiting)
29m Normal Created pod/cluster1-pxc-2 Created container pxc
29m Normal Started pod/cluster1-pxc-2 Started container pxc
29m Normal Pulled pod/cluster1-pxc-2 Successfully pulled image "docker-remote.binrepo.example.com/percona/percona-xtradb-cluster:8.0.39-30.1" in 1.093s (1.093s including waiting)
29m Warning Unhealthy pod/cluster1-pxc-2 Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.5.180:33062' (111)...
30m Warning RecreatingFailedPod statefulset/cluster1-pxc StatefulSet pxc/cluster1-pxc is recreating failed Pod cluster1-pxc-2
30m Normal SuccessfulDelete statefulset/cluster1-pxc delete Pod cluster1-pxc-2 in StatefulSet cluster1-pxc successful
30m Normal SuccessfulCreate statefulset/cluster1-pxc create Pod cluster1-pxc-2 in StatefulSet cluster1-pxc successful
Thank you, I was able to expand the PVCs, so that bought me some time. I was missing the enableVolumeExpansion: true option.
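For anyone finding this later: I'm not sure this is the only way to set it, but a merge patch like the one below should toggle the flag without re-editing cr.yaml, assuming it lives at spec.enableVolumeExpansion as it does in my cr.yaml; after that, bumping spec.pxc.volumeSpec.persistentVolumeClaim.resources.requests.storage and re-applying cr.yaml resized the PVCs for me.

kubectl -n pxc patch pxc cluster1 --type=merge -p '{"spec":{"enableVolumeExpansion":true}}'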
Now it remains to figure out why cluster1-pxc-2 is not deleting the old logs, which is what caused it to run out of space.
Or were you recommending that I follow the “Manual scaling without Volume Expansion capability” steps as a way to delete cluster1-pxc-2 and its storage and have it recreated?
If you use an operator version higher than 1.14.0, you can use automated scaling with the Volume Expansion capability. Manual scaling is also an option, for sure.
As I understand it, you expanded the PVCs on all nodes. Do you still have the problem with logs on cluster1-pxc-2?
Yes, I expanded the PVCs on all nodes using the automated scaling, but I still have the problem with logs on cluster1-pxc-2. There are many more binlog.000XXX files on that node.
The oldest binlog on the other nodes is from Feb 12 2025, but on cluster1-pxc-2 it’s from Feb 4 2025.
As far as I can tell, replication is fine (I connected to each node individually and ran select queries and they all show the same data, and I can’t spot anything wrong in the show status like ‘wsrep_%’ output).
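For completeness, this is roughly the per-node check, looped over the pods (again assuming the default cluster1-secrets secret for the root password):

ROOT_PW=$(kubectl -n pxc get secret cluster1-secrets -o jsonpath='{.data.root}' | base64 --decode)
for pod in cluster1-pxc-0 cluster1-pxc-1 cluster1-pxc-2; do
  kubectl -n pxc exec "$pod" -c pxc -- mysql -uroot -p"$ROOT_PW" \
    -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_local_state_comment','wsrep_ready');"
done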
So perhaps if I delete the cluster1-pxc-2 node and its PVC using the procedure explained in “Manual scaling without Volume Expansion capability” it will fix itself… or create a singularity that will wipe everything.
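What I have in mind is roughly the two commands below; the actual steps in the operator docs take precedence, and the PVC name assumes the default datadir-<pod> naming:

# the PVC stays Terminating while the pod still mounts it; once the pod is deleted,
# the statefulset recreates both and the node should rejoin via a full SST
kubectl -n pxc delete pvc datadir-cluster1-pxc-2 --wait=false
kubectl -n pxc delete pod cluster1-pxc-2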
Thanks for looking into this.
Yes, using the PXC cluster.
I didn’t explicitly enable binlogs on it; I thought they came standard. Months ago I reduced binlog_expire_logs_seconds to 14 days instead of the default 30, because I didn’t see the usefulness of keeping them longer. The first two nodes hold 14 days of logs, with a new log file created roughly every two days, while the third node holds many more days.
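14 days is 1209600 seconds. This is roughly how I'd confirm the value each node is actually running with (assuming the default cluster1-secrets secret for the root password):

ROOT_PW=$(kubectl -n pxc get secret cluster1-secrets -o jsonpath='{.data.root}' | base64 --decode)
for pod in cluster1-pxc-0 cluster1-pxc-1 cluster1-pxc-2; do
  kubectl -n pxc exec "$pod" -c pxc -- mysql -uroot -p"$ROOT_PW" \
    -e "SELECT @@hostname, @@binlog_expire_logs_seconds, @@log_bin;"
done
# expecting 1209600 (14 days) everywhere; the MySQL 8.0 default would be 2592000 (30 days)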
This is the deploy/cr.yaml I’ve been using, with minimal redactions
Got it - looks like binlog_expire_logs_seconds is the replacement for expire_logs_days (https://dev.mysql.com/worklog/task/?id=10924), so I should be all set on that front (and it works as expected on two of the three nodes).
Before that, check the binary logs with the command below:
show binary logs;
If those files are listed there and the required parameters are in place, then flush logs will clear all the obsolete files. If they are not listed, they have been removed from the mysql-bin.index file but not from disk, and you need to remove them manually.
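Something along these lines, run against the problem node, should do it (the purge statement is only needed if the expiry setting alone does not clean things up; each PXC node manages its own binary logs, so this only affects cluster1-pxc-2; the cluster1-secrets secret name is the default, adjust if needed):

ROOT_PW=$(kubectl -n pxc get secret cluster1-secrets -o jsonpath='{.data.root}' | base64 --decode)
kubectl -n pxc exec cluster1-pxc-2 -c pxc -- mysql -uroot -p"$ROOT_PW" -e "SHOW BINARY LOGS;"
kubectl -n pxc exec cluster1-pxc-2 -c pxc -- mysql -uroot -p"$ROOT_PW" -e "FLUSH BINARY LOGS;"
# if old files are still listed afterwards, purge them explicitly:
kubectl -n pxc exec cluster1-pxc-2 -c pxc -- mysql -uroot -p"$ROOT_PW" -e "PURGE BINARY LOGS BEFORE NOW() - INTERVAL 14 DAY;"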