Description:
Our /pgdata directory fills up to 100% when data is loaded into the database.
The disks are ~200 GB in size, which is plenty for the database itself, but during the initial load pg_wal fills the disk completely.
The pg_wal directory occupies more than 150 GB while base is still below 50 GB.
We're using two replicas, and during the load their replication lag is 0 MB, yet pg_wal is still not cleaned up in time.
Which parameters can we use to enforce faster pg_wal archiving/cleanup?
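For reference, these are the PostgreSQL 13 settings that most directly bound pg_wal growth; the values below are illustrative, not recommendations:

```
# postgresql.conf -- illustrative values, tune for your workload
max_wal_size = 4GB               # a checkpoint is forced once roughly this much WAL accumulates
min_wal_size = 1GB               # amount of recycled WAL kept around after cleanup
wal_keep_size = 2GB              # extra WAL retained for streaming replicas (PostgreSQL 13+)
max_slot_wal_keep_size = 50GB    # cap on WAL retained by replication slots (PostgreSQL 13+)
archive_timeout = 300            # force a segment switch at least every 5 minutes
```

Note that WAL segments are only removed after they have been archived successfully, so none of these settings help if archive_command is failing.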
Steps to Reproduce:
Insert a lot of data into a new, empty database.
Version:
Operator 1.4
Postgres version: 13.10
Logs:
Expected Result:
pg_wal segments that are no longer needed for replication should be cleaned up, freeing the disk space.
Actual Result:
The primary database crashes and the database is down.
Additional Information:
The most common reason for the pg_wal directory to fill up is WAL archiving not working.
Please inspect the PostgreSQL logs.
Also, depending on whether archiving is a "push" or a "pull", you may want to query pg_stat_activity and check for connections meant to pull the WAL files. A "push" setup usually means you are using archive_command.
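For example, on the primary you can check the archiver and replication state with the standard statistics views (column names below are as in PostgreSQL 13):

```sql
-- Is archiving keeping up, or failing?
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

-- Which clients are streaming ("pulling") WAL from this server?
SELECT application_name, state, sent_lsn, replay_lsn
FROM pg_stat_replication;
```

A growing failed_count with a stale last_archived_time is the classic signature of a stuck archive_command.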
Hi,
Thank you for your answers.
Unfortunately I cannot see any errors in the PostgreSQL logs.
We're using the Percona Operator with the standard backup configuration it ships with.
So we're using:
archive_command: source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-archive-push-local-s3.sh %p
How is the cleanup configured?
Is there any configuration that can influence backup and cleanup behaviour when using the Percona PostgreSQL Operator?
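For what it's worth, cleanup of the backup repository itself is governed by pgBackRest retention options (with the operator these are usually set through the cluster spec; the option names below are standard pgBackRest, the values illustrative):

```
# pgbackrest.conf (illustrative)
[global]
repo1-retention-full=2    # keep two full backups; older fulls expire
repo1-retention-diff=4    # keep up to four differential backups
```

When a backup expires, the archived WAL that belongs only to it is expired as well, which is what ultimately frees space on the repository side.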
Hi,
Finally I found the root cause.
When I tried to execute the archive command manually I received:
ERROR: [045]: WAL file '0000008D00000410000000CE' already exists in the repo1 archive with a different checksum
command terminated with exit code 45
With that information I removed the affected backup from the pgBackRest repository, and archiving continued as expected.
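The manual checks above can be sketched as follows (the stanza name `db` and the pg_wal path are assumptions; substitute your own, and run these inside the database pod where pgBackRest is configured):

```shell
# Verify archiving end-to-end (archives a test WAL segment)
pgbackrest --stanza=db check

# Show backups and the WAL archive ranges held in the repository
pgbackrest --stanza=db info

# Try pushing the offending segment by hand to reproduce the error
pgbackrest --stanza=db archive-push /pgdata/db/pg_wal/0000008D00000410000000CE
```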
I'm no PostgreSQL expert, however, so I wonder how this could happen.
It looks like WAL files from the same timeline, but with a different checksum, had been archived before.
Maybe the archiving was interrupted by a system downtime.
In any case, three archived backups were affected, and I had to delete all three of them.
I still don't understand the root cause.
At least I'm able to perform a manual cleanup and start another full backup.
best regards,
Martin
Hello,
> I still don't understand the root cause.
I suggest you look at the original timestamps of the WAL files, perhaps using stat, and compare them to the timeline of the various state changes (crashes, failovers, downtimes).
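That timestamp check can be sketched like this (the files here are dummy stand-ins created in a temporary directory, not real WAL segments; on a real cluster you would point stat at $PGDATA/pg_wal instead):

```shell
# Create two stand-in "WAL segments" with known modification times
dir=$(mktemp -d)
touch -d '2024-01-01 00:00:00' "$dir/000000010000000000000001"
touch -d '2024-01-02 00:00:00' "$dir/000000010000000000000002"

# List them oldest-first by mtime; compare these times against the
# timeline of your incidents (downtimes, failovers, crashes)
listing=$(stat --format='%y %n' "$dir"/* | sort)
echo "$listing"

rm -rf "$dir"
```

`stat --format` is GNU coreutils syntax; on BSD/macOS the equivalent is `stat -f`.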