pg_wal is filling up the disk

Description:

Our /pgdata directory reaches 100% usage when data is loaded into the database.
The disks are ~200G in size, which is fine for the database itself, but during the initial load pg_wal fills the disk to 100%.
The pg_wal directory occupies more than 150G, while base is still below 50G.
We're using two replicas, and during the load they show a replication lag of 0 MB, yet pg_wal is still not cleaned up in time.
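
For reference, the 0 MB lag figure and the WAL retention can be checked with the standard system views; a sketch along these lines (run with psql on the primary, nothing operator-specific) shows whether a standby or a replication slot is holding WAL back:

# Replication lag per standby, in bytes behind the current WAL position
psql -c "SELECT application_name, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes FROM pg_stat_replication;"

# Replication slots that could prevent WAL segments from being removed
psql -c "SELECT slot_name, active, wal_status, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes FROM pg_replication_slots;"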

What parameters can we use to enforce faster pg_wal backup/cleanup?

Steps to Reproduce:

Insert a lot of data into a new, empty database.

Version:

Operator 1.4
Postgres version: 13.10

Logs:

Expected Result:

pg_wal segments that are no longer needed for replication should be cleaned up, and the disk should not fill up.

Actual Result:

The primary database crashes once the disk is full, and the database is down.

Additional Information:

The most common reason for the pg_wal directory to fill up is WAL archiving not working.
Please inspect the PostgreSQL logs.

Also, depending on whether it is a "push" or a "pull" setup, you might want to query pg_stat_activity and check for connections that are meant to pull the WALs, although a "push" setup usually means you are using archive_command.
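
For example, queries along these lines (standard monitoring views, run on the primary) show both sides, i.e. who is pulling WAL and whether archive_command is failing:

# Connections that stream ("pull") WAL from this server
psql -c "SELECT pid, application_name, state, client_addr FROM pg_stat_activity WHERE backend_type = 'walsender';"

# Whether archive_command is succeeding or repeatedly failing
psql -c "SELECT archived_count, last_archived_wal, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"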

Hi,
Thank you for your answers.
Unfortunately I cannot see any errors in postgres logs.

We're using the Percona Operator and the standard backup configuration that the operator ships with.
So we’re using:
archive_command: source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-archive-push-local-s3.sh %p
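
As a sanity check we can also look at the repository from inside the database container, since the pgbackrest client is what the archive script above calls. A rough sketch, assuming the default stanza name db (adjust if yours differs):

# Show the stanza, the archive range and the existing backups in the repository
pgbackrest info --stanza=db

# Verify end to end that a WAL segment can be pushed to and read back from the repository
pgbackrest check --stanza=db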

How’s the cleanup configured?
Is there any configuration option to influence backup and cleanup when using the Percona Postgres Operator?

Hi,

Finally I found the root cause.
When I tried to execute the archive command manually, I received:

ERROR: [045]: WAL file '0000008D00000410000000CE' already exists in the repo1 archive with a different checksum
command terminated with exit code 45
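
For reference, the manual invocation looks roughly like this, run as the postgres user from the PostgreSQL data directory and passing the segment path the same way PostgreSQL substitutes %p (the segment name is the one from the error above; the exact data directory path under /pgdata depends on your cluster name):

cd /pgdata/<cluster-name>
source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-archive-push-local-s3.sh pg_wal/0000008D00000410000000CE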

With that information, I removed the affected backup from the pgBackRest repository, and archiving continued as expected.

I'm no PostgreSQL expert, so I wonder how this can happen. It looks like WAL segments from the same timeline, but with a different checksum, had already been archived before.
Maybe the archiving was interrupted by a system downtime.
In any case, I found that 3 archived backups were affected, and I had to delete all three of them.

I still don’t understand the root cause.
At least I’m able to perform a manual cleanup and start another full backup.
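
For anyone hitting the same issue: the manual cleanup and the new full backup can be done with the pgbackrest client itself. A rough sketch, assuming the stanza is called db; the backup label below is only an example, the real labels come from the info output:

# List backups and their labels
pgbackrest info --stanza=db

# Remove one affected backup set by its label (repeat for each affected backup)
pgbackrest expire --stanza=db --set=20230301-010203F

# Start a fresh full backup afterwards
pgbackrest backup --stanza=db --type=full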

best regards,
Martin

Hello,

I still don’t understand the root cause.

I suggest you look at the original timestamps of the WAL files, maybe using stat, and compare them with the timing of the various changes of state (for example the downtime you mentioned).
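
Something along these lines (GNU stat inside the database container; $PGDATA is assumed to point at the data directory, and the 0000008D prefix is the timeline taken from the error) lists the modification time of each segment so it can be lined up against the downtime and other events:

# Modification time of each WAL segment on that timeline, oldest first
stat -c '%y %n' "$PGDATA"/pg_wal/0000008D* | sort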