Description:
Our /pgdata directory fills up to 100% when data is loaded into the database.
The disks are ~200 GB in size, which is plenty for the database itself, but during the initial load pg_wal fills the disk completely.
The pg_wal directory occupies more than 150 GB while base is still below 50 GB.
We're using two replicas, and during the load their replication lag is 0 MB, yet pg_wal is still not cleaned up in time.
Which parameters can we use to enforce faster pg_wal archiving/cleanup?
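For reference, these are the PostgreSQL 13 settings that most directly bound pg_wal growth; the values below are illustrative, not recommendations:

```
# postgresql.conf -- illustrative values, tune for your workload
max_wal_size = 4GB               # a checkpoint is forced once roughly this much WAL accumulates
min_wal_size = 1GB               # amount of recycled WAL kept around after cleanup
wal_keep_size = 2GB              # extra WAL retained for streaming replicas (PostgreSQL 13+)
max_slot_wal_keep_size = 50GB    # cap on WAL retained by replication slots (PostgreSQL 13+)
archive_timeout = 300            # force a segment switch at least every 5 minutes
```

Note that WAL segments are only removed after they have been archived successfully, so none of these settings help if archive_command is failing.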
Steps to Reproduce:
Insert a lot of data into a new, empty database.
Version:
Operator 1.4
Postgres version: 13.10
Logs:
Expected Result:
pg_wal segments that are no longer needed for replication should be cleaned up, freeing the disk space.
Actual Result:
The primary database crashes and the database is down.
Additional Information:
The most common reason for the pg_wal directory to fill up is WAL archiving not working.
Please inspect the PostgreSQL logs.
Also, depending on whether archiving is a "push" or a "pull", you may want to query pg_stat_activity and check for connections meant to pull the WAL files. A "push" setup usually means you are using archive_command.
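For example, on the primary you can check the archiver and replication state with the standard statistics views (column names below are as in PostgreSQL 13):

```sql
-- Is archiving keeping up, or failing?
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

-- Which clients are streaming ("pulling") WAL from this server?
SELECT application_name, state, sent_lsn, replay_lsn
FROM pg_stat_replication;
```

A growing failed_count with a stale last_archived_time is the classic signature of a stuck archive_command.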
Hi,
Thank you for your answers.
Unfortunately I cannot see any errors in the PostgreSQL logs.
We're using the Percona Operator with the standard backup configuration it ships with.
So we're using:
archive_command: source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-archive-push-local-s3.sh %p
How is the cleanup configured?
Is there any configuration that can influence backup and cleanup behaviour when using the Percona PostgreSQL Operator?
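For what it's worth, cleanup of the backup repository itself is governed by pgBackRest retention options (with the operator these are usually set through the cluster spec; the option names below are standard pgBackRest, the values illustrative):

```
# pgbackrest.conf (illustrative)
[global]
repo1-retention-full=2    # keep two full backups; older fulls expire
repo1-retention-diff=4    # keep up to four differential backups
```

When a backup expires, the archived WAL that belongs only to it is expired as well, which is what ultimately frees space on the repository side.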
Hi,
Finally I found the root cause.
When I tried to execute the archive command manually I received:
ERROR: [045]: WAL file '0000008D00000410000000CE' already exists in the repo1 archive with a different checksum
command terminated with exit code 45
With that information I removed the affected backup from the pgBackRest repository, and archiving continued as expected.
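The manual checks above can be sketched as follows (the stanza name `db` and the pg_wal path are assumptions; substitute your own, and run these inside the database pod where pgBackRest is configured):

```shell
# Verify archiving end-to-end (archives a test WAL segment)
pgbackrest --stanza=db check

# Show backups and the WAL archive ranges held in the repository
pgbackrest --stanza=db info

# Try pushing the offending segment by hand to reproduce the error
pgbackrest --stanza=db archive-push /pgdata/db/pg_wal/0000008D00000410000000CE
```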
I'm no PostgreSQL expert, however, so I wonder how this could happen.
It looks like WAL files from the same timeline, but with a different checksum, had been archived before.
Maybe the archiving was interrupted by a system downtime.
In any case, three archived backups were affected, and I had to delete all three of them.
I still don't understand the root cause.
At least I'm able to perform a manual cleanup and start another full backup.
best regards,
Martin
Hello,
> I still don't understand the root cause.
I suggest you look at the original timestamps of the WAL files, perhaps using stat, and compare them to the timeline of the various state changes (crashes, failovers, downtimes).
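That timestamp check can be sketched like this (the files here are dummy stand-ins created in a temporary directory, not real WAL segments; on a real cluster you would point stat at $PGDATA/pg_wal instead):

```shell
# Create two stand-in "WAL segments" with known modification times
dir=$(mktemp -d)
touch -d '2024-01-01 00:00:00' "$dir/000000010000000000000001"
touch -d '2024-01-02 00:00:00' "$dir/000000010000000000000002"

# List them oldest-first by mtime; compare these times against the
# timeline of your incidents (downtimes, failovers, crashes)
listing=$(stat --format='%y %n' "$dir"/* | sort)
echo "$listing"

rm -rf "$dir"
```

`stat --format` is GNU coreutils syntax; on BSD/macOS the equivalent is `stat -f`.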