I recently faced an issue after restoring a single table in our database and I’m hoping for some guidance.
Context:
Environment: Using PostgreSQL 16 with Percona Patroni 3.3.0 cluster for replication.
Task: Restored a single table (~20 GB) using pg_restore.
Duration: The restore process took around 8 hours.
Issue: Post-restore, I noticed that the WAL (Write-Ahead Logging) files are not being deleted from disk. Upon further investigation, I found that the percona_cluster_1 replication slot is several bytes behind and marked as inactive.
WAL files continue to accumulate, consuming significant disk space.
percona_cluster_1 replication slot shows as inactive and is lagging by a few bytes.
No apparent replication activity to clear or advance the slot.
There are several things required before WAL file can be removed.
WAL files required during crash recovery cannot be archived. So if you have some very long transaction, it will prevent your WALs from being removed, because entire transaction will have to be rolled back, so all this information will have to be read from WALs.
Checkpoint has to be completed on a WAL file before it can be removed.
It have to be archived before it can be removed, so if your archive_command is not working properly it may cause WAL files to pile up.
Replication slots with “wal_status” as reserved, also prevent WALs from being removed. If a slot is inactive, and you do not plan to use it for anything currently, it may be smart to drop the replication slot, and create new one when you will need it again.
wal_keep_segments / wal_keep_size / min_wal_size parameters makes sure a certain number of WAL files is always present in pg_wal directory.
To answer your questions:
See above reply, regarding required steps before WAL can be removed.
To make slot active again, you would need to connect some replica, or application that uses the slot, to consume changes and advance it further. If you cannot do that, just drop the slot and recreate it when you will need it again.
select pg_drop_replication_slot('name');
You never manually remove any WAL files. Make sure steps from above are done, so WALs are archived, there was a checkpoint run, there are no inactive replication slots, etc.
Yes, in many cases it is “normal”, if some of above conditions for WAL removal are not met.
Query pg_stat_replication view. If this is a physical streaming replication you will see all information about your replicas.
select * from pg_stat_replication;
state should be “streaming”
You will also find information about transactions which were sent, written, flushed and replayed on the replica, and information about lags for each of them.
Hi @mateusz.henicz, thank you for the detailed answer. The replication slot was marked as wal_status: extended, with several bytes behind. I tried issuing CHECKPOINT commands, but nothing got Postgres to update that slot. I have finally dropped the replication slot in question, following your recommendation, but it was automatically created soon after. It has been 20 hours since I did that, and the slot is again several bytes behind, and marked as active: false and wal_status: extended.
Which brings the question: Is something wrong with my Patroni cluster configuration? The replicas are all marked as stable, and their data seems fine.
It looks to be an additional physical replication slot added to your patroni configuration.
You should have something like
slots:
percona_cluster_1:
type: physical
in your patroni config.
Sorry if the syntax is not exact, I did that from memory, but if you have this in your patroni configuration, please delete it, reload configuration, using patronictl -c /config/file.yml reload and see if the slot disappears.
Hey,
Yes, it is completely not needed. It probably got there by mistake.
I have sent a Pull Request with changes that remove the replication slot definition from a configuration file.
Thanks for pointing that out!