Error: could not find previous WAL record

Description:

This is the second time I’ve encountered this. The first time I restored the databases from backup, but I’m sure there is an easier way.

I really hope someone can help me understand how to deal with this so I won’t have to restore each time this happens.

I upgraded some Kubernetes components. Volumes are managed by Longhorn, which moves replicas around as I updated Talos Linux and Kubernetes to 1.31.2. I always wait for volumes to go from degraded back to healthy before upgrading the next node in the cluster.

Postgres does not seem to like this very much; this is the second time an instance won’t start. I have included the full logs with the errors.

Steps to Reproduce:

I’ve upgraded before and moved pods around, and it usually “just works”. This is the second time I’ve hit this, and I would like to prevent it from happening again.

Version:

2.5.0 with PG 16.4.

Logs:

❯ kc logs pods/cluster1-instance1-qqwv-0
Defaulted container "database" out of: database, replication-cert-copy, pgbackrest, pgbackrest-config, postgres-startup (init), nss-wrapper-init (init)
2024-11-17 16:03:10,507 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-11-17 16:03:10,521 WARNING: Postgresql is not running.
2024-11-17 16:03:10,521 INFO: Lock owner: cluster1-instance1-knqr-0; I am cluster1-instance1-qqwv-0
2024-11-17 16:03:10,526 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7427460329854083148
  Database cluster state: shut down
  pg_control last modified: Sun Nov 17 15:45:21 2024
  Latest checkpoint location: 89/EE000028
  Latest checkpoint's REDO location: 89/EE000028
  Latest checkpoint's REDO WAL file: 0000000800000089000000EE
  Latest checkpoint's TimeLineID: 8
  Latest checkpoint's PrevTimeLineID: 8
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:300718
  Latest checkpoint's NextOID: 44743
  Latest checkpoint's NextMultiXactId: 4
  Latest checkpoint's NextMultiOffset: 8
  Latest checkpoint's oldestXID: 722
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 730
  Latest checkpoint's newestCommitTsXid: 300717
  Time of latest checkpoint: Sun Nov 17 15:45:21 2024
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: logical
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: on
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 368640f0e0bb1c5e7d2e67b9f69c6aba1786c4cc5d9c24cdfb7b7df5e0bf31b4

2024-11-17 16:03:10,527 INFO: Lock owner: cluster1-instance1-knqr-0; I am cluster1-instance1-qqwv-0
2024-11-17 16:03:10,545 INFO: Local timeline=8 lsn=89/EE000028
2024-11-17 16:03:10,560 INFO: primary_timeline=10
2024-11-17 16:03:10,561 INFO: primary: history=5	88/FC0000A0	no recovery target specified
6	89/F0000A0	no recovery target specified
7	89/1F004F10	no recovery target specified
8	89/ED0000E8	no recovery target specified
9	89/F70000A0	no recovery target specified
2024-11-17 16:03:10,614 INFO: running pg_rewind from cluster1-instance1-knqr-0
2024-11-17 16:03:10,651 INFO: running pg_rewind from dbname=postgres user=_crunchyrepl host=cluster1-instance1-knqr-0.cluster1-pods port=5432 sslmode=verify-ca sslcert=/tmp/replication/tls.crt sslkey=/tmp/replication/tls.key sslrootcert=/tmp/replication/ca.crt target_session_attrs=read-write
2024-11-17 16:03:10,707 INFO: pg_rewind exit code=1
2024-11-17 16:03:10,707 INFO:  stdout=
2024-11-17 16:03:10,707 INFO:  stderr=pg_rewind: servers diverged at WAL location 89/ED0000E8 on timeline 8
pg_rewind: error: could not find previous WAL record at 89/ED0000E8: record with incorrect prev-link E09E4000/E09EB000 at 89/ED0000E8

2024-11-17 16:03:10,714 ERROR: Failed to rewind from healty primary: cluster1-instance1-knqr-0
2024-11-17 16:03:10,730 WARNING: Postgresql is not running.
2024-11-17 16:03:10,730 INFO: Lock owner: cluster1-instance1-knqr-0; I am cluster1-instance1-qqwv-0
2024-11-17 16:03:10,735 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7427460329854083148
  Database cluster state: shut down
  pg_control last modified: Sun Nov 17 15:45:21 2024
  Latest checkpoint location: 89/EE000028
  Latest checkpoint's REDO location: 89/EE000028
  Latest checkpoint's REDO WAL file: 0000000800000089000000EE
  Latest checkpoint's TimeLineID: 8
  Latest checkpoint's PrevTimeLineID: 8
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:300718
  Latest checkpoint's NextOID: 44743
  Latest checkpoint's NextMultiXactId: 4
  Latest checkpoint's NextMultiOffset: 8
  Latest checkpoint's oldestXID: 722
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 730
  Latest checkpoint's newestCommitTsXid: 300717
  Time of latest checkpoint: Sun Nov 17 15:45:21 2024
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: logical
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: on
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 368640f0e0bb1c5e7d2e67b9f69c6aba1786c4cc5d9c24cdfb7b7df5e0bf31b4

2024-11-17 16:03:10,736 INFO: Lock owner: cluster1-instance1-knqr-0; I am cluster1-instance1-qqwv-0
2024-11-17 16:03:10,737 INFO: starting as a secondary
2024-11-17 16:03:11.145 UTC [98] LOG:  pgaudit extension initialized
2024-11-17 16:03:11,167 INFO: postmaster pid=98
/tmp/postgres:5432 - no response
2024-11-17 16:03:11.351 UTC [98] LOG:  redirecting log output to logging collector process
2024-11-17 16:03:11.351 UTC [98] HINT:  Future log output will appear in directory "log".
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - no response
2024-11-17 16:03:14,311 INFO: Lock owner: cluster1-instance1-knqr-0; I am cluster1-instance1-qqwv-0
2024-11-17 16:03:14,340 INFO: failed to start postgres

Expected Result:

A nicely running postgres instance without errors :slight_smile:

Actual Result:

The database must have gotten corrupted or been shut down uncleanly, and the instance comes back with:

pg_rewind: error: could not find previous WAL record at 89/ED0000E8: record with incorrect prev-link E09E4000/E09EB000 at 89/ED0000E8
2024-11-17 16:03:10,714 ERROR: Failed to rewind from healty primary: cluster1-instance1-knqr-0

Additional Information:

Perhaps it’s safer in the future to bring down the Postgres cluster first, but that kind of defeats the purpose of running an HA setup with 3 instances.

I have been unable to recover using the guides I found, so I will restore once again.

If anyone reads this and has some guidance, I would still appreciate a reply as it will help me in the future.
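
In case it helps anyone searching later: what I was hoping to do, based on my reading of the Patroni docs, was to throw away the broken replica’s data and have it re-clone from the healthy primary instead of restoring the whole cluster. A rough, untested sketch of that idea (the member name is taken from the logs above; the Patroni cluster name is a placeholder, and I’m not sure this is the supported way to do it with the operator):

❯ kc exec -it cluster1-instance1-knqr-0 -c database -- patronictl list
❯ kc exec -it cluster1-instance1-knqr-0 -c database -- patronictl reinit <patroni-cluster-name> cluster1-instance1-qqwv-0 --force

As I understand it, patronictl reinit is supposed to wipe the member’s data directory and take a fresh copy from the leader, which sounds like exactly what pg_rewind refused to do here. If that is the wrong approach, corrections are very welcome.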

Thanks!

@Fragment2 can you tell us more about what you did to get into this state (or whether you have any other good way to reproduce the issue)?

For us to fix it, we need to understand what is causing this. It is clear that there is some issue with data consistency, but normally restarting nodes should not cause it, so I assume it has something to do with the storage changes you are performing.


Hi Sergey, thanks, I appreciate the reply.

When I bring nodes into maintenance for updates or other things, I drain them first.

I use Longhorn as the storage provider, and volumes on drained nodes need to be moved to another node. Depending on the size this takes a couple of minutes at most; for Postgres it’s less than 4 GB, so it’s pretty quick. The pod needs to move as well after the volume has been migrated, so a new pod spins up and the old one is destroyed.
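
Roughly, the per-node routine looks like this (reconstructed from memory, so the exact flags and the Longhorn resource names may be a bit off):

❯ kc drain <node> --ignore-daemonsets --delete-emptydir-data
# wait here until Longhorn shows every volume as healthy again
❯ kc -n longhorn-system get volumes.longhorn.io
# upgrade/reboot the node, then let it take workloads again
❯ kc uncordon <node>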

This usually works well; I do updates regularly. At one of these moments I got the errors above and the cluster wouldn’t come up anymore. I am not a DBA, so I searched for the errors and tried some commands and solutions, but nothing fixed it. It was quicker to restore, which takes me about 10 minutes since it’s documented well: basically copy/paste and change some passwords here and there.

The last time, a node was taken offline before it was fully drained, and all the volumes it held were marked degraded. The node didn’t crash; it just held a very large replica, ran out of time, and simply took all its replicas offline to reboot. If it held the master, a new pod should have been spun up on a node with a healthy replica. I don’t know if you’re familiar with Longhorn, but it stores 3 replicas across different nodes (I configured it to disallow multiple replicas on the same node or disk). Basically, with one replica down there is no need for panic; two more nodes contain perfectly good data. Still, the database became corrupt with the exact same messages as the first time. I don’t know whether the node that went down held the master pod, but it should still have shut down the pod cleanly and flushed the disk.

So, yeah, I don’t know. I can imagine the DB got corrupted the second time, but the first time the shutdowns were really clean.

I understand it’s not your job to educate me on database recovery matters :smiley: but I would love to get some tips on the errors above and how to handle them more gracefully. Searches and LLMs didn’t help, unfortunately. Perplexity suggested disabling journaling and triggering an auto-recovery, but after an hour it was still showing the same messages in a loop.
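
For what it’s worth, the broken replica’s data directory is still around, so I’m happy to run diagnostics against it. My guess (and it is only a guess) is that something like pg_waldump on the segment containing the divergence point would show what is actually sitting at 89/ED0000E8 (assuming the data directory lives under /pgdata/pg16 as in the operator’s default layout, and that I worked out the segment name correctly from the LSN in the logs):

❯ kc exec -it cluster1-instance1-qqwv-0 -c database -- \
    pg_waldump -p /pgdata/pg16/pg_wal 0000000800000089000000ED

If there is a better way to collect useful information for you, just say the word.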