PostgreSQL fails to start: missing "archive.info" file

Hi,

We accidentally lost files from our pgBackRest repository (backrestrepo)! We had cluster issues that stopped the database the hard way, and during startup the database tries to restore from the repository:

LOG

FileMissingError: raised from remote-0 ssh protocol on 'zabbix-ha-db-pg-db-backrest-shared-repo': unable to open missing file '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info' for read
      FileMissingError: raised from remote-0 ssh protocol on 'zabbix-ha-db-pg-db-backrest-shared-repo': unable to open missing file '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info.copy' for read
      HINT: archive.info cannot be opened but is required to push/get WAL segments.
      HINT: is archive_command configured correctly in postgresql.conf?
      HINT: has a stanza-create been performed?
      HINT: use --no-archive-check to disable archive checks during backup if you have an alternate archiving scheme.
ERROR: [103]: unable to find a valid repository
2024-01-03 10:45:32.222 UTC [48874] FATAL:  the database system is starting up
2024-01-03 10:45:33.243 UTC [48879] FATAL:  the database system is starting up
2024-01-03 10:45:33.472 UTC [48881] FATAL:  the database system is starting up

Please advise: can we somehow start it while ignoring the archive and backup steps, so that we can take a dump and migrate to a clean setup with the existing database content?

Thanks in advance


Hi,

I’ll be honest, this has me concerned …

ERROR: [103]: unable to find a valid repository

First task: locate and physically review your WAL files.
Next: review your Postgres logs and identify the specific WAL segment on which recovery stalls.
Finally: if you are indeed missing WAL files, then under the right conditions it is still possible to recover your STANDBY using pg_rewind.
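A minimal sketch of the first two steps, assuming a PostgreSQL 10+ data directory (WAL lives in pg_wal/) and that you run it as the postgres user on the node. latest_wal lists the newest WAL segment actually on disk, and checkpoint_wal uses pg_controldata to show the cluster state and the WAL file replay has to start from. Both function names are hypothetical helpers, not part of PostgreSQL or Patroni.

```shell
#!/usr/bin/env bash
# Sketch only: helpers to inspect WAL state in a PostgreSQL 10+ data directory.

# Print the newest WAL segment file present in $1/pg_wal.
# Segment names are 24 hex digits (timeline, log, segment), so a plain
# byte-order sort puts the highest timeline and position last.
# .history, .partial and archive_status entries are filtered out.
latest_wal() {
  local pgdata="${1:?usage: latest_wal PGDATA}"
  ls -1 "$pgdata/pg_wal" | grep -E '^[0-9A-F]{24}$' | LC_ALL=C sort | tail -n 1
}

# Print the cluster state and the WAL file that replay starts from,
# as recorded in pg_control.
checkpoint_wal() {
  local pgdata="${1:?usage: checkpoint_wal PGDATA}"
  pg_controldata "$pgdata" |
    grep -E "^(Database cluster state|Latest checkpoint's REDO WAL file):"
}
```

For example: latest_wal /var/lib/postgresql/data and checkpoint_wal /var/lib/postgresql/data. If the REDO WAL file reported by pg_controldata is not in pg_wal, replay cannot even begin from local files.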

With a little luck I’ve fully understood your situation. By all means fill in any details that I may not be totally clear about.

Hope this helps.

regards

Robert

Hello,

Thanks for the quick reply!

We lost the replicas; the leader is still present, but it is displayed as a replica:

bash-4.4$ patronictl list
+ Cluster: zabbix-ha-db-pg-db (7251563696400589006) +---------+--------------+----+-----------+-----------------+
| Member                              | Host        | Role    | State        | TL | Lag in MB | Pending restart |
+-------------------------------------+-------------+---------+--------------+----+-----------+-----------------+
| zabbix-ha-db-pg-db-66d6b974df-r9tb7 | 10.2.217.14 | Replica | start failed |    |   unknown | *               |
+-------------------------------------+-------------+---------+--------------+----+-----------+-----------------+

pg_rewind gave us the following result, because the database remains in the startup phase:

pg_rewind -D /var/lib/postgresql/data/ --source-server="port=5432 user=postgres dbname=zabbix"
pg_rewind: fatal: connection to server on socket "/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is starting up

The following log was generated during startup on a maintenance pod using the same data directory:

2024-01-04 07:50:44.934 UTC [680] LOG:  starting PostgreSQL 14.10 on x86_64-alpine-linux-musl, compiled by gcc (Alpine 13.2.1_git20231014) 13.2.1 20231014, 64-bit
2024-01-04 07:50:44.934 UTC [680] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-01-04 07:50:44.936 UTC [680] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-01-04 07:50:44.941 UTC [682] LOG:  database system was shut down in recovery at 2024-01-04 07:44:18 UTC
2024-01-04 07:50:44.941 UTC [682] WARNING:  specified neither primary_conninfo nor restore_command
2024-01-04 07:50:44.941 UTC [682] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2024-01-04 07:50:44.941 UTC [682] LOG:  entering standby mode
2024-01-04 08:00:21.468 UTC [707] FATAL:  the database system is starting up

This log is from the failed leader node of the Percona cluster.

The only difference is that its postgresql.conf also includes the archive and recovery commands:

2024-01-04 08:07:03.542 UTC [1015918] FATAL:  the database system is starting up
WARN: repo1: [FileMissingError] unable to load info file '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info' or '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info.copy':
      FileMissingError: raised from remote-0 ssh protocol on 'zabbix-ha-db-pg-db-backrest-shared-repo': unable to open missing file '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info' for read
      FileMissingError: raised from remote-0 ssh protocol on 'zabbix-ha-db-pg-db-backrest-shared-repo': unable to open missing file '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info.copy' for read
      HINT: archive.info cannot be opened but is required to push/get WAL segments.
      HINT: is archive_command configured correctly in postgresql.conf?
      HINT: has a stanza-create been performed?
      HINT: use --no-archive-check to disable archive checks during backup if you have an alternate archiving scheme.
ERROR: [103]: unable to find a valid repository
2024-01-04 08:07:03.834 UTC [1015923] FATAL:  the database system is starting up

Hi again,

This has caught my attention.

Take a look at your configuration and check these parameters; the messages above suggest there's an issue with them:

  • primary_conninfo
  • restore_command
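A quick way to check both parameters is sketched below, together with the archive settings and the signal files that decide whether PostgreSQL 12+ starts in standby mode. It assumes the .conf files live in the data directory (with Patroni some settings may also be generated or passed elsewhere), and check_recovery_config is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: report recovery-related settings and signal files in a
# PostgreSQL 12+ data directory (assumes the .conf files live there).
check_recovery_config() {
  local pgdata="${1:?usage: check_recovery_config PGDATA}"
  local f
  # Active (uncommented) settings in every *.conf in the data directory,
  # including postgresql.auto.conf written by ALTER SYSTEM.
  # The "=" is optional in postgresql.conf, so a space also counts.
  grep -HE '^[[:space:]]*(primary_conninfo|restore_command|archive_mode|archive_command)([[:space:]]|=)' \
    "$pgdata"/*.conf 2>/dev/null
  # standby.signal makes the server start in standby mode;
  # recovery.signal requests targeted archive recovery.
  for f in standby.signal recovery.signal; do
    if [ -e "$pgdata/$f" ]; then
      echo "present: $pgdata/$f"
    fi
  done
  return 0
}
```

Run it as check_recovery_config /var/lib/postgresql/data. A standby.signal file with neither primary_conninfo nor restore_command set is consistent with the "entering standby mode" warning in the maintenance-pod log: the server simply polls pg_wal for the next segment.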

TIP: Because your root problem lies with Postgres, you should simplify your environment. For example, put Patroni into maintenance mode and debug Postgres manually.
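Patroni's maintenance mode is switched on and off with patronictl pause and patronictl resume. The sketch below wraps a manual debugging session between the two; with_patroni_paused is a hypothetical helper, and it assumes patronictl in the pod already finds its configuration (otherwise add -c with the path to your Patroni config file).

```shell
#!/usr/bin/env bash
# Sketch only: run a manual debugging step with Patroni in maintenance mode.
# While paused, Patroni does not automatically start, stop or promote
# PostgreSQL and automatic failover is disabled.
with_patroni_paused() {
  local cluster="${1:?usage: with_patroni_paused CLUSTER COMMAND [ARGS...]}"
  shift
  patronictl pause "$cluster" || return 1
  local rc=0
  "$@" || rc=$?            # e.g. an interactive shell for manual debugging
  patronictl resume "$cluster"
  return "$rc"
}
```

For example, with_patroni_paused zabbix-ha-db-pg-db bash opens a shell for manual work, and the cluster is resumed when you exit it.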

Hope this helps.

Hi Robert.

Thanks for the support.

We did something similar, but using a separate maintenance pod with the existing PostgreSQL data folder, which indeed made debugging much easier, and we managed to start the database and create a dump!
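For anyone hitting the same problem, the maintenance-pod approach might look roughly like the sketch below. It is not the exact procedure used in this thread: it assumes you work on a copy of the data directory, run as the postgres user, and that the WAL already in pg_wal is enough to reach a consistent state. Removing standby.signal and recovery.signal makes the server start as a standalone primary instead of waiting in standby mode, and archive_mode=off stops it from running archive_command against the missing pgBackRest repository. The /tmp socket directory matches the maintenance-pod log earlier in the thread; dump_from_datadir is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: start a COPY of a data directory outside Patroni and dump one
# database. Run as the postgres user; all paths and names are assumptions.
dump_from_datadir() {
  local pgdata="${1:?usage: dump_from_datadir PGDATA DBNAME OUTFILE}"
  local dbname="${2:?missing DBNAME}" outfile="${3:?missing OUTFILE}"
  local sockdir=/tmp rc=0
  # Without a signal file the server starts as a standalone primary
  # instead of entering standby mode and waiting for WAL.
  rm -f "$pgdata/standby.signal" "$pgdata/recovery.signal"
  # archive_mode=off stops the server from running archive_command
  # against the missing pgBackRest repository.
  pg_ctl -D "$pgdata" -w \
    -o "-c archive_mode=off -c unix_socket_directories=$sockdir" start || return 1
  # Custom-format dump, restorable later with pg_restore.
  pg_dump -h "$sockdir" -U postgres -d "$dbname" -Fc -f "$outfile" || rc=$?
  pg_ctl -D "$pgdata" -m fast stop
  return "$rc"
}
```

For example, dump_from_datadir /var/lib/postgresql/data zabbix /tmp/zabbix.dump, then load the dump into the clean cluster with pg_restore.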

Thanks a lot
Regards,
Vitaly