We accidentally lost files from the backrestrepo! We had cluster issues that stopped the database the hard way, and during startup the database now tries to restore:
LOG
FileMissingError: raised from remote-0 ssh protocol on 'zabbix-ha-db-pg-db-backrest-shared-repo': unable to open missing file '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info' for read
FileMissingError: raised from remote-0 ssh protocol on 'zabbix-ha-db-pg-db-backrest-shared-repo': unable to open missing file '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info.copy' for read
HINT: archive.info cannot be opened but is required to push/get WAL segments.
HINT: is archive_command configured correctly in postgresql.conf?
HINT: has a stanza-create been performed?
HINT: use --no-archive-check to disable archive checks during backup if you have an alternate archiving scheme.
ERROR: [103]: unable to find a valid repository
2024-01-03 10:45:32.222 UTC [48874] FATAL: the database system is starting up
2024-01-03 10:45:33.243 UTC [48879] FATAL: the database system is starting up
2024-01-03 10:45:33.472 UTC [48881] FATAL: the database system is starting up
Please advise: can we somehow start it while ignoring the archive and backup steps, so that we can take a dump and migrate to a clean setup with the existing database content?
First task: locate and physically review your WALs.
Next: review your Postgres logs and identify specifically which WAL it stalls on.
Finally: if you are indeed missing WALs, then under the right conditions it is still possible to recover your STANDBY using pg_rewind.
With a little luck I’ve fully understood your situation. By all means fill in any details that I may not be totally clear about.
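As a rough aid for the first task, here is a minimal Python sketch that lists gaps in the WAL segment sequence of a pg_wal directory. It assumes the default 16 MB wal_segment_size (256 segments per log number) and only looks at plain 24-hex-digit segment names, so it will not recognise the suffixed and compressed names in a pgBackRest archive. The function names are my own, not part of any tool.

```python
import re
from pathlib import Path

# WAL segment files are named with 24 hex digits:
# 8 for the timeline, 8 for the log number, 8 for the segment number.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")

# Assumption: default 16 MB wal_segment_size, i.e. 256 segments per log number.
SEGMENTS_PER_LOG = 256


def parse_wal_name(name):
    """Split a WAL file name into (timeline, linear segment index)."""
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log * SEGMENTS_PER_LOG + seg


def format_wal_name(timeline, index):
    """Inverse of parse_wal_name: rebuild the 24-character file name."""
    log, seg = divmod(index, SEGMENTS_PER_LOG)
    return f"{timeline:08X}{log:08X}{seg:08X}"


def find_wal_gaps(wal_dir):
    """Return names of segments missing between the oldest and newest
    segment file found on each timeline in wal_dir."""
    by_timeline = {}
    for entry in Path(wal_dir).iterdir():
        # Skips .history, .partial, .backup files and subdirectories.
        if entry.is_file() and WAL_NAME.match(entry.name):
            timeline, index = parse_wal_name(entry.name)
            by_timeline.setdefault(timeline, set()).add(index)
    missing = []
    for timeline in sorted(by_timeline):
        present = by_timeline[timeline]
        for index in range(min(present), max(present) + 1):
            if index not in present:
                missing.append(format_wal_name(timeline, index))
    return missing
```

You would call it as find_wal_gaps("/var/lib/postgresql/data/pg_wal"), adjusting the path to your data directory. An empty list only means the files present are contiguous; it does not prove that the segment recovery is waiting for is there, so still compare against the Postgres log.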
We lost the replicas; the leader is still present, but it is displayed as a replica!
bash-4.4$ patronictl list
+ Cluster: zabbix-ha-db-pg-db (7251563696400589006) +---------+--------------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+-------------------------------------+-------------+---------+--------------+----+-----------+-----------------+
| zabbix-ha-db-pg-db-66d6b974df-r9tb7 | 10.2.217.14 | Replica | start failed | | unknown | * |
+-------------------------------------+-------------+---------+--------------+----+-----------+-----------------+
pg_rewind gave us the following result, since the database remains in the starting phase:
pg_rewind -D /var/lib/postgresql/data/ --source-server="port=5432 user=postgres dbname=zabbix"
pg_rewind: fatal: connection to server on socket "/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is starting up
The following log was generated during startup on a maintenance pod with the same data directory:
2024-01-04 07:50:44.934 UTC [680] LOG: starting PostgreSQL 14.10 on x86_64-alpine-linux-musl, compiled by gcc (Alpine 13.2.1_git20231014) 13.2.1 20231014, 64-bit
2024-01-04 07:50:44.934 UTC [680] LOG: listening on IPv4 address "0.0.0.0", port 5432
2024-01-04 07:50:44.936 UTC [680] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-01-04 07:50:44.941 UTC [682] LOG: database system was shut down in recovery at 2024-01-04 07:44:18 UTC
2024-01-04 07:50:44.941 UTC [682] WARNING: specified neither primary_conninfo nor restore_command
2024-01-04 07:50:44.941 UTC [682] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2024-01-04 07:50:44.941 UTC [682] LOG: entering standby mode
2024-01-04 08:00:21.468 UTC [707] FATAL: the database system is starting up
This log is from the failed Percona cluster leader node.
The only difference is that archive and recovery commands are also included in postgresql.conf:
2024-01-04 08:07:03.542 UTC [1015918] FATAL: the database system is starting up
WARN: repo1: [FileMissingError] unable to load info file '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info' or '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info.copy':
FileMissingError: raised from remote-0 ssh protocol on 'zabbix-ha-db-pg-db-backrest-shared-repo': unable to open missing file '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info' for read
FileMissingError: raised from remote-0 ssh protocol on 'zabbix-ha-db-pg-db-backrest-shared-repo': unable to open missing file '/backrestrepo/zabbix-ha-db-pg-db-backrest-shared-repo/archive/db/archive.info.copy' for read
HINT: archive.info cannot be opened but is required to push/get WAL segments.
HINT: is archive_command configured correctly in postgresql.conf?
HINT: has a stanza-create been performed?
HINT: use --no-archive-check to disable archive checks during backup if you have an alternate archiving scheme.
ERROR: [103]: unable to find a valid repository
2024-01-04 08:07:03.834 UTC [1015923] FATAL: the database system is starting up
Take a look at your configuration and check these parameters; the messages above suggest something is wrong with them:
primary_conninfo
restore_command
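To check those two parameters quickly, a small Python sketch like the one below can pull their effective values out of a configuration file. It is a simplification and makes assumptions: it reads a single file's text, does not follow include directives, and does not handle backslash-escaped quotes. The helper names are hypothetical.

```python
import re

# Matches "name = value" or "name value" (the equals sign is optional).
SETTING = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*(?:=\s*|\s+)(.*)$")


def strip_comment(line):
    """Remove a trailing # comment, ignoring # inside single-quoted values."""
    in_quote = False
    for i, ch in enumerate(line):
        if ch == "'":
            in_quote = not in_quote
        elif ch == "#" and not in_quote:
            return line[:i]
    return line


def find_settings(conf_text, names=("primary_conninfo", "restore_command")):
    """Return {name: raw value} for each requested parameter.

    When a parameter is set more than once, the last uncommented
    occurrence is kept, matching how the server reads the file.
    """
    found = {}
    for line in conf_text.splitlines():
        line = strip_comment(line).strip()
        match = SETTING.match(line)
        if match and match.group(1).lower() in names:
            found[match.group(1).lower()] = match.group(2).strip()
    return found
```

Feed it the contents of postgresql.conf, and of any files it includes, one at a time. A parameter that is missing from the result is not set in that file, which would be consistent with the "specified neither primary_conninfo nor restore_command" warning in the maintenance pod log.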
TIP: Because your root problem lies with Postgres, you should simplify your environment. For example, put Patroni into maintenance mode and debug Postgres manually.
We did something similar, but used a separate maintenance pod with the existing PostgreSQL data folder, which did make debugging much easier. We managed to start the database and create a dump!