Need Clarification: Best Practice for pgBackRest Configuration in Multi-DC Patroni Standby Cluster

Hello everyone,

I’m looking for some guidance and best practices regarding the configuration of pgBackRest in a multi-data center Patroni PostgreSQL setup.

Current Setup:

  • Two data centers (DC1 & DC2)

    • DC1: Hosts the primary Patroni cluster and a pgBackRest repository server (repo1)

    • DC2: Hosts a standby Patroni cluster (configured for synchronous replication)

  • In DC2, the standby cluster retrieves WAL archives from the pgBackRest repository server in DC1 using a recovery command (a rough sketch of this configuration is shown below).
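Roughly, the WAL retrieval side in DC2 looks like this (a minimal sketch only; the hostnames, the stanza name "main", and the paths are placeholders, not our real values):

    # pgbackrest.conf on the DC2 standby nodes
    [global]
    repo1-host=repo1.dc1.example.com      # repository server in DC1
    repo1-host-user=pgbackrest

    [main]
    pg1-path=/var/lib/postgresql/16/main

    # and in the PostgreSQL/Patroni configuration of those standbys:
    restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'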

Concern:

My concern is about high availability of archived WAL files. If the pgBackRest repository server in DC1 (repo1) becomes unavailable, the standby Patroni cluster in DC2 won’t be able to fetch WALs, which would interrupt replication.
Alternatively, configuring the standby to fetch WALs directly from the primary PostgreSQL nodes only works until those WALs are deleted (due to retention policies), which is also risky.

Question:

  1. What is the recommended design for ensuring the standby Patroni cluster in DC2 always has access to the required WAL files, even if DC1’s pgBackRest repository server goes down?

  2. Is it best practice to maintain a pgBackRest repository server in both DCs and set up synchronous/asynchronous mirroring between them?

  3. Are there any other reliable approaches to ensure WAL availability and minimize data loss or downtime in this scenario?

Any architecture diagrams, example configurations, or experiences from similar environments would be greatly appreciated!

Thank you!

You have two data centers (DC1 & DC2):

  • DC1: Hosts the primary Patroni cluster and a pgBackRest repository server (repo1)
  • DC2: Hosts a standby Patroni cluster (configured for synchronous replication)

Preliminary Questions:

  • Have I understood correctly that you have two etcd clusters, one per datacentre, i.e. 3 nodes in each datacentre?
  • Is DC2 meant to be a standby datacentre that is promoted in the event that DC1 fails?
  • What does it mean when you say that DC2 is configured for synchronous replication?
  • Am I to assume that you are NOT using a STANDBY LEADER configuration between DC1 and DC2 (see the sketch after these questions for what I mean), and if so, why not?
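For reference, a standby-leader setup between the two DCs would look roughly like this in Patroni's DCS configuration (a sketch only; the hostnames and the stanza name "main" are illustrative, not taken from your post):

    # patroni.yml on the DC2 nodes - bootstrap/DCS fragment for a standby cluster
    bootstrap:
      dcs:
        standby_cluster:
          host: primary.dc1.example.com        # or a VIP/HAProxy endpoint in DC1
          port: 5432
          restore_command: pgbackrest --stanza=main archive-get %f "%p"

With this in place, one DC2 member runs as a standby leader that the other DC2 members replicate from, and it can be promoted if DC1 is lost.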

What is the recommended design for ensuring the standby Patroni cluster in DC2 always has access to the required WAL files, even if DC1’s pgBackRest repository server goes down?

One method would be to place your pgBackRest repository server in DC2 and have it take the backups and WAL from DC1 instead. Pointing the pgBackRest service at a gateway (HAProxy/keepalived/VIP) that always resolves to the read-write primary would also solve a lot of issues when there is a datacentre failover to the standby DC2. And yes, there would be added latency, which is another consideration.
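A minimal sketch of that layout, assuming placeholder hostnames, a stanza called "main", and a VIP/HAProxy endpoint that always resolves to the current primary:

    # pgbackrest.conf on the repository server, now located in DC2
    [global]
    repo1-path=/var/lib/pgbackrest
    repo1-retention-full=2

    [main]
    # reach the database through the read-write endpoint so backups keep
    # targeting the current primary even after a failover to DC2
    pg1-host=pg-rw.example.com
    pg1-path=/var/lib/postgresql/16/main

    # and on every PostgreSQL node, point WAL archiving at the DC2 repository:
    # [global]
    # repo1-host=repo1.dc2.example.com
    # repo1-host-user=pgbackrest

The standbys in both DCs can then use the same archive-get restore_command against the repository in DC2.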

Is it best practice to maintain a pgBackRest repository server in both DCs and set up synchronous/asynchronous mirroring between them?

Nope, there is too much risk of the two repositories becoming inconsistent with each other.

Are there any other reliable approaches to ensure WAL availability and minimize data loss or downtime in this scenario?

You need to describe the business rules, i.e. what is the purpose of DC2?

Answering the aforementioned questions will help better direct next steps.

Hope this helps.