What we want to establish is a redundant setup spread over our two availability zones, with automatic failover when a server or an availability zone fails.
The network zone hosting the HAProxy servers is stretched over both availability zones (network and hosting).
Requirements:
• no manual interference needed for failover
• database stays writable during failover
We need help configuring the setup and validating the design; any config example in this regard would be helpful.
I am attaching the high-level design of what we want.
We are planning on having two HAProxy servers, one in each datacenter, and in each DC a set of two Patroni/PostgreSQL servers, where Patroni, together with etcd, ensures the HA part.
In particular, we would like guidance on how to configure replication among the intra-site and inter-site Postgres nodes.
@Ishan_Chawla
You can set up a Standby Cluster in the other DC [DC2], which will replicate asynchronously from the Primary Cluster [DC1]; however, for failover you have to rely on a manual approach, as automatic promotion is not possible across DCs.
https://patroni.readthedocs.io/en/latest/ha_multi_dc.html#asynchronous-replication
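For reference, a minimal sketch of what that could look like in the DC2 Patroni configuration, with hypothetical addresses (the `host` should be a stable entry point reaching the DC1 primary, e.g. a VIP or HAProxy, not a single node):

```yaml
# Sketch only: the address is an assumption, not your actual value.
bootstrap:
  dcs:
    standby_cluster:
      host: 10.0.1.100          # stable endpoint for the DC1 primary (hypothetical)
      port: 5432
      primary_slot_name: standbyclust
```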
Well, if you use a single (primary) cluster with its nodes allocated across the different DCs/networks, then you can achieve such automatic failover.
https://patroni.readthedocs.io/en/latest/ha_multi_dc.html#synchronous-replication
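A minimal sketch of the single stretched cluster, assuming hypothetical names and addresses. One caveat: etcd needs a quorum, so with members split over only two DCs, losing the majority DC also takes down the DCS; a tie-breaker etcd member in a third location avoids that:

```yaml
# Sketch only: scope, names and addresses are assumptions.
scope: pgcluster
name: dc1-node1

etcd3:
  # One etcd member per DC plus a tie-breaker in a third site (assumed).
  hosts: 10.0.1.5:2379,10.0.2.5:2379,10.0.3.5:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    synchronous_mode: true   # a commit waits for a synchronous standby
    # synchronous_mode_strict: true would block writes when no synchronous
    # standby is available, trading availability for zero data loss.
```

With `synchronous_mode` enabled, Patroni will only promote a node that was a synchronous standby, so an automatic cross-DC failover should not lose committed transactions.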
You have to remove the **standby_cluster section** from the standby configuration (/etc/patroni/cluster2-0.yml) in order to perform the manual failover:
-standby_cluster:
-  host: xxx
-  port: 5432
-  primary_slot_name: standbyclust
The point is that performing such a failover, especially from one network/zone to another, can lead to performance issues, network instability, or replication lag/stale data problems. The standby should be considered for a disaster recovery scenario.
Are you considering using the standby [DC2] for some traffic, or is it just for the DR solution?
The other layers (HAProxy/VIP) should work fine as long as they are able to recognize DC2 once the standby is promoted to primary leader.
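For that part, a common approach (similar to the HAProxy template shipped with Patroni) is to health-check the Patroni REST API: `/primary` answers 200 only on the current writable leader. A sketch with hypothetical addresses, assuming the REST API listens on its default port 8008:

```
# Sketch only: addresses are assumptions. Only the node that currently
# passes the /primary check (the writable leader) receives traffic.
listen postgres_primary
    bind *:5000
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server dc1-pg1 10.0.1.11:5432 check port 8008
    server dc1-pg2 10.0.1.12:5432 check port 8008
    server dc2-pg1 10.0.2.11:5432 check port 8008
    server dc2-pg2 10.0.2.12:5432 check port 8008
```

In the standby-cluster design, the DC2 nodes fail this check until the standby is promoted, at which point HAProxy starts routing to them automatically.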
Note: if your application lives in a different network/DC, latency problems could arise when connecting to DC2, so take this into consideration as well. If the application is deployed at the same DR site, this would not be a blocker.