What are the pros and cons of running etcd outside the PostgreSQL database servers? The setup is one primary and one standby in the DC and one standby in DR. Which is the best option: three etcd nodes on separate VMs, or on the same servers as PostgreSQL + Patroni?
How many etcd nodes do we need for automatic DR? Please explain.
Hello Umashankar_Balasubra
Welcome to the Percona forum!
It is best practice to keep ETCD on separate servers.
It has several benefits:
- You can use one ETCD cluster for multiple Patroni clusters,
- Patroni always needs to see a majority of ETCD nodes (half of the cluster, rounded down, plus one) to acquire and keep the leader lock. If it loses quorum (it can no longer see more than half of the ETCD nodes), Patroni fences the current leader and demotes it to a replica to prevent split brain. If ETCD runs on separate servers, you do not lose quorum even when two out of three Patroni (PostgreSQL) nodes are down, and you can keep operating with only one database server running.
- If Postgres and ETCD share a server and Postgres has to serve a heavier load than usual, the server becomes less responsive. ETCD can contribute to that as well, since it may be somewhat I/O intensive at times. This can cause an unwanted failover or fencing of the leader node: when health checks are not answered on time several times in a row, Patroni assumes that ETCD is down, and if more than half of the nodes are slow to respond, it either fences the current leader or runs a new leader election to find a server that can respond faster. Keeping ETCD on separate servers removes that risk, because ETCD keeps replying on time even when the Patroni servers are overloaded.
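The quorum rule in the second point can be checked with a few lines of arithmetic. This is only a sketch of the majority calculation, not Patroni or etcd code; the function names and the site labels ("dc", "dr", "witness") are made up for illustration.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if the reachable members are still a majority."""
    return reachable >= quorum(cluster_size)


def survives_site_loss(placement: dict, lost_site: str) -> bool:
    """True if the ETCD cluster keeps quorum after losing one whole site.

    placement maps a (hypothetical) site name to its number of ETCD members.
    """
    total = sum(placement.values())
    remaining = total - placement[lost_site]
    return has_quorum(remaining, total)


if __name__ == "__main__":
    for size in (1, 2, 3, 4, 5):
        print(f"{size} members: quorum={quorum(size)}, "
              f"tolerated failures={size - quorum(size)}")

    # Two members in the DC and one in DR: losing the DC leaves 1 of 3.
    print(survives_site_loss({"dc": 2, "dr": 1}, "dc"))
    # One member per site across three sites: any single site can be lost.
    print(survives_site_loss({"dc": 1, "dr": 1, "witness": 1}, "dc"))
```

With three members the majority is two, so a layout that keeps two of the three members in one site cannot keep quorum when that site is lost. This is plain majority arithmetic and says nothing about network latency between sites.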
Hope it helps!
Cheers,
Mateusz
Thanks, Mateusz, for the update. The first two points are clear. In the documentation (High Availability in PostgreSQL with Patroni - Percona Distribution for PostgreSQL), the example HA setup runs etcd and Patroni on the same VM. Is there a best-practice document or blog post from Percona that recommends, for HA, that “It is best practice to keep ETCD on separate servers”?
I assume local communication will be good and network dependency reduced, since Patroni, etcd, and Postgres are all on the same server. What do you mean by “it may cause unwanted failover or fencing of leader node”?
Any update on my query?
Hello,
Local communication, as you mentioned, seems great, but if the server gets hit with high load and starts responding more slowly, ETCD is affected as well, and if ETCD liveness probes start failing, that may trigger fencing or failover, as I mentioned.
This is because Patroni needs to receive a response from ETCD on time to confirm that the Patroni leader has quorum and can keep the leader lock. If the response does not arrive on time, the leader cannot keep the lock, which may trigger a failover to another server that is not struggling for resources. And if the load is high on all DB servers, causing ETCD to fail on all of them, it may trigger the fencing mechanism.
Those scenarios are not frequent, but they can happen, especially on lower-spec servers. That is why it is recommended to keep them separate if possible, but it is not a requirement.
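The timing dependency described above can be pictured with a toy model. This is a deliberately simplified sketch and not how Patroni is implemented: it only assumes that the leader must get an answer from ETCD within a timeout in each heartbeat cycle. `retry_timeout` is a real Patroni setting, but the 10-second value and the latency figures below are illustrative, and `first_missed_renewal` is a hypothetical helper.

```python
from typing import Optional, Sequence


def first_missed_renewal(
    etcd_latencies: Sequence[Optional[float]],
    retry_timeout: float = 10.0,
) -> Optional[int]:
    """Return the index of the first heartbeat cycle in which the leader
    could not renew its lock in time, or None if every renewal succeeded.

    Each entry is the ETCD response time in seconds for one cycle;
    None means ETCD did not answer at all in that cycle.
    """
    for cycle, latency in enumerate(etcd_latencies):
        if latency is None or latency > retry_timeout:
            return cycle
    return None


if __name__ == "__main__":
    # ETCD on its own servers: response times stay low and steady.
    separate_etcd = [0.02, 0.05, 0.03, 0.04]
    # ETCD sharing an overloaded database server: responses degrade.
    colocated_under_load = [0.02, 0.8, 4.5, 12.0, None]

    print(first_missed_renewal(separate_etcd))
    print(first_missed_renewal(colocated_under_load))
```

In this model the co-located case misses its renewal in the fourth cycle, which is the point where a real leader would risk losing the lock. The actual behaviour also depends on `ttl`, `loop_wait`, and how many ETCD members are slow at once.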
As for blog posts, I am not aware of any that explains why ETCD should be installed on separate servers, but here is one that demonstrates Patroni administration, and its example uses ETCD on separate servers. That shows it is the recommended approach; we even use it in test environments.
Hope it helps.
Thanks, Mateusz, for making it clear.