Hello,
I noticed that when the k8s API is unavailable, both the primary and the replicas in a PG cluster go down, since Patroni cannot reach k8s.
Is it safe to enable DCS failsafe mode using this operator? Are there examples on how to do it with the operator?
Bumping this thread. It’s not possible to set failsafe_mode using the operator, though we’ve modified the operator to allow it.
My question: is the operator intended to be compatible with this mode or not?
Is there any progress on this? Has anyone verified whether the PG cluster keeps functioning properly when the K8s API is unavailable?
Hi @Joshua_Sierles and @chang_junye,
This is a known issue tracked in K8SPG-429. You don’t need to modify the operator to enable failsafe_mode. The CR already supports it:
```yaml
spec:
  patroni:
    dynamicConfiguration:
      failsafe_mode: true
```
The dynamicConfiguration field is schemaless and passes any valid Patroni config through to patronictl edit-config without filtering.
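For context, here is a minimal sketch of where this sits in a full CR. The cluster name and the extra Patroni settings (`ttl`, `loop_wait`, `retry_timeout`) are illustrative placeholders, not required values:

```yaml
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster1          # placeholder name
spec:
  patroni:
    dynamicConfiguration:
      failsafe_mode: true  # keep primary read-write during DCS outages
      # Standard Patroni dynamic settings, shown here only to illustrate
      # that arbitrary keys pass through unfiltered:
      ttl: 30
      loop_wait: 10
      retry_timeout: 10
```

After applying, you can confirm the setting landed with `patronictl show-config` inside a database pod.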
I validated this on Percona PG Operator v2.8.2 (Patroni 4.0.3, PG 17.7). Without failsafe, the primary demotes to read-only within ~14 seconds of losing the K8s API. With failsafe enabled, the primary stays read-write by verifying all members are alive via direct REST API calls (POST /failsafe on port 8008).
Important caveat: enabling failsafe_mode alone is not enough. The failsafe heartbeats use pod hostnames, which are resolved by CoreDNS. CoreDNS depends on the K8s API. So when the API goes down, DNS breaks too, and the failsafe calls fail silently. You need NodeLocal DNSCache or equivalent so that pod DNS resolution works independently of the API server.
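As one possible approach (this depends on how your cluster deploys NodeLocal DNSCache, and in iptables mode no kubelet change is usually needed at all), in IPVS mode kubelet must be pointed at the cache's conventional link-local address rather than the kube-dns service IP. A sketch of the kubelet config fragment:

```yaml
# kubelet configuration fragment (KubeletConfiguration).
# Assumption: NodeLocal DNSCache is deployed at its conventional
# link-local address 169.254.20.10; adjust to your deployment.
clusterDNS:
  - 169.254.20.10
```

With the cache answering locally, Patroni's failsafe REST calls between pods can resolve hostnames even while the API server (and therefore CoreDNS) is down.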
Also keep in mind that failsafe requires all Patroni members to respond (not quorum). If any replica is unreachable during the API outage, the primary still demotes. This is by design to prevent split-brain.
There is an internal initiative exploring the inclusion of etcd as a sidecar in operator images, which would eliminate the K8s API dependency for leader election entirely. That work is still in early stages.