Intermittent handshake failures after enabling Istio Ambient Mode

Environment

  • Kubernetes: EKS (AWS)
  • Percona PSMDB Operator: 1.21.1
  • Percona Server for MongoDB: 7.0.24-13
  • Percona Backup for MongoDB (PBM): 2.11.0
  • Istio: 1.28.3 (Ambient mode, no sidecars)
  • Replica set: 3 members (psmdb-db-rs0-{0,1,2}), sharding disabled
  • TLS: preferTLS

Summary

We have a PSMDB cluster that was fully functional before we introduced Istio ambient mode into the namespace. After enabling Istio ambient (ztunnel-based L4 mesh, no sidecars), the PSMDB operator and the backup agent started experiencing intermittent connection handshake failures to replica set members. The replica set itself remains healthy (3/3 ready), but the operator’s reconciliation loop and backup operations fail intermittently.

Symptoms

1. Operator reconciliation errors

The PerconaServerMongoDB CR flips between ready and error state. The error message is always a connection handshake failure to one of the RS members (the target member varies — sometimes rs0-0, sometimes rs0-2):

Status:
  Backup Config Hash:  9623d82b4e716743357ab93e2cf249a013a23dc3b843e51c7e49d
  Backup Image:        percona/percona-backup-mongodb:2.11.0
  Backup Version:      2.11.0
  Conditions:
    Last Transition Time:  2026-02-06T18:52:35Z
    Status:                True
    Type:                  initializing
    Last Transition Time:  2026-02-06T19:03:32Z
    Message:               update PiTR config: create pbm object: create PBM connection to psmdb-db-rs0-0.psmdb-db-rs0.ns1.svc.cluster.local:27017,psmdb-db-rs0-1.psmdb-db-rs0.ns1.svc.cluster.local:27017,psmdb-db-rs0-2.psmdb-db-rs0.ns1.svc.cluster.local:27017: create mongo connection: ping: connection() error occurred during connection handshake: handshake failure:  connection(psmdb-db-rs0-2.psmdb-db-rs0.ns1.svc.cluster.local:27017[-2911361]) socket was unexpectedly closed: EOF: connection(psmdb-db-rs0-2.psmdb-db-rs0.ns1.svc.cluster.local:27017[-2911361]) socket was unexpectedly closed: EOF
    Reason:                ErrorReconcile
    Status:                True
    Type:                  error
    Last Transition Time:  2026-02-06T19:21:17Z
    Status:                True
    Type:                  ready
    Last Transition Time:  2026-02-16T19:57:16Z
    Status:                False
    Type:                  sharding
  Host:                    psmdb-db-rs0.ns1.svc.cluster.local
  Message:                 Error: dial: ping mongo: connection() error occurred during connection handshake: handshake failure:  connection(psmdb-db-rs0-2.psmdb-db-rs0.ns1.svc.cluster.local:27017[-2911376]) socket was unexpectedly closed: EOF: connection(psmdb-db-rs0-2.psmdb-db-rs0.ns1.svc.cluster.local:27017[-2911376]) socket was unexpectedly closed: EOF
  Mongo Image:             percona/percona-server-mongodb:7.0.24-13
  Mongo Version:           7.0.24-13
  Observed Generation:     9
  Ready:                   3
  Replsets:
    rs0:
      Initialized:  true
      Ready:        3
      Size:         3
      Status:       ready
  Size:             3
  State:            error

The replica set is fully operational (Ready: 3, Status: ready) and all application connections via the service work fine, but the operator intermittently can’t reach individual pod hostnames during reconciliation.

2. Backup failures

On-demand and scheduled backups (PBM logical backup to S3-compatible storage) fail with similar errors. The backup agent successfully dumps most collections and uploads them to storage, but the backup ultimately fails because PBM loses heartbeat connectivity to an RS member:

2026-02-16T20:07:13 I [backup/...] dump finished, waiting for the oplog
2026-02-16T20:07:13 I [backup/...] mark backup as error
    `check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1771272383`
2026-02-16T20:07:13 E [backup/...] backup: check cluster for dump done:
    convergeCluster: lost shard rs0, last beat ts: 1771272383

The underlying connection errors during backup are consistently:

connection() error occurred during connection handshake: handshake failure:
connection(psmdb-db-rs0-2...[-259]) socket was unexpectedly closed: EOF

Questions

  1. Is the PSMDB operator known to work with Istio ambient mode? Are there any recommended configurations or known incompatibilities with ztunnel traffic interception?

  2. Are there recommended Istio PeerAuthentication / DestinationRule configurations for PSMDB clusters running in an Istio ambient mesh?

Any guidance on running the Percona PSMDB operator in an Istio ambient mode environment would be greatly appreciated.

Hi @shepz,

Since PSMDB handles its own encryption (intra-cluster TLS), you should explicitly tell Istio to stay out of the way on the MongoDB ports. This avoids “double-mTLS” (Istio mTLS wrapped around MongoDB’s own TLS), which is a likely cause of these EOF errors. You could try it with a PeerAuthentication policy. Here’s something that might work (I haven’t tried it):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: psmdb-disable-mtls
  namespace: ns1
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: psmdb-db # Match your CR name
  mtls:
    mode: PERMISSIVE

Setting this to PERMISSIVE means the MongoDB pods accept connections that do not arrive over Istio mTLS, so the Operator and PBM can reach them with their native TLS. It does not take traffic between meshed pods out of ztunnel, though.

If the PeerAuthentication policy does not stabilize the PBM heartbeats, you may need to bypass ztunnel entirely for the MongoDB data port (27017, or whichever port you use).
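
One way to do that bypass, as far as I know, is to opt the MongoDB pods out of ambient capture with the istio.io/dataplane-mode: none pod label. Note that this takes the whole pod out of ztunnel, not only port 27017. Below is a minimal, untested sketch. It assumes the CR is named psmdb-db in namespace ns1 (inferred from the pod hostnames in the question) and that your operator version supports labels under replsets. Merge it into your existing CR rather than applying it on its own:

```yaml
# Fragment to merge into the existing PerconaServerMongoDB CR (not a complete manifest).
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: psmdb-db   # Assumed CR name, inferred from the pod hostnames
  namespace: ns1
spec:
  replsets:
    - name: rs0
      labels:
        # Opts these pods out of ambient (ztunnel) capture entirely,
        # even though the namespace is labeled for ambient mode.
        istio.io/dataplane-mode: none
```

Changing pod labels will most likely trigger a rolling restart of the replica set. Traffic to these pods is then no longer covered by Istio mTLS, so it relies on PSMDB’s own TLS settings.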

Let me know if that moves things forward.