PITR nomination can be empty for one shard even when all agents are healthy

Note: permission was denied when trying to create an issue on the PBM Jira page, so I'm reporting the issue here as a fallback.

Summary

In a sharded cluster, PBM can start PITR normally, but one shard may end up with an empty nomination list (n: [], ack: "") even though all nodes in that shard are healthy.

Observed

Logs:

[pitr] checking locks in the whole cluster
[pitr] init pitr meta on the first usage
[pitr] cluster is ready for nomination
[pitr] reconciling ready status from all agents
[pitr] agents in ready: 6; waiting for agents: 6
[pitr] cluster leader sets running status
[pitr] pitr nomination list for popla-58b7d9ccc6-shard-jg2: []

pbmAgents for the affected shard:

db.pbmAgents.find({ rs: "popla-58b7d9ccc6-shard-jg2" }).pretty()

All nodes are healthy:

  • PRIMARY/SECONDARY
  • repl_lag: 0
  • pbms.ok: true
  • nodes.ok: true
  • stors.ok: true
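
A quick way to double-check this from mongosh (assuming PBM's metadata lives in the admin database, as the query above suggests) is to look for any agent in the shard that reports a problem; with all nodes healthy this should return nothing. The field names are taken from the pbmAgents documents shown further below:

// sketch of a diagnostic query: list agents in the shard whose health flags are not ok
db.pbmAgents.find({
  rs: "popla-58b7d9ccc6-shard-jg2",
  $or: [
    { "pbms.ok": false },
    { "nodes.ok": false },
    { "stors.ok": false }
  ]
}).pretty()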

But pbmPITR contains:

{
  status: "running",
  n: [
    { rs: "...-config-server", n: [...], ack: "..." },
    { rs: "...-shard-tl7", n: [...], ack: "..." },
    { rs: "...-shard-jg2", n: [], ack: "" }
  ]
}
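
A small aggregation (a sketch, assuming the same pbmPITR document shape as above) can surface replica sets that ended up with an empty nomination list:

// sketch: list replica sets whose nomination list in pbmPITR is empty
// (with the document above, this would return { rs: "...-shard-jg2", ack: "" })
db.pbmPITR.aggregate([
  { $unwind: "$n" },
  { $match: { "n.n": { $size: 0 } } },
  { $project: { _id: 0, rs: "$n.rs", ack: "$n.ack" } }
])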

Expected

PBM should not switch to running unless every required shard has at least one candidate agent.

Actual

PBM proceeds with PITR nomination using a snapshot where one shard has an empty candidate list.

Impact

  • One shard gets no PITR nominee
  • PITR for that shard never starts
  • PITR metadata becomes inconsistent

Suspected cause

leadNomination() uses a one-time snapshot from ListSteadyAgents() and does not verify that every shard has at least one usable candidate before setting cluster status to running and writing nomination metadata.
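
To make the suspected gap concrete, here is a sketch in mongosh of the invariant that appears to be violated: every replica set with registered agents should have at least one candidate in pbmPITR before the status is flipped to running. The collection and field names are taken from the documents above; the check itself is illustrative, not PBM's actual leader code.

// illustrative check, not PBM's actual leadNomination() logic:
// every rs known to pbmAgents should have >= 1 nominee in pbmPITR.n
const known = db.pbmAgents.distinct("rs");
const meta = db.pbmPITR.findOne({ status: "running" });
const nominated = (meta ? meta.n : [])
  .filter(e => e.n && e.n.length > 0)
  .map(e => e.rs);
known.filter(rs => !nominated.includes(rs));  // expected: []; here: ["popla-58b7d9ccc6-shard-jg2"]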

@Boris_Ilijic Please have a look.

Hello @wanglong,
Thank you for reporting this.

I suspect that something is misconfigured in the case above. Can you please share the output of the pbm status command?

This is an intermittent bug; it does not reproduce consistently.

The cluster has since been deleted, and the new cluster does not have this problem.

I could only recover the following information and logs from my conversation context with an AI assistant :rofl:

The pbm status output showed the popla-58b7d9ccc6-shard-jg2 agent nodes in a bad state, while sh.status reported the popla-58b7d9ccc6-shard-jg2 shard as healthy, and the config server has the following records:

[
  {
    _id: ObjectId('69fe8b1e2394ab948446481e'),
    n: 'popla-58b7d9ccc6-shard-jg2-0.popla-58b7d9ccc6-shard-jg2-headless.cloud-ns.svc.cluster.local:27017',
    rs: 'popla-58b7d9ccc6-shard-jg2',
    s: 2,
    str: 'SECONDARY',
    hdn: false,
    psv: false,
    arb: false,
    delay: 0,
    repl_lag: 0,
    pbms: { ok: true, e: '' },
    nodes: { ok: true, e: '' },
    stors: { ok: true, e: '' },
    hb: Timestamp({ t: 1778307732, i: 3 }),
    e: ''
  },
  {
    _id: ObjectId('69fe8b1e2394ab948446481f'),
    n: 'popla-58b7d9ccc6-shard-jg2-2.popla-58b7d9ccc6-shard-jg2-headless.cloud-ns.svc.cluster.local:27017',
    rs: 'popla-58b7d9ccc6-shard-jg2',
    s: 1,
    str: 'PRIMARY',
    hdn: false,
    psv: false,
    arb: false,
    delay: 0,
    repl_lag: 0,
    pbms: { ok: true, e: '' },
    nodes: { ok: true, e: '' },
    stors: { ok: true, e: '' },
    hb: Timestamp({ t: 1778307730, i: 1 }),
    e: ''
  },
  {
    _id: ObjectId('69fe8b1e2394ab9484464820'),
    n: 'popla-58b7d9ccc6-shard-jg2-1.popla-58b7d9ccc6-shard-jg2-headless.cloud-ns.svc.cluster.local:27017',
    rs: 'popla-58b7d9ccc6-shard-jg2',
    s: 2,
    str: 'SECONDARY',
    hdn: false,
    psv: false,
    arb: false,
    delay: 0,
    repl_lag: 0,
    pbms: { ok: true, e: '' },
    nodes: { ok: true, e: '' },
    stors: { ok: true, e: '' },
    hb: Timestamp({ t: 1778307732, i: 3 }),
    e: ''
  }
]

The pbm-agent leader has the following logs:

[pitr] waiting for cluster ready status
2026-05-09T01:16:58.000+0000 D [pitr] start pitr config monitor
2026-05-09T01:16:58.000+0000 D [pitr] checking locks in the whole cluster
2026-05-09T01:16:58.000+0000 D [pitr] start pitr agent activity monitor
2026-05-09T01:16:58.000+0000 D [pitr] start pitr error monitor
2026-05-09T01:16:58.000+0000 D [pitr] start pitr hb
2026-05-09T01:16:58.000+0000 D [pitr] start pitr topo monitor
2026-05-09T01:17:13.000+0000 D [pitr] init pitr meta on the first usage
2026-05-09T01:17:13.000+0000 D [pitr] cluster is ready for nomination
2026-05-09T01:17:13.000+0000 D [pitr] reconciling ready status from all agents
2026-05-09T01:17:14.000+0000 D [pitr] waiting pitr nomination
2026-05-09T01:17:15.000+0000 D [pitr] agents in ready: 6; waiting for agents: 6
2026-05-09T01:17:15.000+0000 D [pitr] cluster leader sets running status
2026-05-09T01:17:15.000+0000 D [pitr] pitr nomination list for popla-58b7d9ccc6-config-server: [[popla-58b7d9ccc6-config-server-1.popla-58b7d9ccc6-config-server-headless.-clou
2026-05-09T01:17:15.000+0000 D [pitr] pitr nomination list for popla-58b7d9ccc6-shard-jg2: []
2026-05-09T01:17:15.000+0000 D [pitr] pitr nomination list for popla-58b7d9ccc6-shard-tl7: [[popla-58b7d9ccc6-shard-tl7-0.popla-58b7d9ccc6-shard-tl7-headless.-cloud-ns.svc.clu
2026-05-09T01:17:15.000+0000 D [pitr] pitr nomination popla-58b7d9ccc6-config-server, set candidates [popla-58b7d9ccc6-config-server-1.popla-58b7d9ccc6-config-server-headless.-
2026-05-09T01:17:15.000+0000 D [pitr] pitr nomination popla-58b7d9ccc6-shard-tl7, set candidates popla-58b7d9ccc6-shard-tl7-0.popla-58b7d9ccc6-shard-tl7-headless.-cloud-ns.sv
2026-05-09T01:17:18.000+0000 D [pitr] skip after pitr nomination, probably started by another node

All agents from the shard `popla-58b7d9ccc6-shard-jg2` have the same logs:

2026-05-09T05:24:34.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:24:49.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:26:49.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:27:04.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:29:19.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:31:19.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:31:34.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:33:34.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:33:49.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:35:49.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:36:04.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:38:04.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:38:19.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:40:19.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:40:34.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:42:34.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:42:49.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:44:49.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:45:04.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:47:04.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:47:19.000+0000 D [pitr] waiting for cluster ready status

For some reason PBM didn't nominate any member of the popla-58b7d9ccc6-shard-jg2 replica set.

[pitr] agents in ready: 6; waiting for agents: 6

The above line confirms that PBM sees just 6 agents, while the cluster appears to have 9 nodes (3 replica sets × 3 nodes).
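
One quick way to tell whether the shard's agents were never registered or were only filtered out later would be to count the registered agents per replica set (a mongosh sketch against the same pbmAgents collection as above):

// sketch: registered agents per replica set; with a 3x3 topology each
// rs should show 3, even if only 6 were counted as "ready"
db.pbmAgents.aggregate([
  { $group: { _id: "$rs", agents: { $sum: 1 } } },
  { $sort: { _id: 1 } }
])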

The pbm status output showed the popla-58b7d9ccc6-shard-jg2 agent nodes in a bad state

If that was the case, the behavior is expected, but we would need to know what type of error that was.

In any case, it's hard to say anything more without the requested information. If the same issue pops up again, please share the pbm status output and the relevant logs.

Hi again,

It was possible to reproduce something similar (the root cause is the same), so please track its progress here: https://perconadev.atlassian.net/browse/PBM-1765