PITR nomination can be empty for one shard even when all agents are healthy

Note: permission was denied when trying to create an issue on the PBM Jira page, so I'm reporting the issue here as a fallback.

Summary

In a sharded cluster, PBM can start PITR normally, but one shard may end up with an empty nomination list (n: [], ack: "") even though all nodes in that shard are healthy.

Observed

Logs:

[pitr] checking locks in the whole cluster
[pitr] init pitr meta on the first usage
[pitr] cluster is ready for nomination
[pitr] reconciling ready status from all agents
[pitr] agents in ready: 6; waiting for agents: 6
[pitr] cluster leader sets running status
[pitr] pitr nomination list for popla-58b7d9ccc6-shard-jg2: []

pbmAgents for the affected shard:

db.pbmAgents.find({ rs: "popla-58b7d9ccc6-shard-jg2" }).pretty()

All nodes are healthy:

  • PRIMARY/SECONDARY
  • repl_lag: 0
  • pbms.ok: true
  • nodes.ok: true
  • stors.ok: true
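
A quick way to double-check this from mongosh (assuming PBM's metadata lives in the admin database, as the query above suggests) is to look for any agent in the shard that reports a problem; with all nodes healthy this should return nothing. The field names are taken from the pbmAgents documents shown further below:

// sketch of a diagnostic query: list agents in the shard whose health flags are not ok
db.pbmAgents.find({
  rs: "popla-58b7d9ccc6-shard-jg2",
  $or: [
    { "pbms.ok": false },
    { "nodes.ok": false },
    { "stors.ok": false }
  ]
}).pretty()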

But pbmPITR contains:

{
  status: "running",
  n: [
    { rs: "...-config-server", n: [...], ack: "..." },
    { rs: "...-shard-tl7", n: [...], ack: "..." },
    { rs: "...-shard-jg2", n: [], ack: "" }
  ]
}
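
A small aggregation (a sketch, assuming the same pbmPITR document shape as above) can surface replica sets that ended up with an empty nomination list:

// sketch: list replica sets whose nomination list in pbmPITR is empty
// (with the document above, this would return { rs: "...-shard-jg2", ack: "" })
db.pbmPITR.aggregate([
  { $unwind: "$n" },
  { $match: { "n.n": { $size: 0 } } },
  { $project: { _id: 0, rs: "$n.rs", ack: "$n.ack" } }
])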

Expected

PBM should not switch to running unless every required shard has at least one candidate agent.

Actual

PBM proceeds with PITR nomination using a snapshot where one shard has an empty candidate list.

Impact

  • One shard gets no PITR nominee
  • PITR for that shard never starts
  • PITR metadata becomes inconsistent

Suspected cause

leadNomination() uses a one-time snapshot from ListSteadyAgents() and does not verify that every shard has at least one usable candidate before setting cluster status to running and writing nomination metadata.
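
To make the suspected gap concrete, here is a sketch in mongosh of the invariant that appears to be violated: every replica set with registered agents should have at least one candidate in pbmPITR before the status is flipped to running. The collection and field names are taken from the documents above; the check itself is illustrative, not PBM's actual leader code.

// illustrative check, not PBM's actual leadNomination() logic:
// every rs known to pbmAgents should have >= 1 nominee in pbmPITR.n
const known = db.pbmAgents.distinct("rs");
const meta = db.pbmPITR.findOne({ status: "running" });
const nominated = (meta ? meta.n : [])
  .filter(e => e.n && e.n.length > 0)
  .map(e => e.rs);
known.filter(rs => !nominated.includes(rs));  // expected: []; here: ["popla-58b7d9ccc6-shard-jg2"]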

@Boris_Ilijic Please have a look.

Hello @wanglong,
Thank you for reporting this.

I suspect that something is misconfigured in the case above. Can you please share the output of the pbm status command?

This is an intermittent bug; it does not reproduce consistently.

The cluster has since been deleted, and the new cluster does not have this problem.

I could only recover the following information and logs from my conversation context with an AI assistant :rofl:

The pbm status output showed the popla-58b7d9ccc6-shard-jg2 agent nodes in a bad state, while sh.status reported the popla-58b7d9ccc6-shard-jg2 shard as healthy, and the config server has the following records:

[
  {
    _id: ObjectId('69fe8b1e2394ab948446481e'),
    n: 'popla-58b7d9ccc6-shard-jg2-0.popla-58b7d9ccc6-shard-jg2-headless.cloud-ns.svc.cluster.local:27017',
    rs: 'popla-58b7d9ccc6-shard-jg2',
    s: 2,
    str: 'SECONDARY',
    hdn: false,
    psv: false,
    arb: false,
    delay: 0,
    repl_lag: 0,
    pbms: { ok: true, e: '' },
    nodes: { ok: true, e: '' },
    stors: { ok: true, e: '' },
    hb: Timestamp({ t: 1778307732, i: 3 }),
    e: ''
  },
  {
    _id: ObjectId('69fe8b1e2394ab948446481f'),
    n: 'popla-58b7d9ccc6-shard-jg2-2.popla-58b7d9ccc6-shard-jg2-headless.cloud-ns.svc.cluster.local:27017',
    rs: 'popla-58b7d9ccc6-shard-jg2',
    s: 1,
    str: 'PRIMARY',
    hdn: false,
    psv: false,
    arb: false,
    delay: 0,
    repl_lag: 0,
    pbms: { ok: true, e: '' },
    nodes: { ok: true, e: '' },
    stors: { ok: true, e: '' },
    hb: Timestamp({ t: 1778307730, i: 1 }),
    e: ''
  },
  {
    _id: ObjectId('69fe8b1e2394ab9484464820'),
    n: 'popla-58b7d9ccc6-shard-jg2-1.popla-58b7d9ccc6-shard-jg2-headless.cloud-ns.svc.cluster.local:27017',
    rs: 'popla-58b7d9ccc6-shard-jg2',
    s: 2,
    str: 'SECONDARY',
    hdn: false,
    psv: false,
    arb: false,
    delay: 0,
    repl_lag: 0,
    pbms: { ok: true, e: '' },
    nodes: { ok: true, e: '' },
    stors: { ok: true, e: '' },
    hb: Timestamp({ t: 1778307732, i: 3 }),
    e: ''
  }
]

The pbm-agent leader has the following logs:

[pitr] waiting for cluster ready status
2026-05-09T01:16:58.000+0000 D [pitr] start pitr config monitor
2026-05-09T01:16:58.000+0000 D [pitr] checking locks in the whole cluster
2026-05-09T01:16:58.000+0000 D [pitr] start pitr agent activity monitor
2026-05-09T01:16:58.000+0000 D [pitr] start pitr error monitor
2026-05-09T01:16:58.000+0000 D [pitr] start pitr hb
2026-05-09T01:16:58.000+0000 D [pitr] start pitr topo monitor
2026-05-09T01:17:13.000+0000 D [pitr] init pitr meta on the first usage
2026-05-09T01:17:13.000+0000 D [pitr] cluster is ready for nomination
2026-05-09T01:17:13.000+0000 D [pitr] reconciling ready status from all agents
2026-05-09T01:17:14.000+0000 D [pitr] waiting pitr nomination
2026-05-09T01:17:15.000+0000 D [pitr] agents in ready: 6; waiting for agents: 6
2026-05-09T01:17:15.000+0000 D [pitr] cluster leader sets running status
2026-05-09T01:17:15.000+0000 D [pitr] pitr nomination list for popla-58b7d9ccc6-config-server: [[popla-58b7d9ccc6-config-server-1.popla-58b7d9ccc6-config-server-headless.-clou
2026-05-09T01:17:15.000+0000 D [pitr] pitr nomination list for popla-58b7d9ccc6-shard-jg2: []
2026-05-09T01:17:15.000+0000 D [pitr] pitr nomination list for popla-58b7d9ccc6-shard-tl7: [[popla-58b7d9ccc6-shard-tl7-0.popla-58b7d9ccc6-shard-tl7-headless.-cloud-ns.svc.clu
2026-05-09T01:17:15.000+0000 D [pitr] pitr nomination popla-58b7d9ccc6-config-server, set candidates [popla-58b7d9ccc6-config-server-1.popla-58b7d9ccc6-config-server-headless.-
2026-05-09T01:17:15.000+0000 D [pitr] pitr nomination popla-58b7d9ccc6-shard-tl7, set candidates popla-58b7d9ccc6-shard-tl7-0.popla-58b7d9ccc6-shard-tl7-headless.-cloud-ns.sv
2026-05-09T01:17:18.000+0000 D [pitr] skip after pitr nomination, probably started by another node

All agents from the shard `popla-58b7d9ccc6-shard-jg2` have the same logs:

2026-05-09T05:24:34.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:24:49.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:26:49.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:27:04.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:29:19.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:31:19.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:31:34.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:33:34.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:33:49.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:35:49.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:36:04.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:38:04.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:38:19.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:40:19.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:40:34.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:42:34.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:42:49.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:44:49.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:45:04.000+0000 D [pitr] waiting for cluster ready status
2026-05-09T05:47:04.000+0000 E [pitr] init: wait nomination for pitr: confirming ready status: timeout while waiting for ready status
2026-05-09T05:47:19.000+0000 D [pitr] waiting for cluster ready status

For some reason PBM didn't nominate any member of the popla-58b7d9ccc6-shard-jg2 replica set.

[pitr] agents in ready: 6; waiting for agents: 6

The above line confirms that PBM sees just 6 agents, while the cluster appears to have 9 nodes (3 replica sets × 3 nodes).
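
One quick way to tell whether the shard's agents were never registered or were only filtered out later would be to count the registered agents per replica set (a mongosh sketch against the same pbmAgents collection as above):

// sketch: registered agents per replica set; with a 3x3 topology each
// rs should show 3, even if only 6 were counted as "ready"
db.pbmAgents.aggregate([
  { $group: { _id: "$rs", agents: { $sum: 1 } } },
  { $sort: { _id: 1 } }
])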

The pbm status output showed the popla-58b7d9ccc6-shard-jg2 agent nodes in a bad state

If that was the case, the behavior is expected, but we would need to know what type of error that was.

In any case, it's hard to say anything more without the requested information. If the same issue pops up again, please share the pbm status output and the relevant logs.

Hi again,

It was possible to reproduce something similar (the root cause is the same), so please track its progress here: https://perconadev.atlassian.net/browse/PBM-1765