PBM Fails incremental backup with: ERROR: check cluster for backup done: convergeCluster: backup on shard mongo_data_rs10 failed with: %!s(<nil>)

Hi,
We are seeing some issues when running a PBM incremental base backup; the error message we get is:
ERROR: check cluster for backup done: convergeCluster: backup on shard mongo_data_rs10 failed with: %!s(<nil>)
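
For reference, the backup in question is started as an incremental base, roughly like this (the exact invocation may vary):

pbm backup --type=incremental-base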

the backup logs:

2025-03-19T16:28:00Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/collection-3--2064123749121529916.wt [0:479232] 468.00KB
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/index-5--3410557875914912461.wt [0:253952] 248.00KB
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/index-70-1975402631693463105.wt [0:49152] 48.00KB
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/index-42411--2854269189489801028.wt [0:32768] 32.00KB
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/index-42591--2854269189489801028.wt [0:98304] 96.00KB
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/index-26--4382307725957457486.wt [0:4096] 4.00KB
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/index-42459--2854269189489801028.wt [0:512000] 500.00KB
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/index-23--3768745852357879697.wt [0:249856] 244.00KB
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/storage.bson [0:114] 114.00B
2025-03-19T16:28:01Z I [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading data done
2025-03-19T16:28:01Z I [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading journals
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/journal/WiredTigerLog.0000031853 100.00MB
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading: /data/db/journal/WiredTigerLog.0000031854 100.00MB
2025-03-19T16:28:01Z I [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] uploading journals done
2025-03-19T16:28:01Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] stop cursor polling: <nil>, cursor err: <nil>
2025-03-19T16:28:02Z I [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] mark RS as error `waiting for done: backup stuck, last beat ts: 1742396422`: <nil>
2025-03-19T16:28:02Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] set balancer on
2025-03-19T16:28:02Z E [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] backup: waiting for done: backup stuck, last beat ts: 1742396422
2025-03-19T16:28:02Z D [mongo_data_rs1/mongodb_content_rs1n1:27017] [backup/2025-03-19T13:33:56Z] releasing lock

I can't seem to find anything related to this anywhere; I hope you can point me in the right direction or have a solution :slight_smile:

Hi, most likely one of the agents ran into an issue. Please check the pbm-agent logs on the other nodes for clues.
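
For example, something like the following should pull the per-node events for that backup (flag names may vary slightly between PBM versions, and pbm-agent below assumes the agent runs as a systemd service under that name):

pbm logs --tail 200 --severity D --event backup/2025-03-19T13:33:56Z
journalctl -u pbm-agent --since "2025-03-19 13:30"   # on a suspect node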

Thanks for the answer. I'm trying to find something in the agent logs now, but I also just found this in the backup log:

2025-03-19T15:00:23Z D [mongo_data_rs10/mongodb_content_rs10n2:27017] [backup/2025-03-19T13:33:56Z] stop cursor polling: <nil>, cursor err: connection pool for 127.0.0.1:27017 was cleared because another operation failed with:  connection(127.0.0.1:27017[-3153]) incomplete read of message header: read tcp 127.0.0.1:47900->127.0.0.1:27017: i/o timeout: connection(127.0.0.1:27017[-3153]) incomplete read of message header: read tcp 127.0.0.1:47900->127.0.0.1:27017: i/o timeout
2025-03-19T15:00:23Z I [mongo_data_rs10/mongodb_content_rs10n2:27017] [backup/2025-03-19T13:33:56Z] mark RS as error `upload file `/data/db/journal/WiredTigerLog.0000027827`: get file stat: stat /data/db/journal/WiredTigerLog.0000027827: no such file or directory`: <nil>
2025-03-19T15:00:23Z D [mongo_data_rs10/mongodb_content_rs10n2:27017] [backup/2025-03-19T13:33:56Z] set balancer on
2025-03-19T15:00:23Z E [mongo_data_rs10/mongodb_content_rs10n2:27017] [backup/2025-03-19T13:33:56Z] backup: upload file `/data/db/journal/WiredTigerLog.0000027827`: get file stat: stat /data/db/journal/WiredTigerLog.0000027827: no such file or directory
2025-03-19T15:00:23Z D [mongo_data_rs10/mongodb_content_rs10n2:27017] [backup/2025-03-19T13:33:56Z] releasing lock

Here is the pbm status output for the cluster:

Cluster:
========
mongo_data_rs1:
  - mongo_data_rs1/mongodb_content_rs1n1:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs1/mongodb_content_rs1n2:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs1/mongodb_content_rs1n3:27017 [!Arbiter]: arbiter node is not supported
mongo_data_rs8:
  - mongo_data_rs8/mongodb_content_rs8n1:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs8/mongodb_content_rs8n2:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs8/mongodb_content_rs8n3:27017 [!Arbiter]: arbiter node is not supported
mongo_data_rs3:
  - mongo_data_rs3/mongodb_content_rs3n1:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs3/mongodb_content_rs3n2:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs3/mongodb_content_rs3n3:27017 [!Arbiter]: arbiter node is not supported
mongo_data_rs2:
  - mongo_data_rs2/mongodb_content_rs2n1:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs2/mongodb_content_rs2n2:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs2/mongodb_content_rs2n3:27017 [!Arbiter]: arbiter node is not supported
mongo_conf:
  - mongo_conf/mongodb_content_cfg_server1:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_conf/mongodb_content_cfg_server2:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_conf/mongodb_content_cfg_server3:27017 [S]: pbm-agent v2.3.1 OK
mongo_data_rs10:
  - mongo_data_rs10/mongodb_content_rs10n1:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs10/mongodb_content_rs10n2:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs10/mongodb_content_rs10n3:27017 [!Arbiter]: arbiter node is not supported
mongo_data_rs12:
  - mongo_data_rs12/mongodb_content_rs12n1:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs12/mongodb_content_rs12n2:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs12/mongodb_content_rs12n3:27017 [!Arbiter]: arbiter node is not supported
mongo_data_rs7:
  - mongo_data_rs7/mongodb_content_rs7n1:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs7/mongodb_content_rs7n2:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs7/mongodb_content_rs7n3:27017 [!Arbiter]: arbiter node is not supported
mongo_data_rs9:
  - mongo_data_rs9/mongodb_content_rs9n1:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs9/mongodb_content_rs9n2:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs9/mongodb_content_rs9n3:27017 [!Arbiter]: arbiter node is not supported
mongo_data_rs5:
  - mongo_data_rs5/mongodb_content_rs5n1:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs5/mongodb_content_rs5n2:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs5/mongodb_content_rs5n3:27017 [!Arbiter]: arbiter node is not supported
mongo_data_rs4:
  - mongo_data_rs4/mongodb_content_rs4n1:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs4/mongodb_content_rs4n2:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs4/mongodb_content_rs4n3:27017 [!Arbiter]: arbiter node is not supported
mongo_data_rs11:
  - mongo_data_rs11/mongodb_content_rs11n1:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs11/mongodb_content_rs11n2:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs11/mongodb_content_rs11n3:27017 [!Arbiter]: arbiter node is not supported
mongo_data_rs6:
  - mongo_data_rs6/mongodb_content_rs6n1:27017 [P]: pbm-agent v2.3.1 OK
  - mongo_data_rs6/mongodb_content_rs6n2:27017 [S]: pbm-agent v2.3.1 OK
  - mongo_data_rs6/mongodb_content_rs6n3:27017 [!Arbiter]: arbiter node is not supported

I don't know if that gives more insight into what's happening?

Our oplog size is set to 200 GB, and the backup size is about 10 TB.
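
If it helps to cross-check, the configured oplog size and the current oplog window can be verified on each shard primary with something like:

mongosh --eval 'rs.printReplicationInfo()'   # or via the legacy mongo shell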

Hi, very likely you are hitting a bug in the oplog dump/upload. We've fixed a few of them since PBM 2.3.1 and made the process able to auto-retry if it fails. I suggest you upgrade ASAP to the latest PBM 2.9.0 and try again.
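
If PBM was installed from the Percona repositories, the upgrade is roughly the following (repo, package, and service names assume the standard Percona packages; use yum instead of apt on RHEL-based systems):

percona-release enable pbm release
apt update && apt install percona-backup-mongodb
systemctl restart pbm-agent   # on every node that runs an agent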

Thanks a lot for the answer!

Can you tell me which version it's fixed in? We are running MongoDB 5.0 right now and can't just upgrade to 6, 7, or 8 right away, so I'm hoping it's fixed in PBM 2.7.0?

/Morten

Even in 2.7.0 I believe most of them are fixed, so I suggest you upgrade. You can check the release notes for more info.