Backup failed with error: "upload to GCS: 502 Bad Gateway"

Hi guys,
I’ve been running into this problem lately when performing backups on my MongoDB sharded cluster. Here are the status output and the backup log. Hope you can help me solve this, thank you.

Cluster:
========
configRS:
  - configRS/b2b-mongo-1:57017: pbm-agent v1.7.0 OK
  - configRS/b2b-mongo-3:57017: pbm-agent v1.7.0 OK
  - configRS/b2b-mongo-4:57017: pbm-agent v1.7.0 OK
myShard_0:
  - myShard_0/b2b-mongo-1:37017: pbm-agent v1.7.0 OK
  - myShard_0/b2b-mongo-2:37017: pbm-agent v1.7.0 OK
  - myShard_0/b2b-mongo-4:37017: pbm-agent v1.7.0 OK
myShard_1:
  - myShard_1/b2b-mongo-2:47017: pbm-agent v1.7.0 OK
  - myShard_1/b2b-mongo-3:47017: pbm-agent v1.7.0 OK
  - myShard_1/b2b-mongo-4:47017: pbm-agent v1.7.0 OK


PITR incremental backup:
========================
Status [ON]

Currently running:
==================
(none)

Backups:
========
S3 asia-east1 s3://https://storage.googleapis.com/omd-mongodb/pbm/backup
  Snapshots:
    2022-06-22T18:00:01Z 0.00B [ERROR: check cluster for dump done: convergeCluster: backup on shard myShard_1 failed with: ] [2022-06-22T20:06:34]
    2022-06-19T18:00:01Z 0.00B [ERROR: check cluster for dump done: convergeCluster: backup on shard myShard_0 failed with: ] [2022-06-19T20:08:13]
    2022-06-15T18:00:01Z 0.00B [ERROR: check cluster for dump done: convergeCluster: backup on shard myShard_1 failed with: ] [2022-06-15T20:02:22]
    2022-06-12T18:00:01Z 476.81GB <logical> [complete: 2022-06-12T20:11:59]
    2022-06-08T18:00:01Z 479.73GB <logical> [complete: 2022-06-08T20:17:03]
    2022-06-05T18:00:01Z 472.58GB <logical> [complete: 2022-06-05T20:10:38]
    2022-06-01T18:00:01Z 484.08GB <logical> [complete: 2022-06-01T20:17:48]
    2022-05-28T18:00:01Z 485.72GB <logical> [complete: 2022-05-28T20:24:22]
  PITR chunks [249.29GB]:
    2022-05-28T20:24:23 - 2022-06-16T02:30:5
2022-06-22T18:00:02Z I [myShard_1/b2b-mongo-3:47017] [backup/2022-06-22T18:00:01Z] backup started
2022-06-22T18:00:02Z I [myShard_0/b2b-mongo-2:37017] [backup/2022-06-22T18:00:01Z] backup started
2022-06-22T18:00:02Z I [configRS/b2b-mongo-1:57017] [backup/2022-06-22T18:00:01Z] backup started
2022-06-22T18:00:07Z I [configRS/b2b-mongo-1:57017] [backup/2022-06-22T18:00:01Z] mongodump finished, waiting for the oplog
2022-06-22T20:06:33Z I [myShard_1/b2b-mongo-3:47017] [backup/2022-06-22T18:00:01Z] dropping tmp collections
2022-06-22T20:06:33Z I [myShard_1/b2b-mongo-3:47017] [backup/2022-06-22T18:00:01Z] mark RS as error `mongodump: write data: upload to GCS: 502 Bad Gateway.`: <nil>
2022-06-22T20:06:33Z E [myShard_1/b2b-mongo-3:47017] [backup/2022-06-22T18:00:01Z] backup: mongodump: write data: upload to GCS: 502 Bad Gateway.
2022-06-22T20:06:34Z I [configRS/b2b-mongo-1:57017] [backup/2022-06-22T18:00:01Z] dropping tmp collections
2022-06-22T20:06:34Z I [configRS/b2b-mongo-1:57017] [backup/2022-06-22T18:00:01Z] mark RS as error `check cluster for dump done: convergeCluster: backup on shard myShard_1 failed with: `: <nil>
2022-06-22T20:06:34Z I [configRS/b2b-mongo-1:57017] [backup/2022-06-22T18:00:01Z] mark backup as error `check cluster for dump done: convergeCluster: backup on shard myShard_1 failed with: `: <nil>
2022-06-22T20:06:34Z E [configRS/b2b-mongo-1:57017] [backup/2022-06-22T18:00:01Z] backup: check cluster for dump done: convergeCluster: backup on shard myShard_1 failed with: 
2022-06-22T20:41:53Z I [myShard_0/b2b-mongo-2:37017] [backup/2022-06-22T18:00:01Z] mongodump finished, waiting for the oplog
2022-06-22T20:41:54Z I [myShard_0/b2b-mongo-2:37017] [backup/2022-06-22T18:00:01Z] dropping tmp collections
2022-06-22T20:41:54Z I [myShard_0/b2b-mongo-2:37017] [backup/2022-06-22T18:00:01Z] mark RS as error `waiting for dump done: backup stuck, last beat ts: 1655928392`: <nil>
2022-06-22T20:41:54Z E [myShard_0/b2b-mongo-2:37017] [backup/2022-06-22T18:00:01Z] backup: waiting for dump done: backup stuck, last beat ts: 1655928392

Hello, as you can see, the problem is a communication issue when uploading to GCS. Have you tested that all servers on all shards can write to GCS? Have you checked the logs on the Google side?
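For example, here is a rough write test you could run from each node. It is only a sketch: the bucket and prefix are taken from your status output, the HMAC key environment variable names are placeholders, and I’m assuming you authenticate with HMAC keys against GCS’s S3-compatible endpoint, which is how PBM talks to it.

# Quick per-node write test against GCS via its S3-compatible endpoint.
import os
import socket

import boto3  # pip install boto3

BUCKET = "omd-mongodb"   # from your pbm status output
PREFIX = "pbm/backup"    # from your pbm status output

s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id=os.environ["GCS_HMAC_ACCESS_KEY"],      # placeholder env var name
    aws_secret_access_key=os.environ["GCS_HMAC_SECRET_KEY"],  # placeholder env var name
)

# Upload a tiny object named after the host so you can see which nodes succeed.
key = f"{PREFIX}/write-test-{socket.gethostname()}"
s3.put_object(Bucket=BUCKET, Key=key, Body=b"pbm write test")
print("wrote", key)

If this succeeds on every node but the backup still fails hours in (your error appears around the two-hour mark), that points more at transient 5xx errors on long uploads than at permissions.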

Hi Igroene, the problem went away after upgrading from PBM v1.6.1 to v1.7.0 and adding a retryer to the config. I think it’s because the backup set is relatively large, and an unstable network can cause the writes to GCS to fail.
For anyone facing the same problem, here is what I added to the config after upgrading to v1.7.0:

storage:
...
    retryer:
      numMaxRetries: 10
      minRetryDelay: 30
      maxRetryDelay: 5

I’m not sure if this is a permanent fix, but it’s worth a try.
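A couple of notes in case it helps: as far as I understand from the PBM docs, the retryer block goes under storage.s3 (the "..." above just stands for the rest of my storage settings), and minRetryDelay is in milliseconds while maxRetryDelay is in minutes, so 30/5 is not a typo. Roughly, the storage section looks like this (the credentials below are placeholders, not my real values):

storage:
  type: s3
  s3:
    region: asia-east1
    endpointUrl: https://storage.googleapis.com
    bucket: omd-mongodb
    prefix: pbm/backup
    credentials:
      access-key-id: <HMAC access key>
      secret-access-key: <HMAC secret>
    retryer:
      numMaxRetries: 10
      minRetryDelay: 30
      maxRetryDelay: 5

After editing the file I re-applied it with pbm config --file <config file>.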
