Zstd sync is crashing the database and restarting from scratch

Hi Team,

To implement ZSTD on MongoDB, we added the ZSTD machine to the replica set as a hidden node so that it could sync up. However, once the sync starts, it eventually loses its connection to the source; the source restarts, the sync source changes to a different node, and the ZSTD sync process starts over from scratch. This has been happening continuously for the past three days.

{"t":{"$date":"2023-05-07T00:07:03.157+00:00"},"s":"I",  "c":"NETWORK",  "id":20125,   "ctx":"ReplCoordExtern-25","msg":"DBClientConnection failed to receive message","attr":{"connString":"qa.aws.local:27017","error":"HostUnreachable: Connection closed by peer"}}
{"t":{"$date":"2023-05-07T00:07:03.158+00:00"},"s":"E",  "c":"INITSYNC", "id":21149,   "ctx":"ReplCoordExtern-25","msg":"Collection clone failed","attr":{"namespace":"DB.Collection","error":"HostUnreachable: network error while attempting to run command 'collStats' on host 'qa.aws.local:27017' "}}
{"t":{"$date":"2023-05-07T00:07:03.158+00:00"},"s":"W",  "c":"INITSYNC", "id":21060,   "ctx":"ReplCoordExtern-25","msg":"Database clone failed","attr":{"dbName":"DB","dbNumber":267,"totalDbs":2072,"error":"InitialSyncFailure: HostUnreachable: Error cloning collection 'DB.Collection' :: caused by :: network error while attempting to run command 'collStats' on host 'qa.aws.local:27017' "}}
{"t":{"$date":"2023-05-07T00:07:03.158+00:00"},"s":"I",  "c":"INITSYNC", "id":21183,   "ctx":"ReplCoordExtern-25","msg":"Finished cloning data. Beginning oplog replay","attr":{"databaseClonerFinishStatus":"InitialSyncFailure: HostUnreachable: Error cloning collection 'DB.Collection' :: caused by :: network error while attempting to run command 'collStats' on host 'qa.aws.local:27017' "}}
{"t":{"$date":"2023-05-07T00:07:03.159+00:00"},"s":"I",  "c":"NETWORK",  "id":20120,   "ctx":"ReplCoordExtern-26","msg":"Trying to reconnect","attr":{"connString":"qa.aws.local:27017 failed"}}
{"t":{"$date":"2023-05-07T00:07:03.159+00:00"},"s":"I",  "c":"NETWORK",  "id":20121,   "ctx":"ReplCoordExtern-26","msg":"Reconnect attempt failed","attr":{"connString":"qa.aws.local:27017 failed","error":""}}
{"t":{"$date":"2023-05-07T00:07:03.159+00:00"},"s":"I",  "c":"NETWORK",  "id":20127,   "ctx":"ReplCoordExtern-26","msg":"DBClientCursor::init call() failed"}
{"t":{"$date":"2023-05-07T00:07:03.159+00:00"},"s":"I",  "c":"INITSYNC", "id":21181,   "ctx":"ReplCoordExtern-26","msg":"Finished fetching oplog during initial sync","attr":{"oplogFetcherFinishStatus":"CallbackCanceled: oplog fetcher shutting down","lastFetched":"{ ts: Timestamp(1683418021, 9), t: 333 }"}}
{"t":{"$date":"2023-05-07T00:07:03.159+00:00"},"s":"I",  "c":"INITSYNC", "id":21191,   "ctx":"ReplCoordExtern-26","msg":"Initial sync attempt finishing up"}
{"t":{"$date":"2023-05-07T00:07:03.162+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn329","msg":"Connection ended","attr":{"remote":"1.1.1.1:54962","connectionId":329,"connectionCount":1}}
{"t":{"$date":"2023-05-07T00:07:03.162+00:00"},"s":"I",  "c":"CONNPOOL", "id":22566,   "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Ending connection due to bad connection status","attr":{"hostAndPort":"qa.aws.local:27017","error":"HostUnreachable: Connection reset by peer","numOpenConns":1}}
{"t":{"$date":"2023-05-07T00:07:03.162+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"crs1","host":"qa.aws.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection reset by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"qa.aws.local:27017","success":false,"errorMessage":"HostUnreachable: Connection reset by peer"}}}}

@Ivan_Groenewold Could you please help me with this?

Hi Sai,

From the log above, this looks like a network stability issue: the host becomes unreachable, and the reconnect attempts fail.
Could you also let us know the architecture of your current deployment and which MongoDB version is in use?
Is this the same cluster you mentioned in the other forum thread (Zstd sync up issue on new replica for mongoDB)?
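
To see which messages dominate around each failure, the structured log can be tallied with a short script. This is a minimal sketch, not a diagnostic tool: it relies only on the "s" (severity), "c" (component), and "msg" fields visible in the log lines above, and the function name `summarize_log` is made up for illustration.

```python
import json
from collections import Counter


def summarize_log(lines):
    """Count structured log entries by (severity, component, message).

    Lines that are empty or not valid JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line, ignore it
        if not isinstance(entry, dict):
            continue
        key = (entry.get("s"), entry.get("c"), entry.get("msg"))
        counts[key] += 1
    return counts
```

To use it, pass an open log file (for example `summarize_log(open(path))`, where `path` is wherever your mongod log is written) and print `counts.most_common(10)`. Entries with severity "E" or "W" in the INITSYNC and NETWORK components are the ones worth comparing against the time the source restarted.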

Regards,
Parag

Hi Parag,

Thank you for the reply.

We are currently using t3.large (2 vCPU, 8 GB RAM) on the Amazon Linux 2 AMI with mongo V4.14. The other thread is about a completely different replica set.

I will also check the server's memory and CPU metrics for spikes and will increase the instance type if required.
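
For the memory check, a minimal sketch along these lines could be run on the source node while the sync is in progress. It assumes a Linux host that exposes /proc/meminfo with a MemAvailable field; the helper names `parse_meminfo` and `available_fraction` are hypothetical, chosen only for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """Return MemAvailable / MemTotal, or None if either field is missing."""
    total = meminfo.get("MemTotal")
    available = meminfo.get("MemAvailable")
    if not total or available is None:
        return None
    return available / total
```

Calling `available_fraction(parse_meminfo(open("/proc/meminfo").read()))` every few seconds and logging the result with a timestamp would show whether available memory drops toward zero shortly before the source restarts.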

Regards,
Sai Teja Varma
+91 9000852599

Hi Parag

We increased the resources to t3.xlarge (double the CPU and RAM), but we are still getting the issue.

Earlier it synced up to 24 GB; after increasing the instance type, it syncs up to 74 GB and then restarts.

Can you please help me with this?