Zstd sync is crashing the database and restarting from scratch

Sai_Teja_Varma · May 10, 2023, 11:23am

Hi Team,

To roll out zstd compression on MongoDB, we added a zstd-configured machine to the replica set as a hidden node so that it can sync up. However, some time after the sync starts, the node loses its connection to the source, the source restarts, and the sync switches to a different node, at which point the zstd sync process starts over from scratch. This has been happening continuously for the past three days.

{"t":{"$date":"2023-05-07T00:07:03.157+00:00"},"s":"I",  "c":"NETWORK",  "id":20125,   "ctx":"ReplCoordExtern-25","msg":"DBClientConnection failed to receive message","attr":{"connString":"qa.aws.local:27017","error":"HostUnreachable: Connection closed by peer"}}
{"t":{"$date":"2023-05-07T00:07:03.158+00:00"},"s":"E",  "c":"INITSYNC", "id":21149,   "ctx":"ReplCoordExtern-25","msg":"Collection clone failed","attr":{"namespace":"DB.Collection","error":"HostUnreachable: network error while attempting to run command 'collStats' on host 'qa.aws.local:27017' "}}
{"t":{"$date":"2023-05-07T00:07:03.158+00:00"},"s":"W",  "c":"INITSYNC", "id":21060,   "ctx":"ReplCoordExtern-25","msg":"Database clone failed","attr":{"dbName":"DB","dbNumber":267,"totalDbs":2072,"error":"InitialSyncFailure: HostUnreachable: Error cloning collection 'DB.Collection' :: caused by :: network error while attempting to run command 'collStats' on host 'qa.aws.local:27017' "}}
{"t":{"$date":"2023-05-07T00:07:03.158+00:00"},"s":"I",  "c":"INITSYNC", "id":21183,   "ctx":"ReplCoordExtern-25","msg":"Finished cloning data. Beginning oplog replay","attr":{"databaseClonerFinishStatus":"InitialSyncFailure: HostUnreachable: Error cloning collection 'DB.Collection' :: caused by :: network error while attempting to run command 'collStats' on host 'qa.aws.local:27017' "}}
{"t":{"$date":"2023-05-07T00:07:03.159+00:00"},"s":"I",  "c":"NETWORK",  "id":20120,   "ctx":"ReplCoordExtern-26","msg":"Trying to reconnect","attr":{"connString":"qa.aws.local:27017 failed"}}
{"t":{"$date":"2023-05-07T00:07:03.159+00:00"},"s":"I",  "c":"NETWORK",  "id":20121,   "ctx":"ReplCoordExtern-26","msg":"Reconnect attempt failed","attr":{"connString":"qa.aws.local:27017 failed","error":""}}
{"t":{"$date":"2023-05-07T00:07:03.159+00:00"},"s":"I",  "c":"NETWORK",  "id":20127,   "ctx":"ReplCoordExtern-26","msg":"DBClientCursor::init call() failed"}
{"t":{"$date":"2023-05-07T00:07:03.159+00:00"},"s":"I",  "c":"INITSYNC", "id":21181,   "ctx":"ReplCoordExtern-26","msg":"Finished fetching oplog during initial sync","attr":{"oplogFetcherFinishStatus":"CallbackCanceled: oplog fetcher shutting down","lastFetched":"{ ts: Timestamp(1683418021, 9), t: 333 }"}}
{"t":{"$date":"2023-05-07T00:07:03.159+00:00"},"s":"I",  "c":"INITSYNC", "id":21191,   "ctx":"ReplCoordExtern-26","msg":"Initial sync attempt finishing up"}
{"t":{"$date":"2023-05-07T00:07:03.162+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn329","msg":"Connection ended","attr":{"remote":"1.1.1.1:54962","connectionId":329,"connectionCount":1}}
{"t":{"$date":"2023-05-07T00:07:03.162+00:00"},"s":"I",  "c":"CONNPOOL", "id":22566,   "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Ending connection due to bad connection status","attr":{"hostAndPort":"qa.aws.local:27017","error":"HostUnreachable: Connection reset by peer","numOpenConns":1}}
{"t":{"$date":"2023-05-07T00:07:03.162+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"crs1","host":"qa.aws.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection reset by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"qa.aws.local:27017","success":false,"errorMessage":"HostUnreachable: Connection reset by peer"}}}}

@Ivan_Groenewold Could you please help me with this?

Parag_Bhayani · May 11, 2023, 3:31pm

Hi Sai,

From the above log, this looks like a network stability issue that is making the host unreachable. The node tries to reconnect, but the attempt fails.
Also, please let us know the architecture of your current deployment and which MongoDB version is being used.
Is this the same cluster that you mentioned in the other forum topic (Zstd sync up issue on new replica for mongoDB)?
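
To see how often these network errors occur, a minimal Python sketch like the one below can tally the structured mongod log lines. It assumes the JSON log format shown above (fields `t`, `s`, `c`, `msg`, `attr`); the function names `summarize_mongod_log` and `host_unreachable_events` are hypothetical helpers, not part of any MongoDB tooling.

```python
import json
from collections import Counter


def _parse(line):
    """Parse one structured mongod log line; return None if it is not a JSON object."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    return entry if isinstance(entry, dict) else None


def summarize_mongod_log(lines, severities=("E", "W")):
    """Count entries per (severity, component, message) for the given severities."""
    counts = Counter()
    for line in lines:
        entry = _parse(line.strip())
        if entry is None:
            continue  # skip blank or non-JSON lines
        if entry.get("s") in severities:
            counts[(entry.get("s"), entry.get("c"), entry.get("msg"))] += 1
    return counts


def host_unreachable_events(lines):
    """Return (timestamp, component, message) for entries mentioning HostUnreachable."""
    events = []
    for line in lines:
        entry = _parse(line.strip())
        if entry is None:
            continue
        # Search the serialized attributes, since the error text can be nested.
        if "HostUnreachable" in json.dumps(entry.get("attr", {})):
            timestamp = entry.get("t", {}).get("$date")
            events.append((timestamp, entry.get("c"), entry.get("msg")))
    return events
```

Both functions accept any iterable of lines, so an open log file can be passed directly. The location of the log file depends on your mongod configuration. If the HostUnreachable timestamps cluster around the restarts of the sync source, that would point at the source node itself rather than at the network path between the nodes.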

Regards,
Parag

Sai_Teja_Varma · May 12, 2023, 5:22am

Hi Parag,

Thank you for the reply.

We are currently using t3.large instances (2 vCPU, 8 GB RAM) on the Amazon Linux 2 AMI with mongo V4.14, and the other topic is about a totally different replica set.

I will also check the server's memory and CPU metrics for spikes and move to a larger instance type if required.

Regards,
Sai Teja Varma
+91 9000852599

Sai_Teja_Varma · May 15, 2023, 6:46am

Hi Parag,

We increased the resources to t3.xlarge (double the CPU and RAM), but we are still getting the issue.

Earlier the sync got up to 24 GB; after increasing the instance type, it gets up to 74 GB and then restarts.

Can you please help me with it?

Ahmed_Asim · July 25, 2023, 9:22pm

Hi @Sai_Teja_Varma ,

I hope you managed to fix it. If so, could you please share how? I have been struggling with this issue for two weeks now with zero progress, and I would really appreciate your reply.
