Replication lag for more than 10 hours

Hi Team,

node 1 - primary
node 2 - secondary
node 3 - secondary

Node 1 and node 2 are in the same datacenter and are always in sync, but node 3 is in a different datacenter and we see it always lagging many hours behind the primary.

We have also increased the oplog size to about 4 TB, which is quite large; in our case a lot of data gets loaded and modified every minute.

Memory usage and load average are both low, and the nodes run on bare-metal servers. When we checked with the network team, they said there is no issue between the two datacenters. After 35 hours of lag, node 3 goes into the RECOVERING state.

We have done an initial sync twice. The node changes from STARTUP2 to SECONDARY, but I still see it lagging by 2 hours at that point, and the lag keeps growing. I have also tried restarting the node, but the lag still increases.

Can anyone please help with what else I need to do to bring this node back in sync with the primary? This is a production environment.

Thanks in advance.

Hi Mamantha.

It sounds like node 3 has lower capacity, or there is a network bottleneck between it and the node it is syncing from. That is probably the primary; you can easily find out by looking at rs.status().
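As a rough sketch of what to look for in that output, the helper below takes a document shaped like the result of rs.status() and reports each non-primary member's sync source and how far its last applied operation is behind the primary's. It reads only the per-member fields name, stateStr, optimeDate and syncSourceHost; the function name summarizeLag and the sample hostnames and timestamps are made up for illustration.

```javascript
// Summarize replication lag from an rs.status()-shaped document.
// Reads only per-member fields that rs.status() reports:
// name, stateStr, optimeDate, syncSourceHost.
function summarizeLag(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no PRIMARY found in replica set status");
  }
  return status.members
    .filter((m) => m.stateStr !== "PRIMARY" && m.optimeDate)
    .map((m) => ({
      name: m.name,
      state: m.stateStr,
      // An empty syncSourceHost means the member is not currently syncing.
      syncSource: m.syncSourceHost || "(none)",
      // Date subtraction yields milliseconds; clamp small negative skew to 0.
      lagSeconds: Math.max(0, (primary.optimeDate - m.optimeDate) / 1000),
    }));
}

// Hypothetical sample data (hostnames and times are invented).
const sampleStatus = {
  members: [
    { name: "node1:27017", stateStr: "PRIMARY",
      optimeDate: new Date("2024-01-01T12:00:00Z"), syncSourceHost: "" },
    { name: "node2:27017", stateStr: "SECONDARY",
      optimeDate: new Date("2024-01-01T12:00:00Z"), syncSourceHost: "node1:27017" },
    { name: "node3:27017", stateStr: "SECONDARY",
      optimeDate: new Date("2024-01-01T10:00:00Z"), syncSourceHost: "node2:27017" },
  ],
};

console.log(summarizeLag(sampleStatus));
```

In mongosh you would call summarizeLag(rs.status()) instead of passing the sample. If node 3 turns out to be syncing from node 2 rather than the primary, then the path to check is the one between node 2 and node 3.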

If there is no disk usage strain on node 3, especially during the WiredTiger checkpoint that happens every minute, then that would point to a network bottleneck.
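One way to sanity-check the network theory is to estimate how much oplog data the primary generates per second and compare it with the throughput you actually measure between the datacenters. The sketch below is a back-of-the-envelope calculation using the usedMB and timeDiffHours fields returned by db.getReplicationInfo(). The function name requiredMbps and the sample numbers are hypothetical, and the estimate ignores wire compression and protocol overhead, so treat the result as an order of magnitude only.

```javascript
// Estimate the sustained bandwidth a secondary needs just to keep pace
// with the primary's oplog, from db.getReplicationInfo()-style fields:
//   usedMB        - data currently held in the oplog, in megabytes
//   timeDiffHours - time span between first and last oplog entry, in hours
function requiredMbps(info) {
  if (!(info.timeDiffHours > 0)) {
    throw new Error("timeDiffHours must be positive");
  }
  const megabytesPerSecond = info.usedMB / (info.timeDiffHours * 3600);
  return megabytesPerSecond * 8; // megabytes/s -> megabits/s
}

// Hypothetical figures: about 3.6 TB of oplog covering 40 hours of writes.
const sampleInfo = { usedMB: 3600000, timeDiffHours: 40 };

console.log(requiredMbps(sampleInfo) + " Mbit/s");
```

In mongosh, run against the primary, the call would be requiredMbps(db.getReplicationInfo()). If the result is close to or above what a single connection between the two datacenters can sustain, node 3 cannot catch up no matter how large the oplog is.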

I haven’t considered CPU as a bottleneck because that’s rare with all the cores modern servers have now. But if node 3 has only a few cores, say 4 or fewer, then CPU should be considered too.