MongoDB graphs stopped working after 2.11.1 upgrade

Hi,

Yesterday I upgraded the server from 2.9.1 to 2.11.1 and added the same client version to the repos. It was installed on three MongoDB servers this morning at 06:10.

Since then some graphs have stopped working. For example, under MongoDB ReplSet Summary, the Max Member Ping Time, OpLog Recovery Window and Max Heartbeat Time graphs have no data after 06:10, and Replication Lag in the ReplicaSet is now 136 years :slight_smile:. Other graphs look OK.

I can't see anything obvious in the logs, and restarting the PMM services on the nodes made no difference.

Hi Catoman.

The replication lag value is a known bug - https://jira.percona.com/browse/PMM-6811 - fixed in 2.12.

The 2.9 → 2.10 PMM upgrade coincides with the v0.10.x → v0.20 release of mongodb_exporter (the Prometheus exporter), which is a total rewrite. All the metric names changed, but a compatibility mode was included that duplicates metrics under their old names through mapping rules.

It looks as though we don't have compatibility mappings for the metrics behind Max Member Ping Time, Max Heartbeat Time, and OpLog Recovery Window. I'll create a Jira ticket.

FWIW I think the new equation (using the new metric names) for ping ms will be:

mongodb_rs_members_pingMs{service_name=~"$service_name"}

This is not an average or max; it just shows all values for the 'service' (= the cluster replset in focus in the dashboard).
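If a single line matching the old "Max Member Ping Time" panel title is wanted, one option is to wrap the query in the standard PromQL `max` aggregation. This is only a sketch, not tested against 2.11.1, and it assumes the metric name above is what the new exporter actually emits:

```
max by (service_name) (mongodb_rs_members_pingMs{service_name=~"$service_name"})
```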

And for last heartbeat it can be:

time() - mongodb_members_lastHeartbeat{service_name=~"$service_name"}/1000
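The division by 1000 suggests the metric is an epoch timestamp in milliseconds, so the expression yields seconds since the last heartbeat (`time()` returns seconds). Assuming that reading is right, a "max" version for the old panel could be sketched the same way, again untested:

```
max by (service_name) (time() - mongodb_members_lastHeartbeat{service_name=~"$service_name"} / 1000)
```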

By the way, these two graphs are not important ones, IMO. Ping time (typically 1ms +/-1) sets a minimum for replication lag, but the maximum replication lag (the thing that matters) is measured at best in 1-second resolution. The 1-second resolution is due to the MongoDB timestamp type, which doesn't discriminate sub-second divisions in a wall-clock sense. And replset heartbeats happen regularly every two seconds; if they don't, it's because of bigger issues (network partition, node crash, or overload so high it's going to crash soon enough) that you'll notice in other ways: no replset status for nodes, an election, etc.


Oplog window is a different matter, still looking into it.

Jira ticket for the oplog window metric: https://jira.percona.com/browse/PMM-6927

Jira ticket for replset member ping time and latest heartbeat time: https://jira.percona.com/browse/PMM-6928

Hi Akira,

Thanks for the info. I'll wait for the upcoming fixes then :slight_smile:

BR

Johan