MongoDB graphs stopped working after 2.11.1 upgrade

Hi,

Yesterday I upgraded the server from 2.9.1 to 2.11.1 and added the same version of the client to the repos. It was installed on three MongoDB servers this morning at 06:10.

Since then, some graphs have stopped working. For example, under MongoDB ReplSet Summary, Max Member Ping Time, OpLog Recovery Window and Max Heartbeat Time have no data after 06:10, and Replication Lag in the ReplicaSet is now 136 years :slight_smile:. The other graphs look OK.

I can’t see anything obvious in the logs, and restarting the PMM services on the nodes made no difference.

Hi Catoman.

The replication lag value is a known bug - https://jira.percona.com/browse/PMM-6811 - fixed in 2.12.

The 2.9 -> 2.10 PMM upgrade coincides with the v0.10.x -> v0.20 release of mongodb_exporter, the Prometheus exporter for MongoDB, which is a total rewrite. All the metric names changed, but a compatibility mode was included that duplicates metrics under their old names through mapping rules.

It looks as though we don’t have compatibility mappings for the metrics behind Max Member Ping Time, Max Heartbeat Time, and OpLog Recovery Window. I’ll create a Jira ticket.

FWIW I think the new equation (using the new metric names) for ping ms will be:

mongodb_rs_members_pingMs{service_name=~"$service_name"}

This is not an average or a max; it simply shows the value for every member of the ‘service’ (= the cluster replset in focus in the dashboard).

And for last heartbeat it can be:

time() - mongodb_members_lastHeartbeat{service_name=~"$service_name"}/1000
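To spell out the arithmetic in that expression, here is a minimal Python sketch. It assumes the metric is an epoch timestamp in milliseconds (which the /1000 implies); the function name and the sample values are made up for illustration.

```python
def heartbeat_age_seconds(now_s: float, last_heartbeat_ms: float) -> float:
    """Seconds elapsed since the last heartbeat.

    Mirrors time() - lastHeartbeat/1000: the metric is assumed to be
    epoch milliseconds, so it is converted to seconds before subtracting.
    """
    return now_s - last_heartbeat_ms / 1000.0


# Hypothetical sample: a heartbeat recorded 2 seconds before "now".
age = heartbeat_age_seconds(1_600_000_010.0, 1_600_000_008_000)
print(f"last heartbeat was {age:.1f}s ago")
```

With heartbeats every two seconds, a healthy member should normally show an age of roughly 0 to 2 seconds here.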

By the way, these two graphs are not important ones, in my opinion. Ping time (typically 1ms +/-1) sets a minimum for replication lag, but the maximum replication lag (the thing that matters) is measured at best in 1-second resolution. The 1-second resolution is due to the MongoDB timestamp type, which doesn’t discriminate sub-second divisions in a wall-clock time sense. And replset heartbeats happen regularly every two seconds; if they don’t, it’s because of bigger issues (network partition, node crash, or overload so high it’s going to crash soon enough) that you’ll notice in other ways: no replset status for nodes, an election, etc.
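To illustrate the 1-second resolution point, here is a rough Python sketch. A BSON Timestamp is a 64-bit value whose high 32 bits are seconds since the epoch and whose low 32 bits are an incrementing ordinal within that second, not a sub-second clock. The function names are hypothetical, and computing lag as a plain difference of the seconds part is an assumption for illustration, not necessarily how PMM derives the graph.

```python
from typing import Tuple


def split_bson_timestamp(value: int) -> Tuple[int, int]:
    """Split a 64-bit BSON Timestamp into (epoch seconds, ordinal)."""
    seconds = value >> 32          # high 32 bits: wall-clock seconds
    ordinal = value & 0xFFFFFFFF   # low 32 bits: counter within that second
    return seconds, ordinal


def lag_seconds(primary_ts: int, secondary_ts: int) -> int:
    """Lag in whole seconds; the ordinal carries no wall-clock meaning."""
    return split_bson_timestamp(primary_ts)[0] - split_bson_timestamp(secondary_ts)[0]


# Hypothetical optimes: same second, different ordinals.
primary = (1_600_000_000 << 32) | 50
secondary = (1_600_000_000 << 32) | 3
print(lag_seconds(primary, secondary))
```

Two operations applied within the same second differ only in the ordinal, so any lag smaller than a second is invisible at this level.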


Oplog window is a different matter, still looking into it.

Jira ticket for the oplog window metric: https://jira.percona.com/browse/PMM-6927

Jira ticket for replset member ping time and latest heartbeat time: https://jira.percona.com/browse/PMM-6928

Hi Akira,

Thanks for the info. I’ll wait for the upcoming fixes then :slight_smile:

BR

Johan