MongoDB graphs stopped working after 2.11.1 upgrade

Catoman · November 11, 2020, 7:33am

Hi,

Yesterday I upgraded the server from 2.9.1 to 2.11.1 and added the same client version to the repos. The client was installed on three MongoDB servers this morning at 06:10.

Since then some graphs have stopped working. For example, under MongoDB ReplSet Summary, the Max Member Ping Time, OpLog Recovery Window and Max Heartbeat Time graphs have no data after 06:10, and Replication Lag in the ReplicaSet now shows 136 years. Other graphs look OK.

I can’t see anything obvious in the logs, and I have tried restarting the PMM services on the nodes, but with no result.

Akira_Kurogane · November 11, 2020, 6:47pm

Hi Catoman.

The replication lag value is a known bug - https://jira.percona.com/browse/PMM-6811 - fixed in 2.12.

The 2.9 → 2.10 PMM upgrade coincides with the move from mongodb_exporter v0.10.x to v0.20, which is a total rewrite of the Prometheus exporter. All the metric names changed, but a compatibility mode was included that duplicates metrics under their old names through mapping rules.

It looks as though we don’t have compatibility mappings for the metrics behind Max member ping time, max heartbeat time, and oplog recovery window. I’ll create a Jira ticket.
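To illustrate the idea of the compatibility mode, here is a rough Python sketch. It is not the exporter's actual code: the rule table, the function name and all metric names in it are made up for illustration. It only shows why a metric without a mapping rule never appears under its old name, which is what leaves a dashboard panel that still queries the old name empty.

```python
# Hypothetical rule table: new-style metric name -> old-style metric name.
# All names here are placeholders, not real mongodb_exporter metrics.
COMPAT_RULES = {
    "new_metric_a": "old_metric_a",
}


def with_compat_names(samples, rules=COMPAT_RULES):
    """Return the samples plus a duplicate under the old name for each mapped metric."""
    out = dict(samples)
    for new_name, value in samples.items():
        old_name = rules.get(new_name)
        if old_name is not None:
            out[old_name] = value
    return out


scraped = {
    "new_metric_a": 1.0,  # has a rule, so it is also exported as old_metric_a
    "new_metric_b": 2.0,  # no rule, so no old-style copy exists
}
exported = with_compat_names(scraped)
print(sorted(exported))
```

A panel that still queries a hypothetical `old_metric_b` would find nothing in `exported`, even though the underlying value is being collected as `new_metric_b`.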

Akira_Kurogane · November 11, 2020, 7:44pm

FWIW I think the new equation (using the new metric names) for ping ms will be:

mongodb_rs_members_pingMs{service_name=~"$service_name"}

This is not an average or a max; it just shows every value for the ‘service’ (= the cluster replset in focus in the dashboard).

And for last heartbeat it can be:

time() - mongodb_members_lastHeartbeat{service_name=~"$service_name"}/1000
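As a quick check of the arithmetic in that expression, here is a small Python sketch. It assumes the metric holds a Unix timestamp in milliseconds (which is what the /1000 suggests); the function name and the sample values are invented. Note that in PromQL, as in Python, the division binds tighter than the subtraction, so only the metric is divided by 1000.

```python
import time


def heartbeat_age_seconds(last_heartbeat_ms, now_s=None):
    """Seconds elapsed since the last heartbeat.

    last_heartbeat_ms: assumed to be a Unix timestamp in milliseconds.
    now_s: current Unix time in seconds; defaults to the wall clock.
    """
    if now_s is None:
        now_s = time.time()
    # Same shape as the query: time() - metric / 1000
    return now_s - last_heartbeat_ms / 1000.0


# Invented sample: heartbeat recorded 2 seconds before "now".
age = heartbeat_age_seconds(1605120000000, now_s=1605120002.0)
print(age)
```

With heartbeats every two seconds, a healthy member should show an age that stays in the low single digits of seconds.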

By the way, these two graphs are not important ones, IMO. Ping time (typically 1ms +/-1) sets a minimum for replication lag, but the maximum replication lag (the thing that matters) is measured at best in 1-second resolution. The 1-second resolution is due to the MongoDB timestamp type, which doesn’t discriminate sub-second divisions in a wall-clock time sense. And replset heartbeats happen regularly every two seconds. If they don’t, it’s because of bigger issues (a network partition, a node crash, or an overload so high the node is going to crash soon enough) that you’ll notice in other ways: no replset status for nodes, an election, etc.

Oplog window is a different matter, still looking into it.

Akira_Kurogane · November 11, 2020, 8:57pm

Jira ticket for the oplog window metric: https://jira.percona.com/browse/PMM-6927

Jira ticket for replset member ping time and latest heartbeat time: https://jira.percona.com/browse/PMM-6928

Catoman · November 12, 2020, 3:02am

Hi Akira,

Thanks for the info. I’ll wait for the upcoming fixes then

BR

Johan
