
MongoDB graphs stopped working after 2.11.1 upgrade

Catoman (Contributor)

Hi,

Yesterday I upgraded the server from 2.9.1 to 2.11.1 and added the same client version to the repos. The client was installed on three MongoDB servers this morning at 06:10.

Since then some graphs have stopped working. For example, under MongoDB ReplSet Summary, the Max Member Ping Time, OpLog Recovery Window and Max Heartbeat Time graphs have no data after 06:10, and Replication Lag in the ReplicaSet is now 136 years :-). The other graphs look OK.

I can't see anything obvious in the logs, and I have tried restarting the PMM services on the nodes, but with no result.


Answers

  • Akira Kurogane (Percona Staff)

    Hi Catoman.

    The replication lag value is a known bug - https://jira.percona.com/browse/PMM-6811 - fixed in 2.12.

    The 2.9 -> 2.10 PMM upgrade coincides with the move from mongodb_exporter v0.10.x to v0.20, which is a total rewrite of the Prometheus exporter. All the metric names changed, but a compatibility mode was included which duplicates metrics under their old names through mapping rules.

    It looks as though we don't have compatibility mappings for the metrics behind Max Member Ping Time, Max Heartbeat Time, and OpLog Recovery Window. I'll create a Jira ticket.

  • Akira Kurogane (Percona Staff)

    FWIW I think the new equation (using the new metric names) for ping ms will be:

    mongodb_rs_members_pingMs{service_name=~"$service_name"}

    This is not an average or max; it just shows every member's value for the 'service' (= the cluster replset in focus in the dashboard).

    And for last heartbeat it can be:

    time() - mongodb_members_lastHeartbeat{service_name=~"$service_name"}/1000
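The arithmetic in the heartbeat expression can be checked with a small Python sketch. It assumes, as the division by 1000 suggests, that the metric holds the last heartbeat as epoch milliseconds while `time()` returns epoch seconds. The function name `heartbeat_age_seconds` and the sample values are hypothetical and used only for illustration.

```python
import time


def heartbeat_age_seconds(last_heartbeat_ms, now_s=None):
    """Seconds elapsed since the last heartbeat.

    Mirrors `time() - lastHeartbeat / 1000`: the heartbeat value is assumed
    to be epoch milliseconds, so it is converted to seconds before being
    subtracted from the current epoch time in seconds.
    """
    if now_s is None:
        now_s = time.time()
    return now_s - last_heartbeat_ms / 1000.0


# Hypothetical sample: heartbeat received 1.5 seconds before "now".
now = 1_600_000_010.0
print(heartbeat_age_seconds(1_600_000_008_500, now_s=now))  # 1.5
```

Without the division by 1000 the subtraction would mix seconds and milliseconds and produce a very large negative number instead of an age of a few seconds.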

    By the way, these two graphs are not important ones, in my opinion. Ping time (typically 1ms +/-1) sets a minimum for replication lag, but the maximum replication lag (the thing that matters) is measured at best in 1-second resolution. The 1-second resolution is due to the MongoDB timestamp type, which doesn't discriminate sub-second divisions in a wall-clock time sense. And replset heartbeats happen regularly every two seconds; if they don't, it's because of bigger issues (network partition, node crash, or overload so high it's going to crash soon enough) that you'll notice in other ways: no replset status for nodes, an election, etc.
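The 1-second resolution point can be illustrated with a minimal Python sketch. It models the MongoDB timestamp as a pair of whole epoch seconds plus an ordinal counter that only orders operations within the same second. The class `OplogTimestamp`, the function `replication_lag_seconds` and the sample values are hypothetical; this is not how PMM or the exporter computes the graph.

```python
from typing import NamedTuple


class OplogTimestamp(NamedTuple):
    """Simplified model of a MongoDB timestamp value."""
    seconds: int  # whole seconds since the Unix epoch
    ordinal: int  # orders operations within one second; not a time unit


def replication_lag_seconds(primary: OplogTimestamp,
                            secondary: OplogTimestamp) -> int:
    """Lag in whole seconds; the ordinal carries no wall-clock information."""
    return max(0, primary.seconds - secondary.seconds)


# The secondary is behind by many operations but within the same second,
# so the measurable lag is still 0.
primary = OplogTimestamp(seconds=1_600_000_010, ordinal=250)
secondary = OplogTimestamp(seconds=1_600_000_010, ordinal=3)
print(replication_lag_seconds(primary, secondary))  # 0

# One second later on the primary, the lag jumps straight to 1.
primary = OplogTimestamp(seconds=1_600_000_011, ordinal=1)
print(replication_lag_seconds(primary, secondary))  # 1
```

Under this model a ping time of around 1 ms is far below anything the lag figure can resolve, which is why the ping graph adds little to the picture.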

    ----

    Oplog window is a different matter, still looking into it.

  • Catoman (Contributor)

    Hi Akira,

    Thanks for the info. I'll wait for the upcoming fixes then :-)

    BR

    Johan


MySQL, InnoDB, MariaDB and MongoDB are trademarks of their respective owners.
Copyright ©2005 - 2020 Percona LLC. All rights reserved.