Mongod - "Detected unclean shutdown", never a clean shutdown

Hello

It seems there is never a clean shutdown, even when the cluster is paused and resumed via pause = true|false. The result is that the following line is printed and mongod starts recovery:

{"t":{"$date":"2022-11-12T16:08:07.386+00:00"},"s":"W", "c":"STORAGE", "id":22271, "ctx":"initandlisten","msg":"Detected unclean shutdown - Lock file is not empty","attr":{"lockFile":"/data/db/mongod.lock"}}

In our case this recovery takes > 1h, which is not acceptable.
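
For context, we pause and resume the cluster by toggling spec.pause in the PerconaServerMongoDB custom resource, roughly like this (cluster name is illustrative, and the apiVersion suffix may differ depending on the operator release):

# Fragment of the PerconaServerMongoDB CR we apply to pause/resume
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: my-cluster
spec:
  pause: true   # set back to false to resume the cluster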

  1. Is there a way to shut down properly?

  2. If that is not possible, is there a way to delete mongod.lock via Kubernetes before mongod starts? (A rough sketch of what I have in mind is below.)
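
Something along these lines is what I mean, assuming the mongod StatefulSet pod template could carry an extra initContainer sharing the data volume (volume and mount names below are illustrative, and of course this would only remove the stale lock file, not fix the unclean shutdown itself):

# Hypothetical initContainer for the mongod pod template
initContainers:
  - name: clear-stale-lock
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        # Remove a leftover lock file from a previous unclean shutdown,
        # so mongod does not start with a non-empty mongod.lock.
        if [ -f /data/db/mongod.lock ]; then
          echo "removing stale /data/db/mongod.lock"
          rm -f /data/db/mongod.lock
        fi
    volumeMounts:
      - name: mongod-data          # must match the data volume of the mongod container
        mountPath: /data/db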

Thanks
John

More info

With 50k collections, where things seem to work:

{"t":{"$date":"2022-11-13T15:08:10.075+00:00"},"s":"I",  "c":"STORAGE",  "id":22322,   "ctx":"SignalHandler","msg":"Shutting down checkpoint thread"}
{"t":{"$date":"2022-11-13T15:08:10.075+00:00"},"s":"I",  "c":"STORAGE",  "id":22323,   "ctx":"SignalHandler","msg":"Finished shutting down checkpoint thread"}
{"t":{"$date":"2022-11-13T15:08:10.075+00:00"},"s":"I",  "c":"STORAGE",  "id":4795902, "ctx":"SignalHandler","msg":"Closing WiredTiger","attr":{"closeConfig":"leak_memory=true,"}}
{"t":{"$date":"2022-11-13T15:08:10.078+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"SignalHandler","msg":"WiredTiger message","attr":{"message":"[1668352090:78792][1:0x7fa53e7ef700], close_ckpt: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 391, snapshot max: 391 snapshot count: 0, oldest timestamp: (1668352072, 1) , meta checkpoint timestamp: (1668352077, 1)"}}
{"t":{"$date":"2022-11-13T15:08:10.090+00:00"},"s":"I",  "c":"STORAGE",  "id":4795901, "ctx":"SignalHandler","msg":"WiredTiger closed","attr":{"durationMillis":15}}

{"t":{"$date":"2022-11-13T15:08:10.103+00:00"},"s":"I",  "c":"STORAGE",  "id":22279,   "ctx":"SignalHandler","msg":">>>shutdown: removing fs lock...<<<"}

{"t":{"$date":"2022-11-13T15:08:10.103+00:00"},"s":"I",  "c":"-",        "id":4784931, "ctx":"SignalHandler","msg":"Dropping the scope cache for shutdown"}
{"t":{"$date":"2022-11-13T15:08:10.103+00:00"},"s":"I",  "c":"FTDC",     "id":4784926, "ctx":"SignalHandler","msg":"Shutting down full-time data capture"}
{"t":{"$date":"2022-11-13T15:08:10.103+00:00"},"s":"I",  "c":"FTDC",     "id":20626,   "ctx":"SignalHandler","msg":"Shutting down full-time diagnostic data capture"}
{"t":{"$date":"2022-11-13T15:08:10.106+00:00"},"s":"I",  "c":"CONTROL",  "id":20565,   "ctx":"SignalHandler","msg":"Now exiting"}
{"t":{"$date":"2022-11-13T15:08:10.106+00:00"},"s":"I",  "c":"CONTROL",  "id":23138,   "ctx":"SignalHandler","msg":"Shutting down","attr":{"exitCode":0}}

With 500k collections, where things no longer seem to work:

{"t":{"$date":"2022-11-13T18:55:55.101+00:00"},"s":"I",  "c":"STORAGE",  "id":22323,   "ctx":"SignalHandler","msg":"Finished shutting down checkpoint thread"}
{"t":{"$date":"2022-11-13T18:55:55.148+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn88","msg":"Connection ended","attr":{"remote":"10.12.3.1:22553","connectionId":88,"connectionCount":1}}
{"t":{"$date":"2022-11-13T18:55:55.233+00:00"},"s":"I",  "c":"STORAGE",  "id":4795902, "ctx":"SignalHandler","msg":"Closing WiredTiger","attr":{"closeConfig":"leak_memory=true,"}}
{"t":{"$date":"2022-11-13T18:55:55.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.82.238:27017"}}
{"t":{"$date":"2022-11-13T18:55:57.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.82.238:27017"}}
{"t":{"$date":"2022-11-13T18:55:59.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.82.238:27017"}}
{"t":{"$date":"2022-11-13T18:55:59.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.90.245:27017"}}
{"t":{"$date":"2022-11-13T18:56:01.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.90.245:27017"}}
{"t":{"$date":"2022-11-13T18:56:03.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.90.245:27017"}}
{"t":{"$date":"2022-11-13T18:56:05.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.90.245:27017"}}
{"t":{"$date":"2022-11-13T18:56:07.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.90.245:27017"}}
{"t":{"$date":"2022-11-13T18:56:09.462+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn159850","msg":"Connection ended","attr":{"remote":"10.12.3.1:51944","connectionId":159850,"connectionCount":0}}
{"t":{"$date":"2022-11-13T18:56:09.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.90.245:27017"}}
{"t":{"$date":"2022-11-13T18:56:11.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.90.245:27017"}}
{"t":{"$date":"2022-11-13T18:56:13.959+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"34.65.90.245:27017"}}

=> mongod does not seem able to complete the shutdown
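
My guess: with 500k collections, closing WiredTiger takes longer than the pod's termination grace period (30 seconds by default on Kubernetes), so the kubelet escalates from SIGTERM to SIGKILL before mongod ever reaches the "removing fs lock" step - which would explain the non-empty mongod.lock on the next start. If that is the cause, the relevant field on the mongod pods would be the standard Kubernetes terminationGracePeriodSeconds, roughly as below (whether the operator CR exposes this is something I have not verified):

# Illustrative StatefulSet pod template fragment
spec:
  template:
    spec:
      # give mongod enough time to close all WiredTiger tables cleanly
      # before the kubelet sends SIGKILL
      terminationGracePeriodSeconds: 600   # Kubernetes default is 30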

Good catch.

We will need to dig deeper and see how this can be improved.

Created this one: [K8SPSMDB-814] Unclean shutdown when pausing the cluster - Percona JIRA

It would be really great to have a fix or a workaround - thanks a lot!

More logs regarding the shutdown. It looks like the first pod to shut down, rs-2, shuts down correctly, whereas rs-1 and rs-0 do not shut down properly.

For rs-0 and rs-1, the next startup shows:

{"t":{"$date":"xxxxx"},"s":"W", "c":"STORAGE", "id":22271, "ctx":"initandlisten","msg":"Detected unclean shutdown - Lock file is not empty","attr":{"lockFile":"/data/db/mongod.lock"}}

rs-logs.txt (40.9 KB)
