Description:
The primary node of the replica set has crashed for some reason and is unable to start. On each startup it scans the collection for an index build, receives signal 15 (SIGTERM), shuts down, and then repeats the whole cycle again.
We are running percona/percona-server-mongodb:6.0.9-7.
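Since the log below shows the signal was sent by kill(2) with pid 0 and uid 0, my guess is that the SIGTERM comes from outside the container (for example a failed liveness probe or a pod restart). To figure out who is sending it, I am checking the pod events with something like this (a minimal sketch; the pod name is taken from the cluster status output further down):
# Look for liveness probe failures, OOM kills, or restart reasons on the failing member
kubectl describe pod perconamongodbcluster-rs0-0 -n perconamongodb
# Recent events in the namespace, ordered by creation time
kubectl get events -n perconamongodb --sort-by=.metadata.creationTimestamp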
Here are the mongod logs from the failing node:
{"t":{"$date":"2024-04-18T07:02:56.003+00:00"},"s":"I", "c":"-", "id":51773, "ctx":"initandlisten","msg":"progress meter","attr":{"name":"Index Build: scanning collection","done":127049100,"total":234295627,"percent":54}}
{"t":{"$date":"2024-04-18T07:02:59.003+00:00"},"s":"I", "c":"-", "id":51773, "ctx":"initandlisten","msg":"progress meter","attr":{"name":"Index Build: scanning collection","done":127904300,"total":234295627,"percent":54}}
{"t":{"$date":"2024-04-18T07:03:02.003+00:00"},"s":"I", "c":"-", "id":51773, "ctx":"initandlisten","msg":"progress meter","attr":{"name":"Index Build: scanning collection","done":129831800,"total":234295627,"percent":55}}
{"t":{"$date":"2024-04-18T07:03:03.453+00:00"},"s":"I", "c":"CONTROL", "id":23377, "ctx":"SignalHandler","msg":"Received signal","attr":{"signal":15,"error":"Terminated"}}
{"t":{"$date":"2024-04-18T07:03:03.453+00:00"},"s":"I", "c":"CONTROL", "id":23378, "ctx":"SignalHandler","msg":"Signal was sent by kill(2)","attr":{"pid":0,"uid":0}}
{"t":{"$date":"2024-04-18T07:03:03.453+00:00"},"s":"I", "c":"CONTROL", "id":23381, "ctx":"SignalHandler","msg":"will terminate after current cmd ends"}
{"t":{"$date":"2024-04-18T07:03:03.453+00:00"},"s":"I", "c":"REPL", "id":4784900, "ctx":"SignalHandler","msg":"Stepping down the ReplicationCoordinator for shutdown","attr":{"waitTimeMillis":15000}}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"REPL", "id":4794602, "ctx":"SignalHandler","msg":"Attempting to enter quiesce mode"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"-", "id":6371601, "ctx":"SignalHandler","msg":"Shutting down the FLE Crud thread pool"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"COMMAND", "id":4784901, "ctx":"SignalHandler","msg":"Shutting down the MirrorMaestro"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"SHARDING", "id":4784902, "ctx":"SignalHandler","msg":"Shutting down the WaitForMajorityService"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"NETWORK", "id":20562, "ctx":"SignalHandler","msg":"Shutdown: going to close listening sockets"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"NETWORK", "id":4784905, "ctx":"SignalHandler","msg":"Shutting down the global connection pool"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"CONTROL", "id":4784906, "ctx":"SignalHandler","msg":"Shutting down the FlowControlTicketholder"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"-", "id":20520, "ctx":"SignalHandler","msg":"Stopping further Flow Control ticket acquisitions."}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"REPL", "id":4784907, "ctx":"SignalHandler","msg":"Shutting down the replica set node executor"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"CONTROL", "id":4784908, "ctx":"SignalHandler","msg":"Shutting down the PeriodicThreadToAbortExpiredTransactions"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"REPL", "id":4784909, "ctx":"SignalHandler","msg":"Shutting down the ReplicationCoordinator"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"REPL", "id":5074000, "ctx":"SignalHandler","msg":"Shutting down the replica set aware services."}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"REPL", "id":5123006, "ctx":"SignalHandler","msg":"Shutting down PrimaryOnlyService","attr":{"service":"TenantMigrationDonorService","numInstances":0,"numOperationContexts":0}}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"REPL", "id":5123006, "ctx":"SignalHandler","msg":"Shutting down PrimaryOnlyService","attr":{"service":"ShardSplitDonorService","numInstances":0,"numOperationContexts":0}}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"REPL", "id":5123006, "ctx":"SignalHandler","msg":"Shutting down PrimaryOnlyService","attr":{"service":"TenantMigrationRecipientService","numInstances":0,"numOperationContexts":0}}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"REPL", "id":21328, "ctx":"SignalHandler","msg":"Shutting down replication subsystems"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"W", "c":"REPL", "id":21409, "ctx":"SignalHandler","msg":"ReplicationCoordinatorImpl::shutdown() called before startup() finished. Shutting down without cleaning up the replication system"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"SHARDING", "id":4784910, "ctx":"SignalHandler","msg":"Shutting down the ShardingInitializationMongoD"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"REPL", "id":4784911, "ctx":"SignalHandler","msg":"Enqueuing the ReplicationStateTransitionLock for shutdown"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"-", "id":4784912, "ctx":"SignalHandler","msg":"Killing all operations for shutdown"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"-", "id":4695300, "ctx":"SignalHandler","msg":"Interrupted all currently running operations","attr":{"opsKilled":4}}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"TENANT_M", "id":5093807, "ctx":"SignalHandler","msg":"Shutting down all TenantMigrationAccessBlockers on global shutdown"}
{"t":{"$date":"2024-04-18T07:03:03.454+00:00"},"s":"I", "c":"COMMAND", "id":4784913, "ctx":"SignalHandler","msg":"Shutting down all open transactions"}
{"t":{"$date":"2024-04-18T07:03:03.455+00:00"},"s":"I", "c":"REPL", "id":4784914, "ctx":"SignalHandler","msg":"Acquiring the ReplicationStateTransitionLock for shutdown"}
{"t":{"$date":"2024-04-18T07:03:04.333+00:00"},"s":"I", "c":"INDEX", "id":4984704, "ctx":"initandlisten","msg":"Index build: collection scan stopped","attr":{"buildUUID":null,"collectionUUID":{"uuid":{"$uuid":"578cce8a-3d2f-49c3-8e7a-32c0447b3ffd"}},"totalRecords":131103284,"durationMillis":234000,"phase":"collection scan","collectionScanPosition":"131103284","readSource":"kNoTimestamp","error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}}
{"t":{"$date":"2024-04-18T07:03:04.333+00:00"},"s":"I", "c":"STORAGE", "id":22206, "ctx":"initandlisten","msg":"Deferring table drop for index","attr":{"index":"_id_","namespace":"namespace.collection","uuid":{"uuid":{"$uuid":"578cce8a-3d2f-49c3-8e7a-32c0447b3ffd"}},"ident":"index-0-1349749316917450711","commitTimestamp":{"$timestamp":{"t":0,"i":0}}}}
{"t":{"$date":"2024-04-18T07:03:04.506+00:00"},"s":"E", "c":"STORAGE", "id":21021, "ctx":"initandlisten","msg":"Could not build an _id index on collection","attr":{"namespace":"namespace.collection","uuid":{"uuid":{"$uuid":"578cce8a-3d2f-49c3-8e7a-32c0447b3ffd"}},"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"collection scan stopped. totalRecords: 131103284; durationMillis: 234000ms; phase: collection scan; collectionScanPosition: RecordId(131103284); readSource: kNoTimestamp :: caused by :: interrupted at shutdown"}}}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"F", "c":"CONTROL", "id":20573, "ctx":"initandlisten","msg":"Wrong mongod version","attr":{"error":"UPGRADE PROBLEM: The data files need to be fully upgraded to version 4.4 before attempting a binary upgrade; see https://docs.mongodb.com/master/release-notes/5.0/#upgrade-procedures for more details."}}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"CONTROL", "id":23139, "ctx":"initandlisten","msg":"Conflicting exit code at shutdown","attr":{"originalExitCode":0,"newExitCode":62}}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"INDEX", "id":4784915, "ctx":"SignalHandler","msg":"Shutting down the IndexBuildsCoordinator"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"NETWORK", "id":4784918, "ctx":"SignalHandler","msg":"Shutting down the ReplicaSetMonitor"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"SHARDING", "id":4784921, "ctx":"SignalHandler","msg":"Shutting down the MigrationUtilExecutor"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"ASIO", "id":22582, "ctx":"MigrationUtil-TaskExecutor","msg":"Killing all outstanding egress activity."}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"COMMAND", "id":4784923, "ctx":"SignalHandler","msg":"Shutting down the ServiceEntryPoint"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"CONTROL", "id":4784925, "ctx":"SignalHandler","msg":"Shutting down free monitoring"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"CONTROL", "id":4784927, "ctx":"SignalHandler","msg":"Shutting down the HealthLog"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"CONTROL", "id":4784928, "ctx":"SignalHandler","msg":"Shutting down the TTL monitor"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"CONTROL", "id":6278511, "ctx":"SignalHandler","msg":"Shutting down the Change Stream Expired Pre-images Remover"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"CONTROL", "id":4784929, "ctx":"SignalHandler","msg":"Acquiring the global lock for shutdown"}
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"I", "c":"CONTROL", "id":4784930, "ctx":"SignalHandler","msg":"Shutting down the storage engine"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I", "c":"STORAGE", "id":22320, "ctx":"SignalHandler","msg":"Shutting down journal flusher thread"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I", "c":"STORAGE", "id":22321, "ctx":"SignalHandler","msg":"Finished shutting down journal flusher thread"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I", "c":"STORAGE", "id":22322, "ctx":"SignalHandler","msg":"Shutting down checkpoint thread"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I", "c":"STORAGE", "id":22323, "ctx":"SignalHandler","msg":"Finished shutting down checkpoint thread"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I", "c":"STORAGE", "id":20282, "ctx":"SignalHandler","msg":"Deregistering all the collections"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I", "c":"STORAGE", "id":22372, "ctx":"OplogVisibilityThread","msg":"Oplog visibility thread shutting down."}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I", "c":"STORAGE", "id":22317, "ctx":"SignalHandler","msg":"WiredTigerKVEngine shutting down"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I", "c":"STORAGE", "id":22318, "ctx":"SignalHandler","msg":"Shutting down session sweeper thread"}
{"t":{"$date":"2024-04-18T07:03:04.508+00:00"},"s":"I", "c":"STORAGE", "id":22319, "ctx":"SignalHandler","msg":"Finished shutting down session sweeper thread"}
{"t":{"$date":"2024-04-18T07:03:04.515+00:00"},"s":"I", "c":"STORAGE", "id":4795902, "ctx":"SignalHandler","msg":"Closing WiredTiger","attr":{"closeConfig":"leak_memory=true,"}}
{"t":{"$date":"2024-04-18T07:03:04.886+00:00"},"s":"I", "c":"STORAGE", "id":4795901, "ctx":"SignalHandler","msg":"WiredTiger closed","attr":{"durationMillis":371}}
{"t":{"$date":"2024-04-18T07:03:05.156+00:00"},"s":"I", "c":"STORAGE", "id":22279, "ctx":"SignalHandler","msg":"shutdown: removing fs lock..."}
{"t":{"$date":"2024-04-18T07:03:05.169+00:00"},"s":"I", "c":"-", "id":4784931, "ctx":"SignalHandler","msg":"Dropping the scope cache for shutdown"}
{"t":{"$date":"2024-04-18T07:03:05.169+00:00"},"s":"I", "c":"CONTROL", "id":20565, "ctx":"SignalHandler","msg":"Now exiting"}
{"t":{"$date":"2024-04-18T07:03:05.170+00:00"},"s":"I", "c":"CONTROL", "id":23138, "ctx":"SignalHandler","msg":"Shutting down","attr":{"exitCode":0}}
As you can see from the logs, there is one interesting line:
{"t":{"$date":"2024-04-18T07:03:04.507+00:00"},"s":"F", "c":"CONTROL", "id":20573, "ctx":"initandlisten","msg":"Wrong mongod version","attr":{"error":"UPGRADE PROBLEM: The data files need to be fully upgraded to version 4.4 before attempting a binary upgrade; see https://docs.mongodb.com/master/release-notes/5.0/#upgrade-procedures for more details."}}
I am not sure why we get this error; perhaps the database was in the middle of an upgrade when it crashed?
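For reference, this is how the feature compatibility version can be checked from one of the still-running members (a minimal sketch via mongosh; authentication flags are omitted and the host name is taken from the cluster status below):
# Check the FCV recorded in the admin database on a healthy member;
# a healthy 6.0 deployment should report something like { version: '6.0' }
mongosh "mongodb://perconamongodbcluster-rs0-1.perconamongodbcluster-rs0.perconamongodb.svc.cluster.local:27017" \
  --eval 'db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })'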
Also, here is the relevant part of the output of kubectl get psmdb perconacluster -o yaml -n perconamongodb:
- lastTransitionTime: "2024-04-18T02:07:46Z"
status: "True"
type: initializing
- lastTransitionTime: "2024-04-18T02:08:02Z"
status: "True"
type: ready
- lastTransitionTime: "2024-04-18T05:14:29Z"
message: 'create pbm object: create PBM connection to perconamongodbcluster-rs0-0.perconamongodbcluster-rs0.perconamongodb.svc.cluster.local:27017,perconamongodbcluster-rs0-1.perconamongodbcluster-rs0.perconamongodb.svc.cluster.local:27017,perconamongodbcluster-rs0-2.perconamongodbcluster-rs0.perconamongodb.svc.cluster.local:27017:
setup a new backups db: ensure cmd collection: connection(perconamongodbcluster-rs0-0.perconamongodbcluster-rs0.perconamongodb.svc.cluster.local:27017[-269065])
socket was unexpectedly closed: EOF'
reason: ErrorReconcile
status: "True"
type: error
- lastTransitionTime: "2024-04-18T05:14:38Z"
status: "True"
type: initializing
- lastTransitionTime: "2024-04-18T05:17:12Z"
message: 'rs0: ready'
reason: RSReady
status: "True"
type: ready
- lastTransitionTime: "2024-04-18T05:17:12Z"
status: "True"
type: initializing
host: perconamongodbcluster-rs0.perconamongodb.svc.cluster.local
mongoImage: percona/percona-server-mongodb:6.0.9-7
mongoVersion: 6.0.9-7
observedGeneration: 64
pmmVersion: 2.39.0
ready: 2
replsets:
rs0:
initialized: true
ready: 2
size: 3
status: initializing
size: 3
state: initializing
This contains an interesting state transition that I was not able to figure out. Could you help me understand why this happened? This is the exact time when the problems with the database started.
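In case it helps, I can also pull the operator logs around the ErrorReconcile transition with something like this (the operator deployment name is an assumption on my part and may differ in this install):
# Operator logs around the 05:14 ErrorReconcile window, filtered for reconcile errors
kubectl logs deployment/percona-server-mongodb-operator -n perconamongodb \
  --since-time="2024-04-18T05:14:00Z" | grep -iE "error|reconcile"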