This is a good topic to open a discussion about. MongoDB’s replica sets mean that regular maintenance doesn’t have to cause any downtime for the database clients, but (important) you must always keep a majority of the nodes in any given replica set up at any given time.
Rule of ‘Minority maximum death’?
Before stopping and restarting any mongod node, or the host it is on, the administration script/program needs to determine what the other nodes in the same replica set are, and confirm that a majority (i.e. two, for the typical three-node replica set) will still be healthy once that node is stopped. ‘Healthy’ = either PRIMARY, or SECONDARY with no noticeable replication lag.
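To make that rule concrete, here is a minimal sketch of the check in Python. It assumes `status` is the document returned by the `replSetGetStatus` command (with PyMongo, `client.admin.command("replSetGetStatus")`), that every member is a voting member, and that ‘no noticeable lag’ means within 10 seconds of the primary. The function name and the threshold are my own choices for illustration, not anything MongoDB prescribes.

```python
from datetime import timedelta

HEALTHY_STATES = {"PRIMARY", "SECONDARY"}


def safe_to_stop(status, target, max_lag=timedelta(seconds=10)):
    """Return True if a majority stays healthy after `target` is stopped.

    status: a replSetGetStatus-style document with a "members" list.
    target: the "name" (host:port) of the member we want to restart.
    Assumption: all members vote, so majority = floor(n / 2) + 1.
    """
    members = status["members"]

    # Lag is measured against the primary; with no primary the set is
    # already degraded, so refuse to stop anything.
    primaries = [m for m in members if m.get("stateStr") == "PRIMARY"]
    if not primaries:
        return False
    reference = primaries[0]["optimeDate"]

    healthy_others = [
        m for m in members
        if m["name"] != target
        and m.get("stateStr") in HEALTHY_STATES
        and reference - m["optimeDate"] <= max_lag
    ]
    majority = len(members) // 2 + 1
    return len(healthy_others) >= majority
```

Note that a check like this only describes one instant. It says nothing about what other agents are about to do, which is where the tricky part comes in.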
Parallelization for the win
You can process replica sets in parallel, so in theory it takes no longer for five hundred replica sets than for one. Of course in practice you’ll probably stagger launches of the update procedure so there isn’t too much to be watched at any given moment.
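As a rough sketch of that shape (parallel across replica sets, strictly one node at a time inside each set, with staggered starts), assuming a single controlling program: `restart_node` and `wait_until_safe` are hypothetical callbacks you would supply for your own environment, and `max_parallel` / `stagger_seconds` are just illustrative knobs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def patch_replica_set(nodes, restart_node, wait_until_safe):
    """Restart the nodes of ONE replica set strictly one after another."""
    done = []
    for node in nodes:
        wait_until_safe(node)   # hypothetical: block until a majority is healthy
        restart_node(node)      # hypothetical: patch and restart this node
        done.append(node)
    return done


def patch_all(replica_sets, restart_node, wait_until_safe,
              max_parallel=5, stagger_seconds=0.0):
    """Process many replica sets in parallel, staggering their start times.

    replica_sets: dict mapping a replica set name to its list of nodes.
    Returns a dict mapping each name to the nodes restarted, in order.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {}
        for name, nodes in replica_sets.items():
            futures[name] = pool.submit(
                patch_replica_set, nodes, restart_node, wait_until_safe)
            time.sleep(stagger_seconds)  # spread the launches out
        return {name: f.result() for name, f in futures.items()}
```

The point of the design is that the only parallelism is between replica sets; nothing here ever touches two nodes of the same set at once.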
The tricky parts
The tricky part is maintaining ‘Minority maximum death’ as the script/program runs.
I think the root issue behind this is that devops tools are built without a sense of distributed data. They’re made to run per server, using only the state they can sense locally on that server.
Parallel launch in the same replica set can easily lead to race conditions. For example, a local script running on hosts A, B and C of the same replica set might first run a check: ‘Are the other hosts in a healthy PRIMARY or SECONDARY state now?’. If the checks run in parallel, at the starting moment the answer will be true on every host. Then A, B and C all restart a moment later, taking down the entire replica set simultaneously.
Safety mechanisms have to be programmed. In short, I think a semaphore of some kind should be put into the replica set’s own data so external agents can see which node is scheduled to be restarted at any given moment. But at this point we should start determining some concrete points about your case.
- What is being patched? The entire host server, or just the mongod binaries? (The former is what I assume.)
- What is the sysadmin method? Bash scripts and passwordless SSH? A devops tool like Puppet, Ansible, etc.? A program using the API of a computing service like AWS, or of a bare-metal provisioning system like Ubuntu’s MAAS?
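Whatever the answers turn out to be, the semaphore idea could look roughly like the sketch below. It relies on MongoDB applying an update to a single document atomically, so if A, B and C all try to take the lock at the same moment only one of them sees `modified_count == 1`. The lock document, its `_id`, the field names and the function names are all my own assumptions for illustration. The functions take any object with PyMongo’s `Collection.update_one` interface, and the lock document `{"_id": "restart_lock", "holder": None}` is assumed to have been created once beforehand.

```python
from datetime import datetime, timezone

LOCK_ID = "restart_lock"  # hypothetical _id of the one lock document


def try_acquire(coll, node):
    """Try to claim the restart lock for `node`; return True on success.

    The filter only matches while nobody holds the lock, and a
    single-document update is atomic, so concurrent callers cannot
    both succeed.
    """
    result = coll.update_one(
        {"_id": LOCK_ID, "holder": None},
        {"$set": {"holder": node, "since": datetime.now(timezone.utc)}},
    )
    return result.modified_count == 1


def release(coll, node):
    """Release the lock, but only if `node` is the current holder."""
    result = coll.update_one(
        {"_id": LOCK_ID, "holder": node},
        {"$set": {"holder": None, "since": None}},
    )
    return result.modified_count == 1
```

With PyMongo this might be used as `coll = MongoClient(uri)["maintenance"]["locks"]` (database and collection names are again made up), ideally with a majority write concern so the claim survives a failover. A node would call `try_acquire` before restarting and `release` once it is back and healthy. A real version also needs a way to deal with a stale lock when the script dies mid-restart, which is what the `since` timestamp is there to help with.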