OPTIMIZE TABLE cluster lock-up

Last night I did some maintenance work on our production cluster. I wanted to regain some space from a bulk delete operation. I already knew that OPTIMIZE TABLE is not online when using PXC but I tried it on our test system and it finished in a reasonable time. No Problem to do it in the middle of the night.
On the production system there was still some traffic going on. After issuing the command all nodes started rebuilding the table at the same time because TOI mode was set. I have three nodes. I issued the command on node1. Node2 finished first and was fully operational after 8 minutes. Node1 and Node2 showed multiple “waiting for metadata lock” messages. Also on the OPTIMIZE TABLE command. The had the old and the new table in their datadir. Both with the same size. CPU load of MySQL was jumping around between 100 and 300%.
After FC kicked in I killed node1 and node3 to get out of trouble. On restart of node1 and node3 they did a SST.

Out of curiosity I optimized the table again with “OPTIMIZE LOCAL TABLE” one node after the other. This lead to no problems at all. Not even hanging transactions on the node where the process was running. The expected behavior of an online optimize.

Do you have any explanation for this behavior?

If you plan to run a DDL query (OPTIMIZE is) you need to check the queries that are already running there. Let me explain why. Let’s say the table to optimize is called “tickets”. And there is a long running query there, that will need 20 minutes to finish. That SELECT is not blocking other queries, so everything works as expected. When you run OPTIMIZE it will have to wait, since it needs metadata lock to be able to run, and that lock cannot be acquired if there is query accessing it. So, the OPTIMIZE waits to the long SELECT to finish. The problem is that if there is a query waiting for the metadata lock, all the new queries that comes after it will also have to wait, because this is a queue. You have:

  • Long SELECT running for 20 minutes.
  • OPTIMIZE waiting on metadata lock.
  • All other queries that try to access the table will have to wait until OPTIMIZE finishes.

In other words, all your queries trying to access that table will be locked during 20 minutes + the time OPTIMIZE will need to run.

When such queries like OPTIMIZE or ANALYZE are executed, you need to find that there are no other long running queries accessing the same table, to avoid this problem.

Regards.

I know but we had no long running queries. All queries were showing that they are waiting for metadata lock, including the optimize queries. (Our longest runs for 0.3 seconds)
Total deadlock situation.
Two nodes hanging like that with the optimized table nearly finished in a #sql file and two cores fully loaded… One node operating normaly.
​​​​​​Maybe I should mention that we are writing to all nodes.
This would never have finished. I also checked disk io on the dead nodes. Nothing going on at all.

Regards