PXC 8.0: MDL conflict during CREATE/DROP USER or GRANT


We have a 3-node cluster running this version:

Server version: 8.0.33-25.1 Percona XtraDB Cluster (GPL), Release rel25, Revision 0c56202, WSREP version
  • wsrep_OSU_method | TOI
  • wsrep_log_conflicts | ON

When executing CREATE/DROP USER or GRANT,
the server freezes and all nodes log errors like this:

[Note] [MY-000000] [WSREP] MDL conflict db= table= ticket=10 solved by abort

The following procedure restores the cluster:

  1. stop nodes 2 and 3: systemctl stop mysql
  2. start node 2: systemctl start mysql
  3. after SST finishes on node 2, start node 3: systemctl start mysql
  4. after SST finishes on node 3, the cluster is restored
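
The restart sequence above could be scripted roughly as follows. This is only a sketch: it assumes the service is named mysql as in the steps, that the mysql client can log in without extra options (for example through ~/.my.cnf), and the function names and POLL_SECONDS variable are made up for illustration. "Synced" is the wsrep_local_state_comment value a node reports once its state transfer has finished.

```shell
#!/bin/sh
# Helper functions for the node-by-node restart described above.
# Assumptions: systemd unit "mysql"; passwordless mysql client login.

# Print this node's wsrep state, e.g. "Synced".
node_state() {
    mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'" | cut -f2
}

# Start the node, then block until it reports Synced (SST/IST finished).
start_and_wait() {
    systemctl start mysql
    until [ "$(node_state)" = "Synced" ]; do
        sleep "${POLL_SECONDS:-5}"
    done
    echo "node is Synced"
}
```

With nodes 2 and 3 stopped, one would run start_and_wait on node 2 and, only after it returns, on node 3.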

In some cases CREATE/DROP USER or GRANT
completes successfully, or completes only after a long delay, for example 1.5 minutes or more.

The commands are executed on one node only, and no similar commands run anywhere else in the cluster at the same time, i.e. there is no parallel execution.

This behavior has not previously been observed with such commands.

What is the problem, and how can it be fixed?

If you switch to wsrep_OSU_method=NBO, do you get the same behavior? Are there any other table locks happening during the CREATE USER? During the stall, can you run SHOW ENGINE INNODB STATUS?
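
Because client sessions may hang during the stall, one option is to run each diagnostic query from a second terminal under a time limit. The sketch below assumes GNU coreutils timeout, passwordless mysql client login, and a hypothetical output directory; the function name is illustrative. performance_schema.metadata_locks is the MySQL 8.0 table listing held and pending metadata locks, assuming the metadata lock instrumentation is enabled.

```shell
#!/bin/sh
# Collect lock diagnostics while a CREATE/DROP USER is stalled.
# Each query gets its own time limit so a hung server cannot block the script.
collect_diag() {
    out_dir=${1:-/tmp/pxc-diag}   # hypothetical location
    limit=${2:-10}                # seconds allowed per query
    mkdir -p "$out_dir"
    timeout "$limit" mysql -e "SHOW ENGINE INNODB STATUS\G" \
        > "$out_dir/innodb_status.txt" 2>&1
    timeout "$limit" mysql -e "SHOW FULL PROCESSLIST" \
        > "$out_dir/processlist.txt" 2>&1
    timeout "$limit" mysql -e "SELECT * FROM performance_schema.metadata_locks" \
        > "$out_dir/metadata_locks.txt" 2>&1
    echo "$out_dir"
}
```

If a query times out, its output file will simply be empty or hold the client error, which is itself a useful data point.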

I haven’t tested it with the wsrep_OSU_method=NBO parameter.
Will it be enough to do it at the Session level? And then execute the commands?

I’ll check it out and come back again.

Thanks for the answer!

I can’t get the status, because everything freezes until two of the nodes are shut down.

Yes, SET SESSION wsrep_osu_method=NBO then run the CREATE USER.

Any other information in the mysql error log during the stalls?
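
To gather what each node logged around a stall, the WSREP conflict and replication-failure lines can be filtered out of the error log. A minimal sketch: the default log path is an assumption (check the log_error setting on your nodes) and the function name is made up.

```shell
#!/bin/sh
# Print WSREP conflict and replication-failure lines from an error log.
wsrep_problems() {
    log_file=${1:-/var/log/mysql/error.log}   # assumed path
    grep -E 'MDL conflict|Fail to replicate' "$log_file"
}
```

For example, wsrep_problems /var/log/mysql/error.log | tail -n 50 on every node shows the most recent matches.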

mysql> SET SESSION wsrep_osu_method=NBO;
Query OK, 0 rows affected (0,00 sec)

mysql> DROP USER `petrov.s`@`%`;
ERROR 1235 (42000): This version of MySQL doesn't yet support 'this query in wsrep_OSU_method NBO'

From error.log:

2024-06-25 18:03:30	
2024-06-25T15:03:30.088700Z 1327819 [ERROR] [MY-000000] [WSREP] Fail to replicate: DROP USER `petrov.s`@`%`

The conflicts appeared again when dropping the user.
Nothing else can be seen, because the server hangs.

Very strange behavior.

Can you put together a repeatable test case? I’d like to try it. If I can repeat it as well, then we can open a bug report.

The test case is this:
There is a user table.
Run any one of the following:

CREATE USER 'test_user'@'%' IDENTIFIED BY 'password';
GRANT SELECT ON `db_test`.* TO 'test_user'@'%';
DROP USER `petrov.s`@`%`;

It doesn’t matter what user, privileges or database.

When any of these statements is executed, all nodes log errors and the cluster freezes.
What helps is stopping all nodes, bootstrapping, and then joining the nodes one at a time.

Not long ago we started using roles: https://dev.mysql.com/doc/refman/8.0/en/roles.html
Could this be related?

I’ll try to reproduce this on a test cluster of the same version.

I see some hangs if I run all the selects in one command with FLUSH PRIVILEGES at the end. There are no issues if I run the commands one by one, with a few seconds between them and before FLUSH PRIVILEGES.

Yes, only one command is executed. I listed several because any one of them leads to the problem.
This behavior did not exist before.
When loading the output of the pt-show-grants utility, the behavior is identical.
What else can I check?
Can using roles have this effect?

Could it be related to bug PXC-4315 (Percona JIRA)?
I am studying the release notes and planning an upgrade. After that I will test the commands again.

Hello @shigaev.s,
I created a 3-node PXC 8.0.35 cluster and was unable to reproduce this. I ran the CREATE/DROP USER statements you provided and did not experience any cluster stalls at all. Can you please upgrade to the latest PXC and see if your issue remains?


Yesterday we upgraded to version 8.0.36-28 (04/03/2024).
Today we tested several commands: REVOKE, GRANT, DROP USER - no more MDL conflict and cluster freeze.
It seems that the bug fix for PXC-4315 (Percona JIRA) helped.

We’ll leave it for testing for now.
I’ll come back later and if it doesn’t happen again, I’ll mark the ticket as resolved.

The problem happened again.
On a DROP USER command the entire cluster froze, with no logs and no metrics.
The user was nevertheless deleted in the end;
I checked this after restoring the cluster.
That is unfortunate. The cause is unknown, and I don’t know how to debug it.

Any ideas?

Hello @shigaev.s,
This fix will be in 8.0.37


Hello @matthewb ,
will wait.

If you can create a core dump while the server is hung (kill -11), that would help us.
Then create a Jira ticket and upload the compressed core dump file.
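
A rough sketch of that procedure follows. It assumes core dumps are already enabled for mysqld (the core-file option in my.cnf plus a suitable core size limit for the service), that the process is named mysqld, and that the core file path is passed in by hand, since its location depends on kernel.core_pattern. The function names are illustrative.

```shell
#!/bin/sh
# Send SIGSEGV (signal 11) to the hung server so it writes a core file.
dump_core() {
    pid=$(pidof mysqld) || { echo "mysqld not running" >&2; return 1; }
    # Show where the kernel will write the core file, if readable.
    echo "core pattern: $(cat /proc/sys/kernel/core_pattern 2>/dev/null)"
    echo "sending signal 11 to: $pid"
    kill -11 $pid
}

# Compress a core file for upload; prints the name of the archive.
pack_core() {
    gzip -c "$1" > "$1.gz" && echo "$1.gz"
}
```

After dump_core, the core file named by the pattern can be passed to pack_core and attached to the ticket.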


This blog post can be helpful.
When the server is stuck, just kill it with kill -11. It will cause core file creation.


I’ll try this if it fails again.