Open table locks incrementing on one cluster node, requires mysqld cycle to correct

Hi there,

We’ve been having trouble with a 3-node Galera cluster: seemingly out of nowhere, one of the nodes will decide it doesn’t like a query and hang indefinitely, locking a table and holding that lock. At that point, all tables start to lock and every new query creates another table lock. The process list simply accumulates processes, existing processes keep incrementing their time, and the open-table count keeps climbing. Killing the process (or all processes) does not help. The only way to recover seems to be to cycle mysqld.

Here’s an example of one hung query from SHOW PROCESSLIST. I’ll spare the full list, unless you think it would help.

| 8 | xxxx_system_user | localhost | xxxx_system | Killed | 49634 | updating | UPDATE leases SET version=version+1,holder_id=1511216099643476245,expires=NOW() + INTERVAL 6930000 M | 0 | 0 |

Here’s the related row from SHOW OPEN TABLES (the columns are Database, Table, In_use, Name_locked):

| xxxx_system | leases | 3 | 0 |

And here’s a bit of SHOW ENGINE INNODB STATUS output that looks suspect:

---TRANSACTION 421965858011080, not started sleeping before entering InnoDB
mysql tables in use 1, locked 1
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421965858013496, not started sleeping before entering InnoDB
mysql tables in use 1, locked 1
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421965858009872, not started sleeping before entering InnoDB
mysql tables in use 1, locked 1
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421965858008664, not started sleeping before entering InnoDB
mysql tables in use 1, locked 1

We’re using pxc_strict_mode=ENFORCING and a read/write-splitting proxy (MaxScale).
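For context, the PXC side of each node’s configuration is along these lines (an illustrative excerpt with placeholder paths and addresses, not our exact my.cnf):

# my.cnf excerpt -- placeholder values for illustration
[mysqld]
wsrep_provider           = /usr/lib64/galera3/libgalera_smm.so
wsrep_cluster_address    = gcomm://10.0.0.1,10.0.0.2,10.0.0.3
wsrep_cluster_name       = xxxx_cluster
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
pxc_strict_mode          = ENFORCING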

We have tried everything we know of, both to avoid the situation and to recover from it without a restart.

Please let me know if any other information would be helpful in this case.

Cheers,
Clinton

We can recreate this very easily now. Using a simple bash script and the MySQL CLI (a minimal sketch of the script follows the list below), we can create the kind of contention that results in an unrecoverable open table lock:

  • 3 node PXC cluster
  • One table
  • One row
  • An update statement to increment a value on the row
  • 3 MySQL clients, each connected to a different PXC node
  • Each client running that same update statement to increment the value, as fast as they can (about 2000 qps total across the three)
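A minimal version of the driver script looks like this (host, credentials, and the table are placeholders for illustration); one copy runs against each of the three nodes:

#!/bin/bash
# Hammer the same single row with updates against one PXC node.
# The node name, credentials, and table are placeholders.
NODE=${1:-pxc-node-1}
while true; do
  mysql -h "$NODE" -u app_user -psecret test \
    -e "UPDATE t SET val = val + 1 WHERE id = 1;"
done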

PXC will eventually reach an unrecoverable open table lock state. I can even drop and recreate the table, and it still fails to clear the locks. The only resolution is to restart mysqld.

Luckily we can design our app to avoid writing to multiple masters. But performance considerations aside, shouldn’t PXC survive this? We would fully expect performance to be poor and even deadlocks to occur (affecting certain individual transactions).

But for the database node to completely lock up and be unrecoverable seems like a bug.

Thoughts? What other information can I provide?

I tried the described scenario with 5.7.19 but couldn’t reproduce the issue.

Base data:
create database test; use test;
create table t (i int, primary key pk(i)) engine=innodb;
insert into t values (1);

mysqlslap --create-schema="test" -q "update test.t set i = i + 1" -S /tmp/n1.sock -u root --concurrency=1 --iterations=20000000 &
mysqlslap --create-schema="test" -q "update test.t set i = i + 1" -S /tmp/n2.sock -u root --concurrency=1 --iterations=20000000 &
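To mirror the original three-client setup, a third instance can be pointed at the remaining node (assuming its socket is at /tmp/n3.sock):

mysqlslap --create-schema="test" -q "update test.t set i = i + 1" -S /tmp/n3.sock -u root --concurrency=1 --iterations=20000000 &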

One of the mysqlslap instances fails with “WSREP detected deadlock/conflict and aborted the transaction. Try restarting the transaction”, as expected.

Can you re-confirm the test case and also share your configuration options?
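In the meantime, something along these lines run on each node would capture the settings of interest (a minimal sketch):

-- run on each PXC node and paste the output
SHOW VARIABLES LIKE 'wsrep%';
SHOW VARIABLES LIKE 'pxc_strict_mode';
SHOW GLOBAL STATUS LIKE 'wsrep%';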