XtraDB Cluster limitations: wsrep_max_ws_size and wsrep_max_ws_rows

Hello,

We are trying to set up an XtraDB Cluster as a replacement for several dedicated MySQL servers that use classic statement-based replication. The current setup has serious problems with replication lag, and XtraDB Cluster sounded like the perfect solution to get rid of that.

So we bought 5 new servers with quite a nice hardware setup:
384 GB RAM
20 15k hard disks on a RAID controller with 2 SSDs for MaxCache
1 Fusion ioDrive2 365 GB with Facebook's Flashcache to add another super-fast cache to the filesystem

Our plan was to set up the XtraDB Cluster and let it replicate from our current replication master, then do some testing, send all read requests to the cluster, and finally switch the write requests over to the cluster as well. Of course with two additional HAProxy instances in an HA setup.

Everything went as expected until a developer issued a statement on the replication master that modified every row in a table of about 6 million rows:


update databasename.tablename set LastModified= NOW() where LastModified is null

This statement led to a deadlock, and the cluster stopped responding and replicating.
Here are the log messages:


130502 11:21:23 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
130502 11:21:23 [ERROR] WSREP: rbr write fail, data_len: 0, 1026
130502 11:22:21 [ERROR] Slave SQL: Error 'Deadlock found when trying to get lock; try restarting transaction' on query. Default database: 'databasename'. Query: 'update databasename.tablename set LastModified= NOW() where LastModified is null', Error_code: 1213
130502 11:22:21 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
130502 11:22:21 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
130502 11:22:21 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
130502 11:22:21 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'repli.047016' position 69124278

We figured out that if we split the above statement into two equal statements by adding "limit 4000000", everything works as it should:


update databasename.tablename set LastModified= NOW() where LastModified is null limit 4000000;
update databasename.tablename set LastModified= NOW() where LastModified is null limit 4000000;

But that cannot be the solution. Like nearly everyone who uses a database, we have growing tables. That means a statement that runs fine now may fail in two months, once the table reaches some unknown size.

After doing some research we tried to raise 'wsrep_max_ws_rows' and 'wsrep_max_ws_size' by adding the following lines to my.cnf:


wsrep_max_ws_rows=4294967296
wsrep_max_ws_size=34359738368

This led to these messages:


130503 16:39:49 [Warning] option 'wsrep_max_ws_rows': unsigned value 4294967296 adjusted to 1048576
130503 16:39:49 [Warning] option 'wsrep_max_ws_size': unsigned value 34359738368 adjusted to 4294967296
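The warnings show both values being clamped down to hard maximums. For reference, the effective values the server is actually using can be confirmed at runtime with:

SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws%';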

After running the statement, the cluster ended up in a deadlock again.

Does that really mean that XtraDB Cluster breaks if you try to update a large table?
Is there anything else we could do?
XtraDB Cluster sounded pretty cool so far.

Are there any practical, usable workarounds?

I can't account for the deadlock errors, but updates touching that many rows, and very large transactions in general, will be problematic with PXC. Those operations should be broken into much smaller chunks and applied gradually.

This can be as simple as adding LIMIT 1000 to the UPDATE and running it in a while loop until the rows-affected count from the previous run drops to 0.
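Here is a minimal sketch of that loop as a MySQL stored procedure, assuming the table and column names from the question (the procedure name backfill_lastmodified is made up). Under autocommit, each UPDATE inside the procedure commits on its own, so every writeset stays far below wsrep_max_ws_rows and wsrep_max_ws_size:

DELIMITER //
CREATE PROCEDURE backfill_lastmodified()
BEGIN
  DECLARE rows_updated INT DEFAULT 1;
  WHILE rows_updated > 0 DO
    -- update at most 1000 rows per transaction
    UPDATE databasename.tablename
       SET LastModified = NOW()
     WHERE LastModified IS NULL
     LIMIT 1000;
    -- ROW_COUNT() returns the number of rows changed by the previous
    -- statement; it drops to 0 once no rows with NULL LastModified remain
    SET rows_updated = ROW_COUNT();
  END WHILE;
END//
DELIMITER ;

CALL backfill_lastmodified();

The same loop could just as well live in application code or a small script; the point is that each chunk certifies and replicates as its own small writeset instead of one giant transaction.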