We are trying to set up an XtraDB Cluster as a replacement for several dedicated MySQL servers using classic statement-based replication. The current setup has some really big problems with replication lag, and XtraDB Cluster sounded like the perfect solution to get rid of that.
So we bought 5 new servers with quite a nice hardware setup:
20 × 15k hard disks on a RAID controller with 2 SSDs for maxCache
1 Fusion-io ioDrive2 365 GB with Facebook's Flashcache to add another superfast cache to the filesystem
Our plan was to set up the XtraDB Cluster and let it replicate from our current replication master, then do some testing, send all read requests to the cluster, and finally switch the write requests over to the cluster as well. Of course with two additional HAProxy instances in an HA setup.
Everything went as expected until a developer issued a statement on the replication master that modified every row in a table with about 6 million rows:
update databasename.tablename set LastModified= NOW() where LastModified is null
This statement led to a deadlock, and the cluster no longer responded or replicated.
Here are the log messages:
130502 11:21:23 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
130502 11:21:23 [ERROR] WSREP: rbr write fail, data_len: 0, 1026
130502 11:22:21 [ERROR] Slave SQL: Error 'Deadlock found when trying to get lock; try restarting transaction' on query. Default database: 'databasename'. Query: 'update databasename.tablename set LastModified= NOW() where LastModified is null', Error_code: 1213
130502 11:22:21 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
130502 11:22:21 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
130502 11:22:21 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
130502 11:22:21 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'repli.047016' position 69124278
We figured out that if we split the above statement into two statements, each with "limit 4000000" added, everything works as it should:
update databasename.tablename set LastModified= NOW() where LastModified is null limit 4000000;
update databasename.tablename set LastModified= NOW() where LastModified is null limit 4000000;
But that cannot be the solution. Like nearly everyone who uses a database, we have growing tables. That means there are statements which may run now, but not in two months, when a table has reached a size we cannot know in advance.
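A more general form of that split would be to repeat a LIMIT-ed update until no rows are left, so the batch size stays fixed no matter how large the table grows. Here is a minimal sketch of that loop in Python; `update_in_batches` and `run_batch` are hypothetical helper names, and the commented MySQL callback assumes a DB-API `cursor`/`connection` pair that is not part of our actual setup:

```python
def update_in_batches(run_batch, batch_size=10000):
    """Repeatedly run a LIMIT-ed UPDATE until no rows remain.

    run_batch(batch_size) must execute one "UPDATE ... LIMIT batch_size"
    (and commit it) and return the number of rows it changed.
    Returns the total number of rows updated.
    """
    total = 0
    while True:
        changed = run_batch(batch_size)
        if changed == 0:        # nothing left to update
            return total
        total += changed

# Against MySQL the callback would look roughly like this
# (cursor/connection are placeholders for a real DB-API connection):
#
# def run_batch(limit):
#     cursor.execute(
#         "update databasename.tablename set LastModified = NOW() "
#         "where LastModified is null limit %s", (limit,))
#     connection.commit()       # commit per batch keeps each writeset small
#     return cursor.rowcount
```

Committing after every batch is the point of the exercise: each transaction then stays well under the writeset limits regardless of the table size.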
After doing some research, we tried to adjust the settings for 'wsrep_max_ws_rows' and 'wsrep_max_ws_size' by adding the following lines to my.cnf:

wsrep_max_ws_rows = 4294967296
wsrep_max_ws_size = 34359738368

This led to these messages:
130503 16:39:49 [Warning] option 'wsrep_max_ws_rows': unsigned value 4294967296 adjusted to 1048576
130503 16:39:49 [Warning] option 'wsrep_max_ws_size': unsigned value 34359738368 adjusted to 4294967296
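The warnings show the server clamping both options to their hard maxima, so the largest values it will actually accept appear to be (a my.cnf sketch, values taken directly from the adjustment messages above):

```ini
# upper bounds implied by the "adjusted to" warnings
wsrep_max_ws_rows = 1048576      # ~1M rows per writeset
wsrep_max_ws_size = 4294967296   # 4 GB per writeset
```

In other words, no configuration we tried gets a single 6-million-row UPDATE under those limits.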
After running the statement, the cluster ended up in a deadlock again.
Does that really mean that XtraDB Cluster breaks if you try to update a large table?
Is there anything else we could do?
XtraDB Cluster sounded pretty cool so far.
Are there any practical, usable workarounds?