Galera inconsistent data?

I’ve been doing some testing of my application against Percona-XtraDB-Cluster-server-56-5.6.28-25.14.1.el7.x86_64,
and I’ve run across something I cannot explain; it appears to only occur under load.

I’ve got a load balancer sitting in front of 3 nodes directing traffic in a round robin fashion (Linux IPVS/LVS using server load
balancing features in keepalived). My application spawns 25 DB connections, resulting in 8-9 connections per DB node.

Essentially my application is doing something like:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN;
SELECT var FROM foo WHERE id = ? FOR UPDATE;

-- do some math on 'var' in the application

UPDATE foo SET var = ? WHERE id = ?;
COMMIT;

“id” is the primary key in this case, so I’m basically doing the same thing as this blog post:
[URL="http://galeracluster.com/2015/09/sup...alera-cluster/"]Support for MySQL Transaction Isolation Levels in Galera Cluster | Galera Cluster for MySQL[/URL]
I understand “SERIALIZABLE” isn’t really supported, but I’d hope Galera wouldn’t downgrade it
to worse than REPEATABLE READ or something.

What I am observing is that “N” concurrent committers get the same SELECT result (expected with Galera, since the
FOR UPDATE lock is only taken locally on each node), do their math, issue the UPDATE with the math result, which
succeeds (expected with Galera), and then all of the COMMITs succeed (unexpected). I would have expected all but
one committer to get a deadlock error (1213), so I’d need to retry the transaction on the deadlocked connections.
I do get some occasional deadlocks reported, but not nearly as many as I should. This leads to inconsistent
results, since the math performed is now based on a stale value … needless to say this causes major issues.
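
For context, the retry handling I expected to need looks roughly like the following. This is only a rough sketch against the MySQL C API, not my actual test case; the math is stubbed out and the id/retry values are made up.

#include <stdio.h>
#include <stdlib.h>
#include <mysql.h>

#define ER_LOCK_DEADLOCK 1213   /* MySQL "deadlock found" error code */

/* Placeholder for whatever math the application really does. */
static long do_math(long var)
{
    return var + 1;
}

/* Read var, do the math, write it back; retry the whole transaction
 * whenever a statement or the COMMIT fails with a deadlock (1213). */
static int update_var(MYSQL *conn, long id)
{
    int attempt;

    for (attempt = 0; attempt < 10; attempt++) {
        char query[256];
        long var = 0;
        int failed;

        failed = mysql_query(conn, "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE") ||
                 mysql_query(conn, "BEGIN");

        if (!failed) {
            snprintf(query, sizeof(query),
                     "SELECT var FROM foo WHERE id = %ld FOR UPDATE", id);
            failed = mysql_query(conn, query);
        }

        if (!failed) {
            MYSQL_RES *res = mysql_store_result(conn);
            if (res) {
                MYSQL_ROW row = mysql_fetch_row(res);
                if (row && row[0])
                    var = atol(row[0]);
                mysql_free_result(res);
            }
            snprintf(query, sizeof(query),
                     "UPDATE foo SET var = %ld WHERE id = %ld", do_math(var), id);
            failed = mysql_query(conn, query);
        }

        if (!failed && !mysql_query(conn, "COMMIT"))
            return 0;                          /* committed cleanly */

        if (mysql_errno(conn) != ER_LOCK_DEADLOCK)
            return -1;                         /* some other error, give up */

        mysql_query(conn, "ROLLBACK");         /* harmless if already rolled back */
    }
    return -1;                                 /* too many deadlock retries */
}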

If I attempt to reproduce this with the mysql command line tool, I cannot: all but one session gets a deadlock, as
expected.

A few other notes:

  1. The math operation performed may or may not be the same for each transaction; it depends on the operation,
    but some transactions may well produce the same result. (Does this matter?)
  2. The number of committers (“N”) here could in theory be as high as the number of connections, but in
    practice it is probably 1-3 … during a load test, though, this transaction is executed a few dozen times per
    second.
  3. This works flawlessly if all traffic is directed to a single node in the cluster, so it is definitely something introduced
    by using Galera.
  4. The value of wsrep_sync_wait appears to have zero effect (values 0, 1, 3 tried)
  5. I could change the “UPDATE” statement to also require the original value of ‘var’ in the WHERE clause,
    to try to catch this condition (see the sketch after this list), but I’m not actually sure whether that would
    work, depending on how the record updates are transmitted. I have not tried it, and wouldn’t think it
    should be necessary.
  6. I’m 90% sure we tested this same load scenario a year or so ago and it worked fine; could there be
    a regression in newer versions? I haven’t yet tried rolling back to something older.
  7. I haven’t yet tried to create a reduced test case; I wanted to first ask whether anyone else had seen
    what I’m seeing.
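
To clarify what I mean in point 5, the idea would be something like the sketch below: put the value that was read into the WHERE clause and treat “0 rows affected” as having lost the race. Again, just a sketch against the MySQL C API, not something I have actually tried.

#include <stdio.h>
#include <mysql.h>

/* Optimistic variant of the UPDATE from point 5: only apply the change if
 * var still holds the value we originally read. Returns 0 if the row was
 * updated, 1 if we lost the race (0 rows affected), -1 on a query error. */
static int update_var_checked(MYSQL *conn, long id, long old_var, long new_var)
{
    char query[256];

    snprintf(query, sizeof(query),
             "UPDATE foo SET var = %ld WHERE id = %ld AND var = %ld",
             new_var, id, old_var);

    if (mysql_query(conn, query))
        return -1;

    return (mysql_affected_rows(conn) == 1) ? 0 : 1;
}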

And finally, I’m sure everyone wants my DB configs …

/etc/my.cnf:
[mysqld]
datadir = /var/lib/mysql

# move tmpdir due to /tmp being a memory backed tmpfs filesystem, mysql uses this for on disk sorting

tmpdir = /var/lib/mysql/tmp

[mysqld_safe]
pid-file = /run/mysqld/mysql.pid
syslog
!includedir /etc/my.cnf.d

/etc/my.cnf.d/base.cnf:
[mysqld]
bind-address = 0.0.0.0
key_buffer = 256M
max_allowed_packet = 16M
max_connections = 256

# Some optimizations

thread_concurrency = 10
sort_buffer_size = 2M
query_cache_limit = 100M
query_cache_size = 256M
log_bin
binlog_format = ROW
gtid_mode = ON
log_slave_updates
enforce_gtid_consistency = 1
group_concat_max_len = 102400
innodb_buffer_pool_size = 10G
innodb_log_file_size = 64M
innodb_file_per_table = 1
innodb_file_format = barracuda
default_storage_engine = innodb

# SSD Tuning

innodb_flush_neighbors = 0
innodb_io_capacity = 6000

/etc/my.cnf.d/cluster.cnf:

# Galera cluster

[mysqld]
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = "sstuser:s3cretPass"
wsrep_cluster_name = cluster
wsrep_slave_threads = 32
wsrep_max_ws_size = 2G
wsrep_provider_options = "gcache.size = 5G; pc.recovery = true"
wsrep_cluster_address = gcomm://10.30.30.11,10.30.30.12,10.30.30.13
wsrep_sync_wait = 0
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = 1
innodb_flush_log_at_trx_commit = 0
sync_binlog = 0
innodb_support_xa = 0
innodb_flush_method = ALL_O_DIRECT

[sst]
progress = 1
time = 1
streamfmt = xbstream

Thanks!
-Brad

I’ve written a test case for this, though I’m having problems attaching it here; it’s not clear what the rules are for attachments, and it’s just a .c file.

Anyhow, one thing I neglected to mention previously: I’m also getting a couple of connections that deadlock and never return results. If I restart the mysql daemon itself, that unblocks the client side so it can error out.
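
As a client-side band-aid for those hung connections, I may add connect/read/write timeouts before connecting, so they eventually error out on their own instead of waiting for a mysqld restart. Something along these lines (the 30-second value is arbitrary, and this isn’t in my test case yet):

#include <mysql.h>

/* Set client-side timeouts (in seconds) before mysql_real_connect(), so a
 * wedged connection eventually returns an error instead of blocking forever. */
static void set_client_timeouts(MYSQL *conn)
{
    unsigned int timeout = 30;

    mysql_options(conn, MYSQL_OPT_CONNECT_TIMEOUT, &timeout);
    mysql_options(conn, MYSQL_OPT_READ_TIMEOUT,    &timeout);
    mysql_options(conn, MYSQL_OPT_WRITE_TIMEOUT,   &timeout);
}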

We’re in the process of rolling back to older versions with the test case we created, to see whether this is a recently introduced regression.

I just filed a bug here with my test case: [url]https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1560206[/url]