client connections and write consistency question

Please bear w/ me as I am new to percona and just trying to diagnose an issue. This may be obvious/explained somewhere else, but I’d like to ask the experts on here first.

We are running a 6 node cluster with:


  • We have some JDBC code that runs across many app servers.

  • Each of these app servers may be connecting to any of the 6 pecona nodes at any time.

  • The block of code in question does the following:

a) acquires an clusterd/distributed in-memory lock (via hazelcast) that is enforced across all the app servers that enter this code block
b) selects a row to determine the latest column value (columnA) from a single row (rowA)
c) updates rowA setting columnA = to the new value
d) releases the distributed lock, permitting other app servers to acquire A, and then do B and C respectively… and on and on

(1) When the JDBC code described above uses a driver connection with “autocommit = true” turned on, we see that occasionally various servers that enter the described block, are returned the old value from columnA from rowA in the routine above. Implying that whatever node they are talking to has yet to receive the latest value from some other servers previous write.

IMPORTANT NOTE for (1): when we force ALL app servers to talk to ONE percona node explicitly (the same node all the time) this “old value read” behavior goes away

(2) When the JDBC code described above uses a driver connection with “autocommit = false” and the code is altered to explicitly issue transaction start and commit demarcations on its own, then the “reads of old values” problem described in (1) goes away.

It is my understanding that for a given client connection to a percona node, when issuing an sql write of some sort; after committing locally on the node in question, percona/galera will not release the calling thread with an OK, until that write is synchronously propagated successfully to all percona nodes in the cluster… and this is what appears to be occurring as I would expect in (2) above.

I guess my question is, how is this synchronous write consistency across the cluster different in regards to when an OK is returned to the calling client when autocommit=true VS when the client explicitly issues the commit? From what i can tell, if autocommit = true, the client thread gets an OK and releases our logical “lock” prematurely thereby permitting other app servers to aquire the lock and read from the table, getting the “yet to be updated value” because it has yet to be propagated in the cluster.

Transactions in Galera are synchronously replicated (i.e., copied) but apply and commit is asynchronous on all the other nodes besides the node your client is connected to. What Galera does do is guarantee no other transaction can conflict with that one that got committed (replicated).

So, in your situation (1) above – your other clients were simply reading from a different node from the last update that hadn’t applied the update yet. If you want read consistency here, setting ‘wsrep_causal_reads=1’ should solve your problem. (this can be set just for these particular sessions). This ensures the inbound apply queue is flushed when you issue a SELECT.

When you only do this operation on 1 node, you aren’t experiencing any apply lag, so it behaves as expected.

In situation (2), I’m assuming the flow is START, SELECT, UPDATE, COMMIT, right? If so then I would expect that some of these transactions will get a deadlock error (replication conflict) sometimes if the row had been just updated on another node. So, maybe you’re just not seeing cases where the SELECT gets an old copy because those would be the precise occasions that a deadlock would be thrown instead.

I’m not clear if your distributed application lock helps or not here. If this is just a counter you’re managing in the database, you might consider pushing it up to your fancy memory store and only flushing to the DB periodically. Realize that you can only ever update a single row in Galera at best once per RTT (between the furthest two nodes in the cluster), other attempts will get those deadlock (replication conflict) errors. So, if this value needs a faster update rate than that, memory may be a better solution.

Otherwise, I’d probably want to use the autocommit=false and possibly a SELECT … FOR UPDATE. Then your app needs to handle deadlock errors properly or only write to one node for this work. Using wsrep_causal_reads will make the selects a bit lower, but it should reduce the deadlock rate since anything selecting (and then updating) an old value will certainly get a deadlock before it can commit.

Thanks Jay

Ok so that confirms what I figured was going on. My mis-understanding was that the apply/commit across the cluster was synchronous as well, in that the issuing client thread in my application, would wait/block until that was completed across all nodes before exiting the mysql conn and then releasing its in-mem-cluster lock, thereby permitting others to “read” from percona and getting the latest data. But from what you described this is not the case, hence why autocommit vs explicit COMMIT issued by my client would still not resolve that issue. (chance of reading “yet to be synced”) data in the cluster.

“I’m not clear if your distributed application lock helps or not here. If this is just a counter you’re managing in the database”

To clarify the physical impl of this “lock” is not in the db. (just the counter). Also this counter is not being updated for every single increment, but this “select/update” routine only runs every once in a while; the app actually does this routine once to allocate itself a block of X ids (like 10 or 20) to reserve the block then the update bumps that in the db via that update. In a real world situation this “select/update” per node is only occurring once an hour maybe. My unit tests that describe the situation above, just accellerate the possibility under a much more concurrent load/induced situation.

Interesting to note, that never once in my concurrent unit tests did I ever see a “deadlock” error returned to the client (when all going to one node, OR all to different nodes). So its not clear to when or if this should have been thrown in the tests described above. I just never encountered one, but rather the “problem” just manifests itself in this “stale read of old data” type of issue.

You were writing to multiple nodes in a PXC cluster in your test environment? Getting deadlock errors requires multi-node writing. Whether or not it happens is really a race condition and sometimes you need a lot of rapid writes on a small set of nodes to trigger it.

Yes the test was against 4-6 separate nodes concurrently

Regarding “wsrep_causal_reads=1”, anyone know the best way to set this through JDBC for a connection that was just checked out from a pool? I’d want to be able to turn this on and then off on check-in… or is this something that I’ll have to configure for all connections in that pool