Flow control state initiated by asynchronous traffic

We are trying to get on a 5 node Percona cluster. We are replicating from our current environment which is 2 masters and 2 slaves. Of course we’re reading the binary log from one master to a node in XtraDB cluster. We are running only a very light demo instance on the cluster currently. We are finding periods of time where the async replication causes a flow control state in XtraDB Cluster.

My theory is that there is a stored procedure being run that would be one transaction which modifies a large amount of data causing the flow control to back up. However, I’m having trouble proving that. I only know about the event after the event. I am looking for a needle in a hay farm. The stored procedure is my latest guess as to what is going on.

I am looking for advice on how to precisely capture the moment and transactions which are causing the flow control state, and fix that in my application.

I am guessing that any other writes, no matter how quickly they came in from async replication would be absorbed by galera as nothing.

Any advice on how to pinpoint the transaction/transactions causing the Flow Control state would be appreciated. We’ve seen up to 20 minute halt on the server due to this.

Hey, this is very interesting…

I recommend you to take a look on both below pages to read more about Flow Control and understand what needs to be tuned in this situation.

=> http://galeracluster.com/documentation-webpages/nodestates.html
=> http://www.percona.com/blog/2013/05/02/galera-flow-control-in-percona-xtradb-cluster-for-mysql/

Let us know if it helps…

Thank you for the response,
I have read both of those articles. We have about 4000 transactions/second, knowing what is clogging the pipes is not easy. Especially since the binary logs are tremendously large.

I get the general concept. If a received queue of 500(in our case) writes builds up on any node, the nodes stop replicating, which means that we cannot write to our session table, which takes the application down, and the process list shows many transactions in a pre-commit stage. Things that could cause this:

Large and long table locks, this would likely show on our slow log. Nothing extraordinary shows which would indicate that this is happening. Further, We have separate databases for each client, which would result in more small locks instead of a large lock.

DDL, we’re not pumping DDL via the app.

Writes that come in so quickly via the binary log that it jams Galera. I find this difficult to grasp as the async replication is a single thread and we have 50 threads open to replicate to the cluster.

Very large updates or deletes. This may be something, but 20 minutes would be a long time to be down, we don’t have any transactional tables exceeding 5M rows. We have re-written our MySQL object to disallow more than 1000 writes in a single statement. That doesn’t mean other things can’t cause more than 1000 writes. Procedures, etc. But we’re not doing million row updates or deletes.

My theory on the stored procedure is that I don’t know how it interacts with the binary log. I know that it’s not a single transaction in the sense that if it fails in a cursor the rows already affected won’t be rolled back. However, I don’t know if it writes the results to the binary log once it’s done or piecemeal.

The question is less “what’s causing the issue?”, but more “what tools and/or strategy would you use to pinpoint the misbehaving item(s)?”.

What I understood about your last message is that you’re a little bit lost regarding what’s happening on your systems and databases or even, you know that something harmful is happening. Consider to tune the Galera’s Flow Control, the adoption of PERFORMANCE_SCHEMA to check what’s happening piece-by-piece using the stage/statements instruments and then, on the programs that runs against the database, try divide and conquer.

BTW, not sure if your company has a support to analise your environment and give you the best opinion ever about what’s happening, but, consider to have the Percona’s guys having a look on this problem and with a manner time, adjust the use of Galera Cluster.

This is my 2 cents…

I know something is happening that is causing flow control. This is an unacceptable condition as it would cause downtime for all of our clients. I will look into more monitoring with the performance schema.

You are correct, I am a little lost, I suspect that nearly 100% of the people posting on forums are a little lost. I am in the quite common position of inheriting control of a database with less than optimal code. I don’t know every query that comes through. There are 4000 per second. An extraordinarily small set of these is causing flow control. We cannot migrate to a reportedly superior product until we have zero flow control events. Simply increasing the variables is not the solution, because we don’t know what is causing it, and at what threshold it would.

I was presented with a very unimpressive sales presentation from Percona. The sales agent didn’t respond when he said he would nor with the information he said he would. This makes our company weary to enter into a support contract with Percona.

Can you answer whether or not a single stored procedure call would write to the binary log on a 5.5 server (with Row Based Replication) in one large chunk, or would it write as statements are processed?

Yes, that can happen mainly when the data is changed almost at the same time, considering the change’s timestamp and binary log group commit, that will write many changes to the binary log at the same time (~ 300ms).

Have a look on this: http://www.percona.com/doc/percona-s…up_commit.html

I think you must monitor the amount of operations being done by the Stored Procedure to measure how large you need to configure the Flow Control to avoid stopping the cluster replication and transactions in pre-commit status.

Very helpful, thank you! I have a point of attack now.