Hello,
we have kind of a weird problem. But first, let me describe our setup. We have a cluster of three Percona MySQL nodes that sit behind two ProxySQL servers. The MySQL cluster is used by many webservers. This normally does not cause any problems and works fine.
But lately we have a problem: when a lot of data is being written, the replication somehow locks up and the webservers don't get their queries answered. We can then stop the two nodes that are not the ProxySQL writer, and the database immediately frees up and the problem is resolved. We get no errors in our logs when this happens, by the way. As additional information, SELECT queries can still be answered while the problem exists.
Thanks for the tip. We've looked into it and checked with "SHOW STATUS LIKE 'wsrep%'" and the "myq_status" tool whether flow control was causing the problem.
The thing is, flow control doesn't kick in when the problem starts but a lot later, when the problem already exists and has been getting worse for a few minutes. We haven't changed any settings regarding flow control yet, because it doesn't seem to be causing the problem.
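For reference, this is roughly the check we run on each node; the specific counters we watch are our own pick, not something the tools prescribe:

-- non-zero or growing flow-control counters would point at flow control
SHOW GLOBAL STATUS WHERE Variable_name IN
    ('wsrep_flow_control_paused', 'wsrep_flow_control_sent',
     'wsrep_flow_control_recv', 'wsrep_local_recv_queue');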
We also found out in our testing that the replication stall comes down to a single query: one query gets stuck, and all the other queries after it get resolved. When we manually kill that query, everything clears up instantly. But it's not always the same query, and they are often really simple (e.g. INSERT INTO `table` VALUES (just 5 values or so)).
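To illustrate, this is roughly how we locate and kill the stuck statement (the thread id in the KILL is just an example):

-- longest-running statements that are not idle
SELECT id, time, state, info
FROM information_schema.processlist
WHERE command <> 'Sleep'
ORDER BY time DESC
LIMIT 10;

-- then kill the offending one by its id
KILL 12345;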
We also found out that the problem occurs even when we don't go through the ProxySQL servers, but it happens less often.
Regarding replication: the way we do it is with a mysqldump piped into a mysql command, both accessing the same instance.
This leads me to believe you have a locking issue. Killing the query releases any locks which allows other queries to finish that were waiting on the locks. What’s the schema of this table? Always just a simple INSERT?
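One way to confirm that while a query is hanging (assuming a 5.7-era PXC, where the lock-wait tables still live in information_schema; in 8.0 they moved to performance_schema) would be something like:

-- who is waiting on whom, and which statements are involved
SELECT w.requesting_trx_id,
       rt.trx_mysql_thread_id AS waiting_thread,
       rt.trx_query           AS waiting_query,
       bt.trx_mysql_thread_id AS blocking_thread,
       bt.trx_query           AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx rt ON rt.trx_id = w.requesting_trx_id
JOIN information_schema.innodb_trx bt ON bt.trx_id = w.blocking_trx_id;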
wsrep_slave_threads = 128
That is extremely high. I’ve never configured that above 32, on even the beefiest of servers.
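For what it's worth, the setting is dynamic, so it can be lowered at runtime and then persisted in the config; a minimal sketch using the value suggested above:

-- apply immediately on each node
SET GLOBAL wsrep_slave_threads = 32;
-- and persist it under [mysqld] in my.cnf: wsrep_slave_threads = 32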
Hi, sorry for the late answer.
We did some tests and had to wait for devs.
We reduced wsrep_slave_threads to 32 like you suggested.
We are operating with pxc_strict_mode = PERMISSIVE because the TYPO3 backends our customer uses were not compatible. That has now changed, but it is creating new issues. In a separate sync process we use pt-table-sync to sync between two databases. The tool now fails, saying that it needs to change binlog_format to a format that isn't supported by pxc_strict_mode. To fix this we were thinking of setting the mode to PERMISSIVE while syncing and setting it back to ENFORCING after the sync is done.
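Roughly the sequence we had in mind (just a sketch; pt-table-sync itself runs from the separate sync host in between):

-- before the sync, on whichever node the connection lands on
SET GLOBAL pxc_strict_mode = 'PERMISSIVE';

-- ... run pt-table-sync from the sync server here ...

-- after the sync
SET GLOBAL pxc_strict_mode = 'ENFORCING';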
This would work, but when we set pxc_strict_mode at runtime, it only changes on the node we are connected to, despite it being a global variable. Is this supposed to happen, or is it a bug? Wouldn't that allow the cluster to get into an inconsistent state?
We can't access the backend hosts directly from the sync servers for security reasons. We can only access the database through ProxySQL, and that only sets the variable on one random node.
The pxc_strict_mode variable is specific to each node.
DISABLED: Do not perform strict mode validations and run as normal.
PERMISSIVE: If a validation fails, log a warning and continue running as normal.
ENFORCING: If validation fails during startup, halt the server and throw an error. If validation fails during runtime, deny the operation and throw an error.
MASTER: The same as ENFORCING except that the validation of explicit table locking is not performed. This mode can be used with clusters in which write operations are isolated to a single node.
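You can see the per-node behaviour by checking the value on each node individually, e.g.:

-- run on every node; each one only reports (and changes) its own setting
SHOW GLOBAL VARIABLES LIKE 'pxc_strict_mode';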
If you have one of the nodes acting as an async master to an async replica, then you should be OK with setting pxc_strict_mode=PERMISSIVE only on the PXC node that is the async master. That should solve the pt-table-sync issue for you.
You gotta run pt-table-checksum/pt-table-sync from the async master, which apparently is db2.
I am assuming the error you were getting was something like:
This tool requires binlog_format=STATEMENT, but the current binlog_format is set to ROW and an error occurred while attempting to change it. If running MySQL 5.1.29 or newer, setting binlog_format requires the SUPER privilege. You will need to manually set binlog_format to 'STATEMENT' before running this tool.
I would suggest getting access to that async master node and running
set global pxc_strict_mode='PERMISSIVE';
before the table sync, and setting it back to its original value once the sync is done.
Hi,
we don't have a replica; we have a three-node Galera cluster where no node is a master or replica/slave.
The thing is, the server from which the sync is triggered and executed has no direct MySQL access to the DB cluster for security reasons. Everything goes through a ProxySQL instance. Thus we can't control on which node the pt-table-sync commands and the SET GLOBAL command will be executed (or if we can, we haven't found a way). Is there any way to set the variable with one query on all nodes? Or is there a ProxySQL setting that may help?
@Atlas pt-table-sync is designed to synchronize tables in async setups (source->replica). You are using PXC which is not async replication. I do not believe pt-t-s will work for you here. Also, I don’t see why you need it since PXC maintains sync state with all data at all times. If you have discovered an inconsistency on node2, then you should stop that node, erase the data and force an SST to get a correct copy of the data from another node. Or, go to the correct node, make a copy of the table, (CREATE TABLE … LIKE …) and insert the data from old to new (INSERT INTO … SELECT FROM …), drop the old and rename new to old. That will “sync” the data between all nodes.
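For the copy/rename route, the steps would look roughly like this (mytable is just a placeholder name):

-- on the node that has the correct data
CREATE TABLE mytable_new LIKE mytable;
INSERT INTO mytable_new SELECT * FROM mytable;
-- drop the old table and put the fresh copy in its place
DROP TABLE mytable;
RENAME TABLE mytable_new TO mytable;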
Ah yes, to clarify: we misuse pt-table-sync a bit. We use it not to synchronize replication, but to synchronize two different databases, for example from a test environment to staging. We use this because mysqldump would cause a small downtime, which is not acceptable for the customer. This is not perfect, but it works. But now I see why these two features don't work well together.
@Atlas I encourage you to open a feature request/bug report on https://jira.percona.com for the percona toolkit project to let pt-t-s ignore the binlog format. When sync’ing two servers that are not connected via replication, there is no binlog involved so you should be able to disable that check in pt-t-c but I don’t see it in the doc pages. Seems like a small oversight.
Why would mysqldump produce a downtime? If you just want to dump 1 table, and you use --single-transaction then there should be no downtime as both writers and readers will be able to run while the dump is in progress.
Thanks for the tip, I created a ticket: [PT-2046] Option to ignore Binlog Format for pt-table-sync - Percona JIRA
By downtime I mean downtime of the web application, because it can't access the tables while they are being imported, either because of locking or index rebuilding or something like that.
I now tried to set pxc_strict_mode with a Trigger:
DELIMITER //
-- applies the value written to pxc_strict_mode_table to the local global variable
CREATE TRIGGER update_pxc_strict_mode
AFTER UPDATE
ON pxc_strict_mode_table
FOR EACH ROW
BEGIN
    IF NEW.setting = 'PERMISSIVE' THEN
        SET GLOBAL pxc_strict_mode = 'PERMISSIVE';
    ELSEIF NEW.setting = 'ENFORCING' THEN
        SET GLOBAL pxc_strict_mode = 'ENFORCING';
    END IF;
END //
DELIMITER ;
But this still doesn't work across the replication between the nodes. When I run "UPDATE pxc_strict_mode_table SET setting = 'PERMISSIVE';", the variable is set on the node I am connected to, so the trigger works there, but not on the other nodes. The table is updated on them as well, but their system variable doesn't change.
I also tried using a procedure like @mughees52 suggested:
DELIMITER //
-- simply applies the setting on whichever node executes the CALL
CREATE PROCEDURE update_pxc_strict_mode_permissive()
BEGIN
    SET GLOBAL pxc_strict_mode = 'PERMISSIVE';
END //
DELIMITER ;
When executing "CALL update_pxc_strict_mode_permissive();" the same thing happens: the local system variable changes, but the variables on the other nodes stay the same.