New Cluster not replicating

I created a two-node cluster (I haven’t added the 3rd node yet), but the cluster is not replicating to node 2. I made a big insert on node 1, assuming it would replicate to node 2, and it has not.

Node 2 my.cnf




innodb_buffer_pool_size = 20G
innodb_buffer_pool_instances = 8
innodb_lock_wait_timeout = 3600000
wait_timeout = 86400
interactive_timeout = 86400

max_heap_table_size = 96M
sort_buffer_size = 16M
join_buffer_size = 16M
tmp_table_size = 96M
max_sort_length = 8K
max_allowed_packet = 16M
group_concat_max_len = 16M
max_connections = 200
max_connect_errors = 10

max_binlog_size = 128M
wsrep_cluster_address=gcomm:// #node 1

# Slave thread to use


show status like 'wsrep%';

wsrep_local_state_comment:   Synced
wsrep_cluster_status:      Primary
wsrep_ready:     ON

Node 1 my.cnf is the same with a few exceptions.

wsrep_cluster_status:    Primary
wsrep_local_state_comment:  Synced 
wsrep_ready:     ON

Hi danarashad,

  • You should add the cluster IPs to “wsrep_cluster_address=gcomm://” (it is currently empty)

  • There are no wsrep_sst_method= or wsrep_sst_auth= variables set. Also make sure the SST user exists on the DB

  • It caught my attention that you run in pxc_strict_mode=PERMISSIVE mode. Although it’s not a hard requirement, you should run in ENFORCING

  • Check logs from both the donor and joiner to find out if they are communicating with each other and what messages you get in case SST fails
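A sketch of how the SST and strict-mode settings might look in my.cnf (the method name and credentials below are illustrative placeholders, not values from your setup):

```ini
# Illustrative values only -- adjust to your environment
wsrep_sst_method = xtrabackup-v2
# wsrep_sst_auth = sstuser:sstpassword   # PXC 5.7 only; removed in PXC 8
pxc_strict_mode  = ENFORCING
```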



As @CTutte says, you are missing the most critical configuration parameter:

wsrep_cluster_address=gcomm://

This is empty and tells each node to create its OWN 1-node cluster. You don’t have a 2-node cluster. You have 2 separate clusters of 1 node each. Change this parameter on BOTH nodes to something like this:


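A sketch with placeholder IPs (substitute your two nodes’ real addresses; the same line goes in both nodes’ my.cnf):

```ini
# Placeholder IPs -- list every node in the cluster
wsrep_cluster_address = gcomm://192.168.0.11,192.168.0.12
```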
Then you need to bootstrap node1 (to start a cluster), then start node2 normally so that it joins node1:

systemctl start mysql@bootstrap     (node1)
systemctl start mysql               (node2)

Node 1 is the bootstrapped node. So that is the reason for no IP address in the gcomm line.


@CTutte You don’t need this in PXC8, it’s an invalid parameter now. SST auth is handled automatically.

Sorry, I did not notice there was more to your pasted config file. Scroll bars were not visible so I didn’t see anything more.

If both nodes are connected, and you are in PRIMARY state (which I see is indeed the case above), then after you COMMIT a transaction on either node, it should immediately replicate to the other node. Have you tried some simple CREATE DATABASE, CREATE TABLE and verified those appear on both nodes?

How about something like this:

  CREATE TABLE bar (name varchar(20));

SELECT * FROM bar;   (run this on both nodes)

This is very odd: after about an hour or so, the second node is now up to date. Node 1 was bootstrapped; can I add Node 2 via an update statement?


Hi, that is not possible. The cluster will automatically determine the members. You cannot force-update.


I created a small table and inserted some data on node 1, and replication was almost instant.
On node 2, when I did the select, I received this in the error log:

MDL conflict db=foo table=bar ticket=10 solved by abort

I ran the select statement again and it worked. So there seems to be a delay. Is there something in my config that might slow down the replication process? Dropping the table took a few seconds, but replication was instant.

This is an expected message, due to the DDL you ran. By default, ALTER TABLE in PXC/Galera blocks the whole cluster. Check out this article for more info: Various Ways to Perform Schema Upgrades with Percona XtraDB Cluster - Percona Database Performance Blog


These are quite insane values. You’re saying that a transaction can wait 41 days for another transaction to complete. You should lower all of these to more realistic values: wait_timeout and interactive_timeout should be around 5 to 10 minutes, while innodb_lock_wait_timeout should be 50s (the default).
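A sketch of what that part of my.cnf might look like with the suggested values:

```ini
# More realistic timeouts (values from the advice above)
innodb_lock_wait_timeout = 50     # seconds; the MySQL default
wait_timeout             = 600    # 10 minutes
interactive_timeout      = 600    # 10 minutes
```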

But those are not your issue. I do not see anything in your config, nor is there a parameter available, that will slow down replication. Galera replication is always immediate and no setting can change that.

That error message seems to indicate that node2 did wait for something before it was able to read the data. Can you recreate the table and add an INT PK?


Yes, I can modify those values; they were just copied over from the previous cluster, which I did not set up. I can create another table; I am in the middle of waiting for a giant replication to happen on node 2.
Looking at node 2, show full processlist doesn’t show node 2 doing anything, with the exception of the datastore line. Is that the replication process happening? That is the table that was altered (update/insert/delete) by my script.

| Id | User            | Host            | db        | Command | Time   | State                                     | Info                  | Time_ms   | Rows_sent | Rows_examined |
|  1 | system user     |                 | NULL      | Sleep   | 168403 | wsrep aborter idle                        | NULL                  | 168402923 |         0 |             0 |
|  2 | system user     |                 | NULL      | Query   |   5922 | wsrep: committed TOI write set (296179)   | NULL                  |   5922001 |         0 |             0 |
|  8 | event_scheduler | localhost       | NULL      | Daemon  | 129388 | Waiting on empty queue                    | NULL                  | 129387980 |         0 |             0 |
| 10 | system user     |                 | NULL      | Sleep   |   7514 | wsrep: committed TOI write set (296178)   | NULL                  |   7514568 |         0 |             0 |
| 11 | system user     |                 | NULL      | Query   |  49576 | wsrep: committed write set (296176)       | NULL                  |  49061852 |         0 |             0 |
| 12 | system user     |                 | datastore | Query   |   2885 | wsrep: writing row for write-set (296183) | NULL                  |      1189 |         0 |             0 |
| 13 | system user     |                 | NULL      | Query   |   5922 | wsrep: committed write set (296181)       | NULL                  |   5912690 |         0 |             0 |
| 14 | system user     |                 | NULL      | Query   |   5922 | wsrep: committed TOI write set (296180)   | NULL                  |   5920212 |         0 |             0 |
| 16 | system user     |                 | NULL      | Sleep   |   7549 | wsrep: committed TOI write set (296177)   | NULL                  |   7549409 |         0 |             0 |
| 17 | system user     |                 | NULL      | Query   |  49668 | wsrep: committed write set (296175)       | NULL                  |  49096776 |         0 |             0 |
| 19 | root            | localhost:59544 | cmr       | Query   |      0 | init                                      | show full processlist |         0 |         0 |             0 |

While a table is being ALTERed, no other DML/DDL will be processed. If it takes 10 minutes for your ALTER to complete, then replication will be blocked for 10 minutes until the ALTER finishes. TOI = Total Order Isolation, which means “everything else stops while this ALTER finishes”. ALTER TABLE should be done very infrequently in a cluster environment.
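If blocking DDL becomes a problem, the article linked above covers alternatives to TOI. One of them, Rolling Schema Upgrade (RSU), looks roughly like this as a sketch (table and column names are hypothetical; note that RSU has its own caveats, such as the node desyncing while the ALTER runs):

```sql
-- Run on each node in turn; the ALTER applies only locally, not cluster-wide
SET SESSION wsrep_OSU_method = 'RSU';
ALTER TABLE bar ADD COLUMN created_at TIMESTAMP NULL;
SET SESSION wsrep_OSU_method = 'TOI';
```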


Here’s what I did
Node 1

CREATE TABLE bar (id int primary key, name varchar(20));
select * from bar;

Immediately switched to Node 2.

2022-01-20T17:21:04.732598Z 19 [Note] [MY-000000] [WSREP] MDL conflict db=foo_rep table= ticket=10 solved by abort
2022-01-20T17:21:28.595850Z 19 [Note] [MY-000000] [WSREP] MDL conflict db=foo_rep table=bar ticket=10 solved by abort

I ran the query, and the resulting log entries are shown above.

select * from bar;
ERROR 1146 (42S02): Table 'foo_rep.bar' doesn't exist

select * from bar;
Empty set (1.01 sec)

select * from bar;
| id | name |
|  1 | Fred |

There seems to be a delay, and with the large amount of data I am adding, there seems to be a greater delay. I am guessing this is not normal. Is there some sort of log file I can look at to figure out what is going on?


Can you stop your “large amount of data” for the purposes of these tests? I want to make sure that normally there isn’t an issue with our simple test case.

After you stop the large data, and you see the simple test works without delay, then you can start looking at streaming replication. Using Streaming Replication — Galera Cluster Documentation
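For reference, streaming replication (linked above) is enabled per session; a sketch, assuming you want the large transaction replicated in fragments of 10,000 rows (the fragment size is illustrative):

```sql
-- Enable streaming replication for this session only (Galera 4 / PXC 8)
SET SESSION wsrep_trx_fragment_unit = 'rows';
SET SESSION wsrep_trx_fragment_size = 10000;
-- ... run the large transaction here ...
SET SESSION wsrep_trx_fragment_size = 0;  -- back to normal replication
```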

Please also make sure you’ve installed/configured Percona Monitoring and Management. I’m interested to see if you are indeed having flow control issues (visible when you have “large amounts of writes”)


I have stopped the large datasets. The simple test is giving a delay. The only transaction being run on the cluster is below.

CREATE TABLE bar (id int primary key, name varchar(20));
select * from bar;

Should the Percona Monitoring and Management tool get installed on a different server?


If that simple test is giving you lag, then you have some severe, fundamental problem with your setup. You should be able to do 1000/sec of those simple inserts without experiencing any lag.

You should install PMM onto another, dedicated server. 4CPU/16GB ram is plenty.


Will get that installed and setup and let you know. Thanks for your help with this.


Think I got it figured out: the memory is maxed out. I am still in the process of setting the cluster up, and Node 2 is a temp server (chip shortage); it only has 20GB of memory, and btop is showing 99% of it used.
