PXC as async slave - replication stuck from plain Percona Server

Hello,

We have been using a plain master-slave setup until now and are looking into moving to PXC.

To have the smoothest live transition possible and take 0 risks of harming our live production, we want to have the PXC cluster be an asynchronous slave of the current plain master-slave pair, and then move reads and finally writes to PXC.


My method so far to provision the slave PXC cluster has been:

  1. backup/restore with xtrabackup for the first PXC node, with wsrep_* commented out (ie running as a plain node)
  2. check that replication is all ok
  3. Uncomment wsrep_* and bootstrap the node
  4. check that replication is still all ok with wsrep_* flags uncommented
  5. add other PXC nodes

This has worked just fine in dev so far, but doesn’t work for our production DB with identical master/slave and PXC configurations, and I cannot diagnose the issue so far.

The problem is that my prod PXC node cannot replicate and I am not sure why. Replication starts but immediately gets stuck, without any particular error/warning in the logs. The Exec_Source_Log_Pos variable never changes…

I can then re-comment-out wsrep_* and catch up, but whenever I uncomment them and restart I’m back to square 1.

This is more or less the same issue as PXC 8.x replication issues, and if I try to force-start things similarly I run into the same errors as that user in the end…

In there @matthewb suggests:

As my last piece of advice, I would try the following:

  1. Erase PXC1 datadir
  2. Ensure that all GTID-related configuration is set on mysql1
  3. Take a new xtrabackup of mysql1 and restore to PXC1
  4. Start PXC1 in bootstrap mode
  5. Configure PXC1 as replica using GTID

Which I’ve tried multiple times, with either of SOURCE_LOG_{FILE, POS} and SOURCE_AUTO_POS, both without success…

In case it somehow helps, these are the relevant configurations, on master/slave’s master:

[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
datadir         = /var/lib/mysql

skip-name-resolve
myisam-recover-options  = BACKUP
log_error = /var/log/mysql/error.log

server-id                   = 1
log_bin                     = /var/log/mysql/mysql-bin.log
binlog_expire_logs_seconds  = 864000
binlog_format               = ROW
max_binlog_size             = 100M

gtid_mode=ON
enforce-gtid-consistency=ON

innodb_buffer_pool_size = 24G

sql_mode = "STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION,NO_AUTO_VALUE_ON_ZERO"
character-set-server=utf8mb4
wait-timeout=200
interactive-timeout=200

and on PXC1 side:

[client]
socket=/var/run/mysqld/mysqld.sock

[mysqld]
bind-address=0.0.0.0
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
datadir         = /var/lib/mysql
skip-name-resolve

log_output = FILE
log_error = /var/log/mysql/error.log

server-id                   = 12

log_bin                     = /var/log/mysql/mysql-bin
binlog_expire_logs_seconds  = 864000
binlog_format               = ROW
log_replica_updates         = 1

max_binlog_size             = 100M
gtid_mode=ON
enforce-gtid-consistency=ON

relay_log=relay-bin

innodb_buffer_pool_size = 8G
innodb_log_file_size = 256M
innodb_log_files_in_group=2
innodb_autoinc_lock_mode=2

sql_mode = "STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION,NO_AUTO_VALUE_ON_ZERO"
character-set-server=utf8mb4
wait-timeout=200
interactive-timeout=200

default_authentication_plugin=mysql_native_password

binlog_expire_logs_seconds=604800

######## wsrep ###############
pxc_strict_mode=ENFORCING

wsrep_provider=/usr/lib/galera4/libgalera_smm.so
wsrep_cluster_address=gcomm://
wsrep_applier_threads=8
wsrep_log_conflicts=ON
wsrep_cluster_name=pxc-cluster
wsrep_node_name=pxc-cluster-node-1
wsrep_sst_method=skip
wsrep_debug=CLIENT

#wsrep_node_address=xxxxx

When running with wsrep_* commented out:

mysql> SHOW REPLICA STATUS \G
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: $redacted
                  Source_User: repl
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.013277
          Read_Source_Log_Pos: 44033822
               Relay_Log_File: relay-bin.000009
                Relay_Log_Pos: 44009214
        Relay_Source_Log_File: mysql-bin.013277
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 44009078
              Relay_Log_Space: 44034249
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File:
           Source_SSL_CA_Path:
              Source_SSL_Cert:
            Source_SSL_Cipher:
               Source_SSL_Key:
        Seconds_Behind_Source: 0
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Source_Server_Id: 1
                  Source_UUID: ffd08e3b-5120-11ec-b049-bac207f913f9
             Source_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
    Replica_SQL_Running_State: Replica has read all relay log; waiting for more updates
           Source_Retry_Count: 86400
                  Source_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Source_SSL_Crl:
           Source_SSL_Crlpath:
           Retrieved_Gtid_Set: ffd08e3b-5120-11ec-b049-bac207f913f9:476689337-476750187
            Executed_Gtid_Set: 0ab9f887-a52d-11eb-840e-bac207f913f9:1-141256875,
90cd250e-a4f0-11eb-9917-bac207f913f9:1-2,
ffd08e3b-5120-11ec-b049-bac207f913f9:1-476750187
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Source_TLS_Version:
       Source_public_key_path:
        Get_Source_public_key: 0
            Network_Namespace:
1 row in set (0.00 sec)

And when running with wsrep_* uncommented:

mysql> SHOW REPLICA STATUS \G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: $redacted
                  Source_User: repl
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.013277
          Read_Source_Log_Pos: 68378769
               Relay_Log_File: relay-bin.000003
                Relay_Log_Pos: 419
        Relay_Source_Log_File: mysql-bin.013277
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 51875810
              Relay_Log_Space: 16503762
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File:
           Source_SSL_CA_Path:
              Source_SSL_Cert:
            Source_SSL_Cipher:
               Source_SSL_Key:
        Seconds_Behind_Source: 244
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Source_Server_Id: 1
                  Source_UUID: ffd08e3b-5120-11ec-b049-bac207f913f9
             Source_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
    Replica_SQL_Running_State: Waiting for dependent transaction to commit
           Source_Retry_Count: 86400
                  Source_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Source_SSL_Crl:
           Source_SSL_Crlpath:
           Retrieved_Gtid_Set: ffd08e3b-5120-11ec-b049-bac207f913f9:476752120-476756546
            Executed_Gtid_Set: 0ab9f887-a52d-11eb-840e-bac207f913f9:1-141256875,
90cd250e-a4f0-11eb-9917-bac207f913f9:1-2,
ffd08e3b-5120-11ec-b049-bac207f913f9:1-476752121
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Source_TLS_Version:
       Source_public_key_path:
        Get_Source_public_key: 0
            Network_Namespace:
1 row in set (0.00 sec)

ERROR:
No query specified

mysql> show processlist;
+----+-----------------+-----------------+------+---------+------+------------------------------------------------+------------------+---------+-----------+---------------+
| Id | User            | Host            | db   | Command | Time | State                                          | Info             | Time_ms | Rows_sent | Rows_examined |
+----+-----------------+-----------------+------+---------+------+------------------------------------------------+------------------+---------+-----------+---------------+
|  5 | event_scheduler | localhost       | NULL | Daemon  |   88 | Waiting on empty queue                         | NULL             |   87886 |         0 |             0 |
|  6 | system user     |                 | NULL | Sleep   |   88 | innobase_commit_low (-1)                       | NULL             |   87865 |         0 |             0 |
|  7 | system user     |                 | NULL | Sleep   |   88 | wsrep aborter idle                             | NULL             |   87865 |         0 |             0 |
| 10 | system user     |                 | NULL | Sleep   |   88 | wsrep: applier idle                            | NULL             |   87860 |         0 |             0 |
| 11 | system user     |                 | NULL | Sleep   |   88 | wsrep: applier idle                            | NULL             |   87859 |         0 |             0 |
| 12 | system user     |                 | NULL | Sleep   |   88 | wsrep: applier idle                            | NULL             |   87859 |         0 |             0 |
| 13 | system user     |                 | NULL | Sleep   |   88 | wsrep: applier idle                            | NULL             |   87859 |         0 |             0 |
| 14 | system user     |                 | NULL | Sleep   |   88 | wsrep: applier idle                            | NULL             |   87859 |         0 |             0 |
| 15 | system user     |                 | NULL | Sleep   |   88 | wsrep: applier idle                            | NULL             |   87859 |         0 |             0 |
| 16 | system user     |                 | NULL | Sleep   |   88 | wsrep: applier idle                            | NULL             |   87859 |         0 |             0 |
| 18 | root            | localhost       | NULL | Query   |    0 | init                                           | show processlist |       0 |         0 |             0 |
| 19 | system user     | connecting host | NULL | Connect |   16 | Waiting for source to send event               | NULL             |   16015 |         0 |             0 |
| 20 | system user     |                 | NULL | Query   |   16 | Waiting for dependent transaction to commit    | NULL             |   15956 |         0 |             0 |
| 21 | system user     |                 | NULL | Query   |  250 | Waiting for an event from Coordinator          | NULL             |   15945 |         0 |             0 |
| 22 | system user     |                 | NULL | Query   |  250 | wsrep: write set replicated and certified (20) | NULL             |   15921 |         0 |             0 |
| 23 | system user     |                 | NULL | Query   |  250 | Waiting for preceding transaction to commit    | NULL             |   15923 |         0 |             0 |
| 24 | system user     |                 | NULL | Connect |   16 | Waiting for an event from Coordinator          | NULL             |   16012 |         0 |             0 |
+----+-----------------+-----------------+------+---------+------+------------------------------------------------+------------------+---------+-----------+---------------+
17 rows in set (0.00 sec)

Any suggestions of where to look for what’s wrong would be greatly appreciated, as START REPLICA doesn’t cause more to show in the logs than:

2022-08-06T11:11:13.538547Z 19 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2022-08-06T11:11:13.539763Z 19 [System] [MY-010562] [Repl] Slave I/O thread for channel '': connected to master 'repl@$redacted:3306',replication started in log 'FIRST' at position 4
1 Like

As for the exact transaction, it’s not one in particular, and we don’t have any kind of massive writes. However it is interesting that the binlog event seems to be empty?

mysql> SHOW REPLICA STATUS \G
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: $redacted
                  Source_User: repl
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.013278
          Read_Source_Log_Pos: 103570839
               Relay_Log_File: relay-bin.000003
                Relay_Log_Pos: 419
        Relay_Source_Log_File: mysql-bin.013277
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 51882588
              Relay_Log_Space: 156548751
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File:
           Source_SSL_CA_Path:
              Source_SSL_Cert:
            Source_SSL_Cipher:
               Source_SSL_Key:
        Seconds_Behind_Source: 2181
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Source_Server_Id: 1
                  Source_UUID: ffd08e3b-5120-11ec-b049-bac207f913f9
             Source_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
    Replica_SQL_Running_State: Waiting for dependent transaction to commit
           Source_Retry_Count: 86400
                  Source_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Source_SSL_Crl:
           Source_SSL_Crlpath:
           Retrieved_Gtid_Set: ffd08e3b-5120-11ec-b049-bac207f913f9:476752122-476792582
            Executed_Gtid_Set: 0ab9f887-a52d-11eb-840e-bac207f913f9:1-141256875,
90cd250e-a4f0-11eb-9917-bac207f913f9:1-2,
ffd08e3b-5120-11ec-b049-bac207f913f9:1-476752135
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Source_TLS_Version:
       Source_public_key_path:
        Get_Source_public_key: 0
            Network_Namespace:
1 row in set (0.00 sec)

mysql> SHOW BINLOG EVENTS IN 'mysql-bin.013278' FROM 51882588;
Empty set (0.00 sec)
1 Like

If someone runs into this, we finally found the solution. In a recent mysql update slave_parallel_workers has changed from 0 to 4 as default value.

This makes one run into the issue mentioned here, whose solution works just fine too: Technology sharing | sudden master-slave replication delay of MySQL | Develop Paper

Now why this only caused issue with GR enabled I have no idea, but setting slave_parallel_workers=0 fixed it nonetheless and I can’t say I’ll look into that more as having PXC as async slave is merely a migration step for us.

1 Like

Glad you found the issue. Point of clarification, you do not have GR (group replication) involved here. You are using PXC which is Galera-based.

1 Like

Ah yep woops. Though I never got to why that was specifically problematic with PXC’s galera features (if we can call it that)

1 Like