Delay in user synchronization across Percona XtraDB Cluster nodes

Hi everyone,

I’m currently managing a Percona XtraDB Cluster (PXC) environment with the following setup:

  • Cluster version: 8.0.43-34.1 - Percona XtraDB Cluster (GPL), Release rel34, Revision 0682ba7, WSREP version 26.1.4.3

  • Topology: 2-node master-master cluster + 1 arbitrator.

Our database is accessed by two types of servers:

  1. “Front” servers: Originally connected to Node 1 via HAProxy, with automatic failover to Node 2 if Node 1 crashes. Recently, some of these servers were reconfigured to use HAProxy for load balancing between both nodes.

  2. “Management” server: Directly connected to Node 1 only.

We have a script on the management server that does the following (rough SQL sketch after the list):

  • Creates a new database in the cluster.

  • Creates a MySQL user with access only to that database.

  • Triggers an API call on a front server, which uses the newly created user to access the database.
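
Roughly, the SQL side of that script looks like this (all names, the host pattern, and the grant scope are placeholders, not our real values):

CREATE DATABASE `customer_db`;
CREATE USER 'customer_user'@'10.%' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON `customer_db`.* TO 'customer_user'@'10.%';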

Issue: Between the user creation and the API call, there is a delay of over 1 second. Recently, we encountered a bug where the API call was routed to Node 2, which returned an error stating the user did not exist. By the time we received the error and retried manually, the user had synchronized.

Questions:

  • Why isn’t user synchronization instantaneous across nodes?

  • How and where is user synchronization configured in PXC?

  • What are the limitations or key points to be aware of regarding user synchronization?

I didn’t set up the cluster myself, so I’m not entirely sure how everything is configured, but I can access the servers if more details are needed.

Thanks in advance for your insights!


Could you share your wsrep configuration?

show global variables like '%wsrep%';
show global status like '%wsrep%';

Yep :slight_smile:

show global variables like '%wsrep%';

Variable_name Value
wsrep_applier_FK_checks ON
wsrep_applier_threads 48
wsrep_applier_UK_checks OFF
wsrep_auto_increment_control ON
wsrep_causal_reads OFF
wsrep_certification_rules strict
wsrep_certify_nonPK ON
wsrep_cluster_address gcomm://10.xx.xx.3,10.xx.xx.4
wsrep_cluster_name cluster_xxxxxxxxx_01
wsrep_data_home_dir /var/lib/mysql/
wsrep_dbug_option
wsrep_debug NONE
wsrep_desync OFF
wsrep_dirty_reads OFF
wsrep_disk_pages_encrypt NONE
wsrep_gcache_encrypt NONE
wsrep_ignore_apply_errors 0
wsrep_load_data_splitting OFF
wsrep_log_conflicts ON
wsrep_max_ws_rows 0
wsrep_max_ws_size 2147483647
wsrep_min_log_verbosity 3
wsrep_mode
wsrep_node_address 10.xx.xx.4
wsrep_node_incoming_address AUTO
wsrep_node_name xxxx-04
wsrep_notify_cmd
wsrep_OSU_method TOI
wsrep_provider /usr/lib/galera4/libgalera_smm.so
wsrep_provider_options allocator.disk_pages_encryption = no; allocator.encryption_cache_page_size = 32K; allocator.encryption_cache_size = 16777216; base_dir = /var/lib/mysql/; base_host = 10.xx.xx.4; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = no; debug = no; evs.auto_evict = 0; evs.causal_keepalive_period = PT1S; evs.debug_log_mask = 0x1; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout = PT7.5S; evs.join_retrans_period = PT1S; evs.keepalive_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 10; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true; evs.user_send_window = 4; evs.version = 1; evs.view_forget_timeout = P1D; gcache.dir = /var/lib/mysql/; gcache.encryption = no; gcache.encryption_cache_page_size = 32K; gcache.encryption_cache_size = 16777216; gcache.freeze_purge_at_seqno = -1; gcache.keep_pages_count = 0; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_auto_evict_threshold = 0.75; gcs.fc_auto_evict_window = 0; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 100; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S; gmcast.segment = 0; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = 10.xx.xx.4; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT20S; pc.npvo = false; pc.recovery = true; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.wait_restored_prim_timeout = PT0S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 11; socket.checksum = 2; socket.recv_buf_size = auto; socket.send_buf_size = auto;
wsrep_recover OFF
wsrep_reject_queries NONE
wsrep_replicate_myisam OFF
wsrep_restart_replica OFF
wsrep_restart_slave OFF
wsrep_retry_autocommit 1
wsrep_RSU_commit_timeout 5000
wsrep_slave_FK_checks ON
wsrep_slave_threads 48
wsrep_slave_UK_checks OFF
wsrep_SR_store table
wsrep_sst_allowed_methods xtrabackup-v2,clone
wsrep_sst_donor
wsrep_sst_donor_rejects_queries OFF
wsrep_sst_method xtrabackup-v2
wsrep_sst_receive_address AUTO
wsrep_start_position c6f4c786-00be-11f0-8ecf-a2d19a721caa:112038713
wsrep_sync_wait 0
wsrep_trx_fragment_size 0
wsrep_trx_fragment_unit bytes
wsrep_use_async_monitor ON

show global status like '%wsrep%';

Variable_name Value
wsrep_local_state_uuid c6f4c786-00be-11f0-8ecf-a2d19a721caa
wsrep_protocol_version 11
wsrep_last_applied 115069889
wsrep_protocol_application 4
wsrep_protocol_replicator 11
wsrep_protocol_GCS 5
wsrep_last_committed 115069889
wsrep_monitor_status (L/A/C) [ (3072710, 3072710), (115069889, 115069889), (115069889, 115069889) ]
wsrep_replicated 719971
wsrep_replicated_bytes 1014330176
wsrep_repl_keys 7009970
wsrep_repl_keys_bytes 73362608
wsrep_repl_data_bytes 892398608
wsrep_repl_other_bytes 0
wsrep_received 2352690
wsrep_received_bytes 3357110165
wsrep_local_commits 719898
wsrep_local_cert_failures 64
wsrep_local_replays 1
wsrep_local_send_queue 0
wsrep_local_send_queue_max 12
wsrep_local_send_queue_min 0
wsrep_local_send_queue_avg 0.000345927
wsrep_local_recv_queue 0
wsrep_local_recv_queue_max 58
wsrep_local_recv_queue_min 0
wsrep_local_recv_queue_avg 0.00685131
wsrep_local_cached_downto 114954454
wsrep_flow_control_paused_ns 446409
wsrep_flow_control_paused 7.2023e-10
wsrep_flow_control_sent 0
wsrep_flow_control_recv 1
wsrep_flow_control_active false
wsrep_flow_control_requested false
wsrep_flow_control_interval [ 141, 141 ]
wsrep_flow_control_interval_low 141
wsrep_flow_control_interval_high 141
wsrep_flow_control_status OFF
wsrep_cert_deps_distance 1.03947
wsrep_apply_oooe 0.0242157
wsrep_apply_oool 0.000235552
wsrep_apply_window 1.08584
wsrep_apply_waits 184710
wsrep_commit_oooe 0
wsrep_commit_oool 0
wsrep_commit_window 1.02072
wsrep_local_state 4
wsrep_local_state_comment Synced
wsrep_cert_index_size 230
wsrep_cert_bucket_count 85229
wsrep_gcache_pool_size 134219032
wsrep_causal_reads 64
wsrep_cert_interval 2033.03
wsrep_open_transactions 0
wsrep_open_connections 0
wsrep_ist_receive_status
wsrep_ist_receive_seqno_start 0
wsrep_ist_receive_seqno_current 0
wsrep_ist_receive_seqno_end 0
wsrep_incoming_addresses 10.xx.xx.4:3306,10.xx.xx.3:3306
wsrep_cluster_weight 3
wsrep_desync_count 0
wsrep_evs_delayed
wsrep_evs_evict_list
wsrep_evs_repl_latency 0.000368835/0.000548454/0.000815358/0.000100379/22
wsrep_evs_state OPERATIONAL
wsrep_gcomm_uuid 06cee51f-ae37-11f0-a011-2f2d9acb8a6b
wsrep_gmcast_segment 0
wsrep_cluster_capabilities
wsrep_cluster_conf_id 37
wsrep_cluster_size 3
wsrep_cluster_state_uuid c6f4c786-00be-11f0-8ecf-a2d19a721caa
wsrep_cluster_status Primary
wsrep_connected ON
wsrep_local_bf_aborts 21
wsrep_local_index 0
wsrep_provider_capabilities :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO:
wsrep_provider_name Galera
wsrep_provider_vendor Codership Oy <info@codership.com> (modified by Percona <https://percona.com/>)
wsrep_provider_version 4.23(cb05b32)
wsrep_ready ON
wsrep_thread_count 49

It should be. CREATE USER and other DCL statements are replicated synchronously. What’s in your error logs on both nodes during user creation?

I see some flow control time; are you monitoring this? What’s your TPS/QPS?
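
For example, to see the flow-control counters (and to locate the error log for the check above):

SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';
SHOW GLOBAL VARIABLES LIKE 'log_error';

A rising wsrep_flow_control_paused_ns between two samples means replication was paused by flow control during that interval.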

Checking the error log and flow control is a good idea. Additionally, you can have your API set
wsrep_sync_wait = 1 at the session level if you don’t want to enable synced reads globally.

Around 1.5k QPS on the cluster…
(We have over 1,000 databases and users in the cluster.)

I haven’t encountered any real errors. In this case, the first API call threw:
SQLSTATE[HY000] [1045] Access denied for user 'xxxxxxxx'@'10.xxx.xxx.2' (using password: YES)

This hasn’t happened again since :man_shrugging:, but I’ll dig into the logs (which go back 15 days, ugh). I’ll also add wsrep_sync_wait = 1 to the creation script to prevent this from happening again.

Run:

SELECT COUNT(*) FROM mysql.user;

If that returns over 1,000 entries, then that is probably the issue. There is an internal caching table that holds user permissions for fast lookup. With that many users, the delay would be understandable: each time you run CREATE USER or FLUSH PRIVILEGES, that table is destroyed and rebuilt.

1,067 users.
1,047 databases.

So, it looks like I’ll add a simple sleep to my script :confused:

Don’t add a simple sleep, use wsrep_sync_wait. You can do that like this:

mysql> SELECT /*+ SET_VAR(wsrep_sync_wait=1) */ user, host FROM mysql.user WHERE user = 'foobar';

or run SET SESSION wsrep_sync_wait = 1 first, then execute the query.
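
For example, scoped to just the verification read (the 'foobar' user is a placeholder, as above):

SET SESSION wsrep_sync_wait = 1;
SELECT user, host FROM mysql.user WHERE user = 'foobar';
SET SESSION wsrep_sync_wait = 0;

With wsrep_sync_wait = 1, the node waits until it has applied everything replicated up to the moment the SELECT starts, so a user created on the other node is guaranteed to be visible before the read runs.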