PXC 5.7.32-35 node crashing (signal 11) when joining the cluster

Hi,

We have a client’s Percona XtraDB Cluster 5.7.32-35 three-node cluster (running in an AWS environment) where one node was killed a couple of weeks back by the Linux OOM killer, and since then we have been unable to get that node to rejoin the cluster. The other two nodes seem fine, but each time we start the problem node it completes SST (usually around 90 minutes) and then falls over at the “WSREP: Shifting JOINER -> JOINED” stage.

The /var/lib/mysql directory is cleared out before each restart, and the crash always takes the following form:

2021-09-24T09:55:52.308457-05:00 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 184316203)
13:55:52 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://jira.percona.com/projects/PXC/issues

key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=0
max_threads=501
thread_count=9
connection_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 232161 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

2021-09-24T09:55:52.906019-05:00 0 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000
Build ID: 591c8fe13efc657ca532af3e8a3bf148c55f8cb8
Server Version: 5.7.32-35-57 Percona XtraDB Cluster (GPL), Release rel35, Revision 2055835, WSREP version 31.47, wsrep_31.47

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x80000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xf5aa9b]
/usr/sbin/mysqld(handle_fatal_signal+0x505)[0xd6e325]
/lib64/libpthread.so.0(+0xf630)[0x7f7a73fcb630]
/usr/lib64/galera3/libgalera_smm.so(+0x22d53)[0x7f7a6742dd53]
/usr/lib64/galera3/libgalera_smm.so(+0x4e9c7)[0x7f7a674599c7]
/usr/lib64/galera3/libgalera_smm.so(+0x16866f)[0x7f7a6757366f]
/usr/lib64/galera3/libgalera_smm.so(+0x169332)[0x7f7a67574332]
/lib64/libpthread.so.0(+0x7ea5)[0x7f7a73fc3ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f7a721909fd]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.

We’ve done plenty of googling and found a number of similar-but-not-quite-matching issues, most of them apparently for considerably older versions of PXC.

We’d greatly appreciate any wisdom or advice on what else to try.

System Details:

  • Server type: m5.8xlarge (32 vCPU, 128 GB RAM)
  • OS: RHEL7.9
  • PXC Version: 5.7.32-35
  • XtraBackup Version: 2.4.21
  • SELinux: disabled

Log (from startup to crash; IP addresses masked):

2021-09-24T08:26:23.069804-05:00 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2021-09-24T08:26:23.071527-05:00 0 [Note] /usr/sbin/mysqld (mysqld 5.7.32-35-57) starting as process 27054 ...
2021-09-24T08:26:23.075455-05:00 0 [Note] WSREP: Setting wsrep_ready to false
2021-09-24T08:26:23.075479-05:00 0 [Note] WSREP: No pre-stored wsrep-start position found. Skipping position initialization.
2021-09-24T08:26:23.075490-05:00 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera3/libgalera_smm.so'
2021-09-24T08:26:23.076300-05:00 0 [Note] WSREP: wsrep_load(): Galera 3.47(raf7cd63) by Codership Oy <info@codership.com> loaded successfully.
2021-09-24T08:26:23.076415-05:00 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
2021-09-24T08:26:23.076796-05:00 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 0
2021-09-24T08:26:23.079249-05:00 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = <node3-ip-address>; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 10; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 4; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.freeze_purge_at_seqno = -1; gcache.keep_pages_count = 0; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 4G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 100; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = true; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 9; socket.checksum = 2; socket.recv_buf_size = auto; socket.send_buf_size = auto; 
2021-09-24T08:26:23.090990-05:00 0 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 00000000-0000-0000-0000-000000000000:-1
2021-09-24T08:26:23.091492-05:00 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2021-09-24T08:26:23.091509-05:00 0 [Note] WSREP: Preparing to initiate SST/IST
2021-09-24T08:26:23.091529-05:00 0 [Note] WSREP: Starting replication
2021-09-24T08:26:23.091540-05:00 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2021-09-24T08:26:23.091674-05:00 0 [Note] WSREP: Using CRC-32C for message checksums.
2021-09-24T08:26:23.091742-05:00 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2021-09-24T08:26:23.091821-05:00 0 [Warning] WSREP: Fail to access the file (/var/lib/mysql//gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown
2021-09-24T08:26:23.091834-05:00 0 [Note] WSREP: Restoring primary-component from disk failed. Either node is booting for first time or re-booting after a graceful shutdown
2021-09-24T08:26:23.092003-05:00 0 [Note] WSREP: GMCast version 0
2021-09-24T08:26:23.092213-05:00 0 [Note] WSREP: (a12745b6, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2021-09-24T08:26:23.092225-05:00 0 [Note] WSREP: (a12745b6, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2021-09-24T08:26:23.092494-05:00 0 [Note] WSREP: EVS version 0
2021-09-24T08:26:23.092598-05:00 0 [Note] WSREP: gcomm: connecting to group 'pxc-cluster', peer '<node1-ip-address>:,<node2-ip-address>:,<node3-ip-address>:'
2021-09-24T08:26:23.093192-05:00 0 [Note] WSREP: (a12745b6, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://<node3-ip-address>:4567
2021-09-24T08:26:23.093698-05:00 0 [Note] WSREP: (a12745b6, 'tcp://0.0.0.0:4567') connection established to d5aa8aca tcp://<node2-ip-address>:4567
2021-09-24T08:26:23.093762-05:00 0 [Note] WSREP: (a12745b6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2021-09-24T08:26:23.095692-05:00 0 [Note] WSREP: (a12745b6, 'tcp://0.0.0.0:4567') connection established to 6a281051 tcp://<node1-ip-address>:4567
2021-09-24T08:26:23.594928-05:00 0 [Note] WSREP: declaring 6a281051 at tcp://<node1-ip-address>:4567 stable
2021-09-24T08:26:23.594957-05:00 0 [Note] WSREP: declaring d5aa8aca at tcp://<node2-ip-address>:4567 stable
2021-09-24T08:26:23.595897-05:00 0 [Note] WSREP: Node 6a281051 state primary
2021-09-24T08:26:23.596846-05:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,6a281051,51)
memb {
	6a281051,0
	a12745b6,0
	d5aa8aca,0
	}
joined {
	}
left {
	}
partitioned {
	}
)
2021-09-24T08:26:23.596868-05:00 0 [Note] WSREP: Save the discovered primary-component to disk
2021-09-24T08:26:24.093095-05:00 0 [Note] WSREP: gcomm: connected
2021-09-24T08:26:24.093175-05:00 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2021-09-24T08:26:24.093233-05:00 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2021-09-24T08:26:24.093264-05:00 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2021-09-24T08:26:24.093265-05:00 0 [Note] WSREP: Waiting for SST/IST to complete.
2021-09-24T08:26:24.093316-05:00 0 [Note] WSREP: STATE EXCHANGE: sent state msg: a174b257-1d32-11ec-b764-a74e48ed5148
2021-09-24T08:26:24.093329-05:00 0 [Note] WSREP: STATE EXCHANGE: got state msg: a174b257-1d32-11ec-b764-a74e48ed5148 from 0 (pxc-node-322)
2021-09-24T08:26:24.093336-05:00 0 [Note] WSREP: STATE EXCHANGE: got state msg: a174b257-1d32-11ec-b764-a74e48ed5148 from 2 (pxc-node-323)
2021-09-24T08:26:24.093424-05:00 1 [Note] WSREP: wsrep running threads now: 1
2021-09-24T08:26:24.093456-05:00 2 [Note] WSREP: wsrep running threads now: 2
2021-09-24T08:26:24.094332-05:00 0 [Note] WSREP: STATE EXCHANGE: got state msg: a174b257-1d32-11ec-b764-a74e48ed5148 from 1 (pxc-node-383)
2021-09-24T08:26:24.094348-05:00 0 [Note] WSREP: Quorum results:
	version    = 6,
	component  = PRIMARY,
	conf_id    = 47,
	members    = 2/3 (primary/total),
	act_id     = 183905234,
	last_appl. = -1,
	protocols  = 0/9/3 (gcs/repl/appl),
	group UUID = 727b6fd8-f9e4-11eb-b574-6ae063fe35b9
2021-09-24T08:26:24.094359-05:00 0 [Note] WSREP: Flow-control interval: [173, 173]
2021-09-24T08:26:24.094365-05:00 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 183905234)
2021-09-24T08:26:24.094413-05:00 2 [Note] WSREP: State transfer required: 
	Group state: 727b6fd8-f9e4-11eb-b574-6ae063fe35b9:183905234
	Local state: 00000000-0000-0000-0000-000000000000:-1
2021-09-24T08:26:24.094431-05:00 2 [Note] WSREP: REPL Protocols: 9 (4, 2)
2021-09-24T08:26:24.094439-05:00 2 [Note] WSREP: REPL Protocols: 9 (4, 2)
2021-09-24T08:26:24.094450-05:00 2 [Note] WSREP: New cluster view: global state: 727b6fd8-f9e4-11eb-b574-6ae063fe35b9:183905234, view# 48: Primary, number of nodes: 3, my index: 1, protocol version 3
2021-09-24T08:26:24.094459-05:00 2 [Note] WSREP: Setting wsrep_ready to true
2021-09-24T08:26:24.094464-05:00 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2021-09-24T08:26:24.094469-05:00 2 [Note] WSREP: Setting wsrep_ready to false
2021-09-24T08:26:24.094989-05:00 0 [Note] WSREP: Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '<node3-ip-address>' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '27054' --mysqld-version '5.7.32-35-57'   '' )
	2021-09-24T08:26:24.149173-04:00 WSREP_SST: [DEBUG] The xtrabackup version is 2.4.21
	2021-09-24T08:26:24.214965-04:00 WSREP_SST: [WARNING] wsrep_node_address or wsrep_sst_receive_address not set. Consider setting them if SST fails.
	2021-09-24T08:26:24.379644-04:00 WSREP_SST: [INFO] Streaming with xbstream
	2021-09-24T08:26:24.381302-04:00 WSREP_SST: [DEBUG] Using socat as streamer
	2021-09-24T08:26:24.385814-04:00 WSREP_SST: [WARNING] Found a stale sst_in_progress file: /var/lib/mysql//sst_in_progress
	2021-09-24T08:26:24.405284-04:00 WSREP_SST: [DEBUG] 1112: Using /proc/pid/net/tcp for socat/nc discovery
	2021-09-24T08:26:24.408662-04:00 WSREP_SST: [DEBUG] 1143: local_address index is 2
	2021-09-24T08:26:24.412543-04:00 WSREP_SST: [DEBUG] 1156: Entering loop body : 1
	2021-09-24T08:26:24.425789-04:00 WSREP_SST: [DEBUG] Evaluating socat -u TCP-LISTEN:4444,reuseaddr,retry=30 stdio | xbstream -x  $xbstream_eopts
	2021-09-24T08:26:24.632694-04:00 WSREP_SST: [DEBUG] 1156: Entering loop body : 2
	2021-09-24T08:26:24.651569-04:00 WSREP_SST: [DEBUG] 1165: Examining pid: 27541
	2021-09-24T08:26:24.759453-04:00 WSREP_SST: [DEBUG] 1182: sockets: 8718194|8718195|8718196
	2021-09-24T08:26:24.765173-04:00 WSREP_SST: [DEBUG] 1206: found a match for pid: 27541
	2021-09-24T08:26:24.766593-04:00 WSREP_SST: [DEBUG] 1222: wait_for_listen() exiting
2021-09-24T08:26:24.766868-05:00 2 [Note] WSREP: Prepared SST/IST request: xtrabackup-v2|<node3-ip-address>:4444/xtrabackup_sst//1
2021-09-24T08:26:24.766892-05:00 2 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 2) (Increment: 1 -> 3)
2021-09-24T08:26:24.766907-05:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-09-24T08:26:24.766936-05:00 2 [Note] WSREP: Assign initial position for certification: 183905234, protocol version: 4
2021-09-24T08:26:24.766979-05:00 0 [Note] WSREP: Service thread queue flushed.
2021-09-24T08:26:24.767022-05:00 2 [Note] WSREP: Check if state gap can be serviced using IST
2021-09-24T08:26:24.767039-05:00 2 [Note] WSREP: Local UUID: 00000000-0000-0000-0000-000000000000 != Group UUID: 727b6fd8-f9e4-11eb-b574-6ae063fe35b9
2021-09-24T08:26:24.767080-05:00 2 [Note] WSREP: State gap can't be serviced using IST. Switching to SST
2021-09-24T08:26:24.767092-05:00 2 [Note] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (727b6fd8-f9e4-11eb-b574-6ae063fe35b9): 1 (Operation not permitted)
	 at galera/src/replicator_str.cpp:prepare_for_IST():516. IST will be unavailable.
2021-09-24T08:26:24.768134-05:00 0 [Note] WSREP: Member 1.0 (pxc-node-383) requested state transfer from 'pxc-node-323'. Selected 2.0 (pxc-node-323)(SYNCED) as donor.
2021-09-24T08:26:24.768157-05:00 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 183905237)
2021-09-24T08:26:24.768183-05:00 2 [Note] WSREP: Requesting state transfer: success, donor: 2
2021-09-24T08:26:24.768198-05:00 2 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 727b6fd8-f9e4-11eb-b574-6ae063fe35b9:183905234
	2021-09-24T08:26:25.470549-04:00 WSREP_SST: [INFO] Streaming with xbstream
	2021-09-24T08:26:25.478271-04:00 WSREP_SST: [INFO] Proceeding with SST.........
	2021-09-24T08:26:25.479082-04:00 WSREP_SST: [DEBUG] Evaluating (@ Joiner-Recv-SST) socat -u TCP-LISTEN:4444,reuseaddr,retry=30 stdio | xbstream -x  $xbstream_eopts; RC=( ${PIPESTATUS[@]} )
	2021-09-24T08:26:25.479804-04:00 WSREP_SST: [DEBUG] Cleaning the existing datadir and innodb-data/log directories
2021-09-24T08:26:26.593214-05:00 0 [Note] WSREP: (a12745b6, 'tcp://0.0.0.0:4567') turning message relay requesting off
	2021-09-24T08:26:30.875401-04:00 WSREP_SST: [INFO] ............Waiting for SST streaming to complete!
2021-09-24T08:38:33.834870-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 134217728 bytes
<...and so on up to gcache page 201>
2021-09-24T09:51:08.057700-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000202 of size 134217728 bytes
2021-09-24T09:53:48.245501-05:00 0 [Note] WSREP: 2.0 (pxc-node-323): State transfer to 1.0 (pxc-node-383) complete.
2021-09-24T09:53:48.246467-05:00 0 [Note] WSREP: Member 2.0 (pxc-node-323) synced with group.
	2021-09-24T09:53:48.252650-04:00 WSREP_SST: [INFO] Preparing the backup at /var/lib/mysql//.sst
	2021-09-24T09:53:48.254192-04:00 WSREP_SST: [DEBUG] Evaluating (@ Xtrabackup prepare stage) xtrabackup   --use-memory=60G --prepare --binlog-info=ON $rebuildcmd $keyringapplyopt $encrypt_prepare_options --target-dir=${DATA} &>${DATA}/innobackup.prepare.log
2021-09-24T09:53:49.789049-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000203 of size 134217728 bytes
2021-09-24T09:53:52.229286-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000204 of size 134217728 bytes
2021-09-24T09:53:54.131285-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000205 of size 134217728 bytes
2021-09-24T09:53:56.305779-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000206 of size 134217728 bytes
2021-09-24T09:53:59.760846-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000207 of size 134217728 bytes
2021-09-24T09:54:03.866464-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000208 of size 134217728 bytes
2021-09-24T09:54:06.525582-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000209 of size 134217728 bytes
2021-09-24T09:54:10.549530-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000210 of size 134217728 bytes
2021-09-24T09:54:24.472603-05:00 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000211 of size 134217728 bytes
	2021-09-24T09:55:48.322174-04:00 WSREP_SST: [INFO] Moving the backup to /var/lib/mysql/
	2021-09-24T09:55:48.323751-04:00 WSREP_SST: [DEBUG] Evaluating (@ Xtrabackup move stage) xtrabackup --defaults-file=/etc/my.cnf  --defaults-group=mysqld --datadir=${TDATA}    --move-back --binlog-info=ON --force-non-empty-directories $encrypt_move_options --target-dir=${DATA} &>${DATA}/innobackup.move.log
	2021-09-24T09:55:48.436638-04:00 WSREP_SST: [DEBUG] Move successful, removing /var/lib/mysql//.sst
	2021-09-24T09:55:49.050303-04:00 WSREP_SST: [INFO] Galera co-ords from recovery: 727b6fd8-f9e4-11eb-b574-6ae063fe35b9:184303538
	2021-09-24T09:55:49.053005-04:00 WSREP_SST: [DEBUG] Total time on joiner: 0 seconds
	2021-09-24T09:55:49.054914-04:00 WSREP_SST: [DEBUG] Removing the sst_in_progress file
2021-09-24T09:55:49.076409-05:00 0 [Note] WSREP: SST complete, seqno: 184303538
2021-09-24T09:55:49.079249-05:00 0 [Note] InnoDB: PUNCH HOLE support available
2021-09-24T09:55:49.079280-05:00 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2021-09-24T09:55:49.079288-05:00 0 [Note] InnoDB: Uses event mutexes
2021-09-24T09:55:49.079294-05:00 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2021-09-24T09:55:49.079302-05:00 0 [Note] InnoDB: Compressed tables use zlib 1.2.7
2021-09-24T09:55:49.079311-05:00 0 [Note] InnoDB: Using Linux native AIO
2021-09-24T09:55:49.080118-05:00 0 [Note] InnoDB: Number of pools: 1
2021-09-24T09:55:49.080247-05:00 0 [Note] InnoDB: Using CPU crc32 instructions
2021-09-24T09:55:49.082661-05:00 0 [Note] InnoDB: Initializing buffer pool, total size = 60G, instances = 12, chunk size = 128M
2021-09-24T09:55:50.610824-05:00 0 [Note] InnoDB: Completed initialization of buffer pool
2021-09-24T09:55:50.970716-05:00 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2021-09-24T09:55:50.987084-05:00 0 [Note] InnoDB: Crash recovery did not find the parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite
2021-09-24T09:55:50.988485-05:00 0 [Note] InnoDB: Highest supported file format is Barracuda.
2021-09-24T09:55:51.423772-05:00 0 [Note] InnoDB: Created parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite, size 47185920 bytes
2021-09-24T09:55:52.217240-05:00 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2021-09-24T09:55:52.217272-05:00 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2021-09-24T09:55:52.217311-05:00 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2021-09-24T09:55:52.245190-05:00 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2021-09-24T09:55:52.245841-05:00 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2021-09-24T09:55:52.245861-05:00 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2021-09-24T09:55:52.249222-05:00 0 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.7.32-35 started; log sequence number 14829582025768
2021-09-24T09:55:52.249335-05:00 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2021-09-24T09:55:52.249541-05:00 0 [Note] Plugin 'FEDERATED' is disabled.
2021-09-24T09:55:52.257326-05:00 0 [Note] WSREP: Initial TC log open: dummy
2021-09-24T09:55:52.257580-05:00 0 [Note] Salting uuid generator variables, current_pid: 27054, server_start_time: 1632486383, bytes_sent: 0, 
2021-09-24T09:55:52.257633-05:00 0 [Note] Generated uuid: '216cf3c3-1d3f-11ec-828a-0a46f662ba59', server_start_time: 7615024021562600861, bytes_sent: 54703856
2021-09-24T09:55:52.257655-05:00 0 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: 216cf3c3-1d3f-11ec-828a-0a46f662ba59.
2021-09-24T09:55:52.260591-05:00 0 [Note] WSREP: SYSTEM_THREAD_COMPRESS_GTID_TABLE declared  as non wsrep_on
2021-09-24T09:55:52.262674-05:00 0 [Note] Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
2021-09-24T09:55:52.262695-05:00 0 [Note] Skipping generation of SSL certificates as certificate files are present in data directory.
2021-09-24T09:55:52.263200-05:00 0 [Warning] CA certificate ca.pem is self signed.
2021-09-24T09:55:52.263250-05:00 0 [Note] Skipping generation of RSA key pair as key files are present in data directory.
2021-09-24T09:55:52.263348-05:00 0 [Note] Server hostname (bind-address): '*'; port: 3306
2021-09-24T09:55:52.264990-05:00 0 [Note] IPv6 is not available.
2021-09-24T09:55:52.265004-05:00 0 [Note]   - '0.0.0.0' resolves to '0.0.0.0';
2021-09-24T09:55:52.265025-05:00 0 [Note] Server socket created on IP: '0.0.0.0'.
2021-09-24T09:55:52.270563-05:00 0 [Note] WSREP: cleanup transaction for LOCAL_STATE: (null)
2021-09-24T09:55:52.302348-05:00 0 [Note] Event Scheduler: Loaded 0 events
2021-09-24T09:55:52.302448-05:00 0 [Note] WSREP: Signalling provider to continue on SST completion.
2021-09-24T09:55:52.302468-05:00 0 [Note] WSREP: Read WSREPXid (InnoDB): 727b6fd8-f9e4-11eb-b574-6ae063fe35b9:184303538
2021-09-24T09:55:52.302483-05:00 0 [Note] WSREP: Initialized wsrep sidno 2
2021-09-24T09:55:52.302503-05:00 0 [Note] WSREP: SST received: 727b6fd8-f9e4-11eb-b574-6ae063fe35b9:184303538
2021-09-24T09:55:52.302701-05:00 3 [Note] WSREP: wsrep running threads now: 3
2021-09-24T09:55:52.302806-05:00 4 [Note] WSREP: wsrep running threads now: 4
2021-09-24T09:55:52.302916-05:00 5 [Note] WSREP: wsrep running threads now: 5
2021-09-24T09:55:52.302983-05:00 8 [Note] WSREP: wsrep running threads now: 6
2021-09-24T09:55:52.303001-05:00 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.7.32-35-57'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Percona XtraDB Cluster (GPL), Release rel35, Revision 2055835, WSREP version 31.47, wsrep_31.47
2021-09-24T09:55:52.303014-05:00 6 [Note] WSREP: wsrep running threads now: 7
2021-09-24T09:55:52.303005-05:00 7 [Note] WSREP: Compress GTID table thread will run with wsrep_on=0/false
2021-09-24T09:55:52.303052-05:00 9 [Note] WSREP: wsrep running threads now: 8
2021-09-24T09:55:52.303100-05:00 10 [Note] WSREP: wsrep running threads now: 9
2021-09-24T09:55:52.308424-05:00 0 [Note] WSREP: 1.0 (pxc-node-383): State transfer from 2.0 (pxc-node-323) complete.
2021-09-24T09:55:52.308448-05:00 0 [Note] WSREP: SST leaving flow control
2021-09-24T09:55:52.308457-05:00 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 184316203)
13:55:52 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://jira.percona.com/projects/PXC/issues

key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=0
max_threads=501
thread_count=9
connection_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 232161 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.


2021-09-24T09:55:52.906019-05:00 0 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000
Build ID: 591c8fe13efc657ca532af3e8a3bf148c55f8cb8
Server Version: 5.7.32-35-57 Percona XtraDB Cluster (GPL), Release rel35, Revision 2055835, WSREP version 31.47, wsrep_31.47

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x80000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xf5aa9b]
/usr/sbin/mysqld(handle_fatal_signal+0x505)[0xd6e325]
/lib64/libpthread.so.0(+0xf630)[0x7f7a73fcb630]
/usr/lib64/galera3/libgalera_smm.so(+0x22d53)[0x7f7a6742dd53]
/usr/lib64/galera3/libgalera_smm.so(+0x4e9c7)[0x7f7a674599c7]
/usr/lib64/galera3/libgalera_smm.so(+0x16866f)[0x7f7a6757366f]
/usr/lib64/galera3/libgalera_smm.so(+0x169332)[0x7f7a67574332]
/lib64/libpthread.so.0(+0x7ea5)[0x7f7a73fc3ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f7a721909fd]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.

Config file (IP addresses masked):

[mysqld]
log_bin_trust_function_creators = 1
thread_stack = 524288

# GENERAL #
basedir                        = /usr/
user                           = mysql
default-storage-engine         = InnoDB
socket                         = /var/run/mysqld/mysqld.sock
pid-file                       = /var/run/mysqld/mysqld.pid
port                           = 3306

# MyISAM #
key-buffer-size                = 32M
myisam-recover-options         = FORCE,BACKUP

# SAFETY #
max-allowed-packet             = 128M
max-connect-errors             = 1000000

# DATA STORAGE #
datadir                        = /var/lib/mysql/

# CACHES AND LIMITS #
tmp-table-size                 = 32M
max-heap-table-size            = 32M
query-cache-type               = 0
query-cache-size               = 0
max-connections                = 500
thread-cache-size              = 128
open-files-limit               = 65535
table-definition-cache         = 1024
table-open-cache               = 8000
max_prepared_stmt_count        = 1048576
group_concat_max_len           = 1048576

# INNODB #
innodb-flush-method            = O_DIRECT
innodb-log-files-in-group      = 2
innodb-log-file-size           = 1024M
innodb-flush-log-at-trx-commit = 2
innodb-file-per-table          = 1
innodb-buffer-pool-size        = 90G 
innodb_buffer_pool_instances   = 12
innodb_autoinc_lock_mode       = 2

# LOGGING #
log-timestamps                 = SYSTEM
log-error                      = /var/log/mysql-error.log
log-queries-not-using-indexes  = 0
slow-query-log-file            = /var/log/mysql-slow.log
slow-query-log                 = 0

# REPLICATION #
binlog_format                  = ROW
sync_binlog                    = 0

# PERF #
performance_schema             = ON

default_tmp_storage_engine     = MyISAM

# wsrep settings
wsrep_provider                 = /usr/lib64/galera3/libgalera_smm.so
wsrep_cluster_address          = gcomm://<node1-ip-address>,<node2-ip-address>,<node3-ip-address>
wsrep_log_conflicts
wsrep_cluster_name             = pxc-cluster
wsrep_node_name                = pxc-node-383
pxc_strict_mode                = ENFORCING
wsrep_sst_method               = xtrabackup-v2
wsrep_sst_auth                 = "sstuser:<hidden>"
wsrep_slave_threads            = 8
wsrep_sync_wait                = 0
wsrep_provider_options         = "gcache.size=4G"
wsrep_retry_autocommit         = 10
wsrep_debug                    = ON

[mysqldump]
quick
quote-names
max_allowed_packet             = 16M

[client]
default-character-set=utf8
port                           = 3306
socket                         = /var/run/mysqld/mysqld.sock

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
socket                         = /var/run/mysqld/mysqld.sock
nice                           = 0

[mysql]

[isamchk]

thanks in advance,

Neil

Have you checked dmesg to make sure MySQL is not being killed by OOM?

Have you tried a manual SST process? (Use xtrabackup on another node, stream it to this node, prepare it, and create grastate.dat based on the UUID:seqno produced by xtrabackup.)
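
Roughly, the manual process looks like this. This is only a sketch: the port, paths, credentials and the socat/xbstream options are assumptions you would adapt to your setup, and the donor needs --galera-info so that xtrabackup_galera_info contains the UUID:seqno.

# 1. Confirm the joiner is not being OOM-killed on restart:
dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'

# 2. On the joiner, with an empty datadir, start a listener that unpacks the stream:
socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x -C /var/lib/mysql

# 3. On the donor, stream a backup to the joiner
#    (credentials omitted; --galera-info records the cluster UUID:seqno):
xtrabackup --backup --galera-info --stream=xbstream --target-dir=/tmp \
  | socat -u stdio TCP:<joiner-ip>:4444

# 4. Back on the joiner, prepare the backup, fix ownership, and note the position:
xtrabackup --prepare --target-dir=/var/lib/mysql
chown -R mysql:mysql /var/lib/mysql
cat /var/lib/mysql/xtrabackup_galera_info   # <uuid>:<seqno> to put into grastate.dat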

Thanks @matthewb for the response.

Yes, we’re confident the node is not being OOM-killed on the subsequent restart attempts (based on /var/log/messages); it was only OOM-killed in the original incident, which led to the predicament we’re now in.

Haven’t tried your manual SST suggestion but will look into it - thank you.

From watching it happen a few times, it “feels” like the SST itself is successful, and it’s the point just after that, when the node opens for business, where it always falls over. The obvious caveat is that I’m no expert on what goes on behind the scenes when a node joins the cluster :)

thanks,

Neil

After the SST process completes (which is basically just an xtrabackup --stream), the backup is prepared (using xtrabackup) and mysql is started. The SST process itself is just a backup/clone of another node.
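
The grastate.dat you create by hand is just a small text file. Using the UUID and seqno that xtrabackup reported in your log above, it would look roughly like this:

# GALERA saved state
version: 2.1
uuid:    727b6fd8-f9e4-11eb-b574-6ae063fe35b9
seqno:   184303538
safe_to_bootstrap: 0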

An update on this one: we finally managed to get the node to start and join the cluster, but only by letting it fail as before, quickly deleting all of the /var/lib/mysql/gcache.page.<number> files, and then restarting it. It was then able to do an IST and join the cluster.
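
For clarity, the sequence was roughly as follows (service name assumed to be mysql; yours may differ):

systemctl start mysql                  # completes SST, then crashes with signal 11 as before
rm -f /var/lib/mysql/gcache.page.*     # immediately remove the leftover gcache page files
systemctl start mysql                  # this time the node joins via IST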

So the crash seems somehow related to the deletion of these gcache page files, perhaps as hinted at by log lines like this one appearing amongst the crash output:

2021-09-24T09:55:52.906019-05:00 0 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000

Does anyone have any thoughts on this? As mentioned before, this cluster runs on AWS instances (m5.8xlarge) with various mounted volumes, one of which is /var/lib/mysql.

Worth a bug report?

many thanks,

Neil

Hi @NeilBillett ,
I found a possibly related GitHub issue, “gcache.page files not removed fast enough when many big transactions happen in short time” (codership/galera #317), but it should have been fixed a while ago. You’re not filling up disk space during the SST process, are you?
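
If you want to rule that out, something along these lines run on the joiner while the SST is in flight would show it (assuming /var/lib/mysql is the mount in question):

watch -n 30 'df -h /var/lib/mysql; ls /var/lib/mysql/gcache.page.* 2>/dev/null | wc -l'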

I would say it’s only worth opening a bug report if you can also capture the core dump, or create a reproducible scenario that the dev team can use.
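
If you do try to capture a core dump, one possible setup is roughly the following. The my.cnf options are standard mysqld/mysqld_safe options, but the service unit name and core file location are assumptions for a systemd-managed RHEL 7 install:

# /etc/my.cnf
[mysqld]
core-file

[mysqld_safe]
core-file-size = unlimited

# Kernel settings: allow dumps from privilege-dropping processes, choose a location:
sysctl -w fs.suid_dumpable=2
sysctl -w kernel.core_pattern=/var/lib/mysql/core.%p

# If systemd manages the service, raise its core limit too:
#   systemctl edit mysql   ->   [Service] LimitCORE=infinity
systemctl daemon-reload && systemctl restart mysql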
