Description:
One mysqld process was killed by the OOM (out of memory) killer, then the whole cluster crashed, and now it cannot recover.
Steps to Reproduce:
I have a Kubernetes cluster:
- Harvester 1.3.1
- 2 bare-metal nodes (32 cores Xeon Gold 5218, 128 GB RAM, NVMe disks, 10 Gbps Ethernet) acting as both management and worker nodes, plus 1 witness node
- Installed “Percona Operator for MySQL based on Percona Server” 0.8.0:
helm install percona-op percona/ps-operator --namespace percona --create-namespace
helm install db3 percona/ps-db --namespace percona \
--set unsafeFlags.mysqlSize=true \
--set mysql.size=2 \
--set mysql.volumeSpec.pvc.storageClassName=ssd-1local \
--set mysql.volumeSpec.pvc.resources.requests.storage=20Gi \
--set proxy.haproxy.size=2
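After installation I can verify the pods and the cluster custom resource with plain kubectl (ps is the short name the operator registers for PerconaServerMySQL):
kubectl get pods -n percona
kubectl get ps -n percona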
I know that 2 nodes is not a safe configuration, but everything worked fine for a few days. Today, however, I found the Percona cluster in the error state:
kubectl get ps db3-ps-db -n percona
NAME        REPLICATION         ENDPOINT                    STATE   MYSQL   ORCHESTRATOR   HAPROXY   ROUTER   AGE
db3-ps-db   group-replication   db3-ps-db-haproxy.percona   error   2                      2                  17d
# looking at events:
Events:
  Type     Reason                    Age                    From            Message
  ----     ------                    ----                   ----            -------
  Warning  FullClusterCrashDetected  3m9s (x5271 over 14h)  ps-controller   Full cluster crash detected
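To see the members from MySQL's own point of view, performance_schema can be queried on each pod. A sketch, assuming the root password is stored in a Secret named db3-ps-db-secrets under the root key (the Secret name may differ in your deployment):
ROOT_PW="$(kubectl get secret db3-ps-db-secrets -n percona -o jsonpath='{.data.root}' | base64 -d)"
kubectl exec -n percona db3-ps-db-mysql-0 -- mysql -uroot -p"$ROOT_PW" \
  -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;"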
Logs:
While investigating, I found that the mysqld process had been OOM-killed by the kernel (because of the container memory limit) on one, and only one, of the hosts:
t7920:/home/rancher # dmesg -T
[Sat Jul 27 01:10:03 2024] gcs_xcom invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=997
[Sat Jul 27 01:10:03 2024] CPU: 0 PID: 15334 Comm: gcs_xcom Tainted: G X 5.14.21-150400.24.119-default #1 SLE15-SP4 72b78b4bcd575b4ef23e194de811a4f54afb0381
[Sat Jul 27 01:10:03 2024] Hardware name: Dell Inc. Precision 7920 Tower/060K5C, BIOS 2.1.4 07/29/2019
[Sat Jul 27 01:10:03 2024] Call Trace:
[Sat Jul 27 01:10:03 2024] <TASK>
[Sat Jul 27 01:10:03 2024] dump_stack_lvl+0x45/0x5b
[Sat Jul 27 01:10:03 2024] dump_header+0x4a/0x220
[Sat Jul 27 01:10:03 2024] oom_kill_process+0xe8/0x140
[Sat Jul 27 01:10:03 2024] out_of_memory+0x113/0x580
[Sat Jul 27 01:10:03 2024] mem_cgroup_out_of_memory+0xe3/0x100
[Sat Jul 27 01:10:03 2024] try_charge_memcg+0x6bb/0x700
[Sat Jul 27 01:10:03 2024] ? __alloc_pages+0x180/0x320
[Sat Jul 27 01:10:03 2024] charge_memcg+0x40/0xa0
[Sat Jul 27 01:10:03 2024] __mem_cgroup_charge+0x2c/0xa0
[Sat Jul 27 01:10:03 2024] __handle_mm_fault+0xa37/0x1220
[Sat Jul 27 01:10:03 2024] handle_mm_fault+0xd5/0x290
[Sat Jul 27 01:10:03 2024] do_user_addr_fault+0x1eb/0x730
[Sat Jul 27 01:10:03 2024] ? do_syscall_64+0x67/0x80
[Sat Jul 27 01:10:03 2024] exc_page_fault+0x67/0x150
[Sat Jul 27 01:10:03 2024] asm_exc_page_fault+0x59/0x60
[Sat Jul 27 01:10:03 2024] RIP: 0033:0x7f02b055970a
[Sat Jul 27 01:10:03 2024] Code: 31 c9 48 8d 34 2a 48 39 fb 48 89 73 60 4c 8d 42 10 0f 95 c1 48 29 e8 48 c1 e1 02 48 83 c8 01 48 09 e9 48 83 c9 01 48 89 4a 08 <48> 89 46 08 48 83 c4 48 4c 89 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3
[Sat Jul 27 01:10:03 2024] RSP: 002b:00007f01fe7fae00 EFLAGS: 00010202
[Sat Jul 27 01:10:03 2024] RAX: 0000000000000fc1 RBX: 00007f01bc000020 RCX: 0000000000000115
[Sat Jul 27 01:10:03 2024] RDX: 00007f014c234f30 RSI: 00007f014c235040 RDI: 00007f02b06a6c80
[Sat Jul 27 01:10:03 2024] RBP: 0000000000000110 R08: 00007f014c234f40 R09: 0000000000236000
[Sat Jul 27 01:10:03 2024] R10: 0000000000000130 R11: 0000000000000206 R12: 00000000000000d0
[Sat Jul 27 01:10:03 2024] R13: 0000000000001000 R14: 00007f014c234f30 R15: 0000000000000130
[Sat Jul 27 01:10:03 2024] </TASK>
[Sat Jul 27 01:10:03 2024] memory: usage 1500000kB, limit 1500000kB, failcnt 0
[Sat Jul 27 01:10:03 2024] memory+swap: usage 1500000kB, limit 1500000kB, failcnt 20703
[Sat Jul 27 01:10:03 2024] kmem: usage 7192kB, limit 9007199254740988kB, failcnt 0
[Sat Jul 27 01:10:03 2024] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf366b2e3_b354_4eab_af12_bdd6dee78c36.slice/cri-containerd-ec228c92ac8549ead3673ff4bb83ef44c8b41fe812942e9b72a38cb09918bd7d.scope:
[Sat Jul 27 01:10:03 2024] anon 1528438784
file 196608
kernel_stack 1064960
pagetables 3567616
percpu 0
sock 0
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 1436549120
file_thp 0
shmem_thp 0
inactive_anon 1528434688
active_anon 4096
inactive_file 196608
active_file 0
unevictable 0
slab_reclaimable 1659808
slab_unreclaimable 1039216
slab 2699024
workingset_refault_anon 0
workingset_refault_file 29562
workingset_activate_anon 0
workingset_activate_file 2884
workingset_restore_anon 0
workingset_restore_file 1905
workingset_nodereclaim 0
pgfault 268222950
pgmajfault 426
pgrefill 13597
pgscan 104748
pgsteal 101384
pgactivate 40159
pgdeactivate 13520
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 314424
thp_collapse_alloc 529
[Sat Jul 27 01:10:03 2024] Tasks state (memory values in pages):
[Sat Jul 27 01:10:03 2024] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Sat Jul 27 01:10:03 2024] [ 15059] 1001 15059 1495894 379550 3579904 0 997 mysqld
[Sat Jul 27 01:10:03 2024] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-ec228c92ac8549ead3673ff4bb83ef44c8b41fe812942e9b72a38cb09918bd7d.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf366b2e3_b354_4eab_af12_bdd6dee78c36.slice/cri-containerd-ec228c92ac8549ead3673ff4bb83ef44c8b41fe812942e9b72a38cb09918bd7d.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf366b2e3_b354_4eab_af12_bdd6dee78c36.slice/cri-containerd-ec228c92ac8549ead3673ff4bb83ef44c8b41fe812942e9b72a38cb09918bd7d.scope,task=mysqld,pid=15059,uid=1001
[Sat Jul 27 01:10:03 2024] Memory cgroup out of memory: Killed process 15059 (mysqld) total-vm:5983576kB, anon-rss:1490300kB, file-rss:27900kB, shmem-rss:0kB, UID:1001 pgtables:3496kB oom_score_adj:997
[Sat Jul 27 01:10:03 2024] oom_reaper: reaped process 15059 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
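The numbers line up with the container limit: 1536M in the pod spec is 1,536,000,000 bytes = 1,500,000 KiB, exactly the "limit 1500000kB" shown above, and mysqld's RSS had grown to fill it. Which containers were OOM-killed can also be read back from the pod status with standard kubectl (nothing operator-specific here):
kubectl get pods -n percona -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.containerStatuses[*]}{.name}={.lastState.terminated.reason}{" "}{end}{"\n"}{end}'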
After the OOM killer terminated the process, here is what I see in the MySQL pod logs:
kubectl logs -n percona db3-ps-db-mysql-0
2024-07-27T01:11:56.758865Z 14 [Note] [MY-010581] [Repl] Replica SQL thread for channel 'group_replication_applier' initialized, starting replication in log 'INVALID' at position 0, relay log './db3-ps-db-mysql-0-relay-bin-group_replication_applier.000002' position: 4
2024-07-27T01:11:56.882234Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Using MySQL as Communication Stack for XCom'
2024-07-27T01:11:56.882446Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Successfully connected to the local XCom via anonymous pipe'
2024-07-27T01:11:56.962547Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2024-07-27T01:11:56.962594Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node db3-ps-db-mysql-1.db3-ps-db-mysql.percona:3306 when joining a group. My local port is: 3306.'
...
2024-07-27T01:11:56.999063Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 3306'
2024-07-27T01:11:57.074696Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 3306'
2024-07-27T01:11:57.074753Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sleeping for 5 seconds before retrying to join the group. There are 9 more attempt(s) before giving up.'
...
2024-07-27T01:12:49.678061Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Request failed: maximum number of retries (10) has been exhausted.'
2024-07-27T01:12:49.678122Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Failed to send add_node request to a peer XCom node.'
2024-07-27T01:12:49.701341Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] This node received a Configuration change request, but it not yet started. This could happen if one starts several nodes simultaneously. This request will be retried by whoever sent it.'
2024-07-27T01:12:49.738710Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] TCP_NODELAY already set'
2024-07-27T01:12:49.738739Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sucessfully connected to peer db3-ps-db-mysql-1.db3-ps-db-mysql.percona:3306. Sending a request to be added to the group'
2024-07-27T01:12:49.738752Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sending add_node request to a peer XCom node'
2024-07-27T01:12:49.739435Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Retrying a request to a remote XCom. Please check the remote node log for more details.'
2024-07-27T01:12:50.702047Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] This node received a Configuration change request, but it not yet started. This could happen if one starts several nodes simultaneously. This request will be retried by whoever sent it.'
...
2024-07-27T01:12:56.745411Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Retrying a request to a remote XCom. Please check the remote node log for more details.'
2024-07-27T01:12:56.759277Z 4 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2024-07-27T01:12:56.759348Z 4 [Note] [MY-011649] [Repl] Plugin group_replication reported: 'Requesting to leave the group despite of not being a member'
2024-07-27T01:12:56.759387Z 4 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
2024-07-27T01:12:57.708716Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] This node received a Configuration change request, but it not yet started. This could happen if one starts several nodes simultaneously. This request will be retried by whoever sent it.'
2024-07-27T01:12:57.746136Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Retrying a request to a remote XCom. Please check the remote node log for more details.'
2024-07-27T01:12:58.709674Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] This node received a Configuration change request, but it not yet started. This could happen if one starts several nodes simultaneously. This request will be retried by whoever sent it.'
2024-07-27T01:12:58.747064Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Retrying a request to a remote XCom. Please check the remote node log for more details.'
2024-07-27T01:12:59.747244Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Request failed: maximum number of retries (10) has been exhausted.'
2024-07-27T01:12:59.747321Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Failed to send add_node request to a peer XCom node.'
2024-07-27T01:12:59.751200Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 3306'
2024-07-27T01:12:59.831740Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 3306'
2024-07-27T01:12:59.831800Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sleeping for 5 seconds before retrying to join the group. There are 3 more attempt(s) before giving up.'
2024-07-27T01:13:04.833041Z 14 [Note] [MY-010596] [Repl] Error reading relay log event for channel 'group_replication_applier': replica SQL thread was killed
2024-07-27T01:13:04.833449Z 14 [Note] [MY-010587] [Repl] Replica SQL thread for channel 'group_replication_applier' exiting, replication stopped in log 'FIRST' at position 0
2024-07-27T01:13:04.833626Z 12 [Note] [MY-011444] [Repl] Plugin group_replication reported: 'The group replication applier thread was killed.'
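Because of the repeated "Error on opening a connection to peer node" messages, I also checked basic reachability of pod 1 from pod 0. A quick probe, assuming getent and the mysqladmin client are present in the image (they normally are in the Percona Server images):
kubectl exec -n percona db3-ps-db-mysql-0 -- getent hosts db3-ps-db-mysql-1.db3-ps-db-mysql.percona
kubectl exec -n percona db3-ps-db-mysql-0 -- mysqladmin ping -h db3-ps-db-mysql-1.db3-ps-db-mysql.percona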
The Percona operator logs show:
kubectl logs -n percona percona-op-ps-operator-7b4d5d6d8c-cmnbk
2024-07-27T01:12:50.677Z INFO groupReplicationStatus.db3-ps-db-mysql-0.db3-ps-db-mysql.percona Member is not ONLINE {"controller": "ps-controller", "controllerGroup": "ps.percona.com", "controllerKind": "PerconaServerMySQL", "PerconaServerMySQL": {"name":"db3-ps-db","namespace":"percona"}, "namespace": "percona", "name": "db3-ps-db", "reconcileID": "54e6e0b9-517b-45b0-aa59-454f545098df", "state": "OFFLINE"}
2024-07-27T01:12:50.997Z INFO Crash recovery Pod is waiting for recovery {"controller": "ps-controller", "controllerGroup": "ps.percona.com", "controllerKind": "PerconaServerMySQL", "PerconaServerMySQL": {"name":"db3-ps-db","namespace":"percona"}, "namespace": "percona", "name": "db3-ps-db", "reconcileID": "21b59b9d-b37d-4a79-9d29-f852a38ade8c", "pod": "db3-ps-db-mysql-0", "gtidExecuted": "c23bdefa-3e77-11ef-ac3d-32605eb3ef85:1-4,cd68adda-3e77-11ef-a089-32605eb3ef85:1-5239:1000386-1000395,cd68b3ae-3e77-11ef-a089-32605eb3ef85:1-34"}
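Since the operator is waiting on crash recovery, it can help to compare GTID_EXECUTED across the members to see which one holds the most advanced transaction set. A sketch, again assuming the root password lives in the db3-ps-db-secrets Secret:
ROOT_PW="$(kubectl get secret db3-ps-db-secrets -n percona -o jsonpath='{.data.root}' | base64 -d)"
for i in 0 1; do
  kubectl exec -n percona db3-ps-db-mysql-$i -- \
    mysql -uroot -p"$ROOT_PW" -Nse "SELECT @@HOSTNAME, @@GLOBAL.gtid_executed;"
done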
The resource settings I had for the mysql containers:
resources:
  limits:
    memory: 1536M
  requests:
    memory: 512M
So I increased the RAM limit to 3G with this command:
kubectl edit ps -n percona db3-ps-db
resources:
  limits:
    memory: 3G
  requests:
    memory: 1G
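Besides raising the limit, it may be worth capping mysqld's own memory consumers so the process fits inside the container limit; MySQL 8 group replication keeps an XCom message cache that defaults to 1 GiB, which alone nearly fills a 1.5 GB container. A sketch of a my.cnf snippet in the CR, assuming this operator version exposes spec.mysql.configuration (values are illustrative, not tuned):
spec:
  mysql:
    configuration: |
      [mysqld]
      # illustrative values only - size these for your workload
      innodb_buffer_pool_size = 1G
      group_replication_message_cache_size = 256M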
But the limit change alone didn’t restart the cluster, so I restarted it by setting spec.pause to true, waiting one minute, and then setting it back to false:
kubectl patch ps db3-ps-db -n percona --type='merge' -p '{"spec":{"pause":true}}'
kubectl patch ps db3-ps-db -n percona --type='merge' -p '{"spec":{"pause":false}}'
The cluster restarted, but it went into the error state again.
What else can I do to try to fix it?