One of the pxc pods is getting OOM-killed on startup (and, I assume, during recovery).

OpenShift 4.12 (Kubernetes 1.25)
Operator version is 1.12 (we’re about to upgrade to 1.13)

{"log":"2023-11-20T17:22:01.850635Z 0 [Note] [MY-012487] [InnoDB] DDL log recovery : begin\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:01.850973Z 0 [Note] [MY-012488] [InnoDB] DDL log recovery : end\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:01.854652Z 0 [Note] [MY-012922] [InnoDB] Waiting for purge to start\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:01.905220Z 0 [Note] [MY-000000] [WSREP] Recovered position: 9d49aa78-10b4-11ee-accf-4b4f95de788c:50379049\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:01.906678Z 0 [Note] [MY-012330] [InnoDB] FTS optimize thread exiting.\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:02.793637Z 0 [Note] [MY-010120] [Server] Binlog end\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}

It goes through shutting down a load of plugins, and then I get:

{"log":"2023-11-20T17:22:02.795135Z 0 [Note] [MY-013072] [InnoDB] Starting shutdown...\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:02.803293Z 0 [Note] [MY-013084] [InnoDB] Log background threads are being closed...\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640393Z 0 [Note] [MY-012980] [InnoDB] Shutdown completed; log sequence number 459786086280\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640599Z 0 [Note] [MY-012255] [InnoDB] Removed temporary tablespace data file: \"ibtmp1\"\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640632Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'MEMORY'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640649Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'CSV'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640663Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'PERFORMANCE_SCHEMA'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640724Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'wsrep'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640735Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'daemon_keyring_proxy_plugin'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640755Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'sha2_cache_cleaner'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640767Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'caching_sha2_password'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640782Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'sha256_password'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640792Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'mysql_native_password'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.641581Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'binlog'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.641984Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.29-21.1)  Percona XtraDB Cluster (GPL), Release rel21, Revision 250bc93, WSREP version 26.4.3.\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:15.212552Z 0 [Warning] [MY-011068] [Server] The syntax 'wsrep_slave_threads' is deprecated and will be removed in a future release. Please use wsrep_applier_threads instead.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.214125Z 0 [Warning] [MY-010097] [Server] Insecure configuration for --secure-log-path: Current value does not restrict location of generated files. Consider setting it to a valid, non-empty path.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.214653Z 0 [Warning] [MY-010918] [Server] 'default_authentication_plugin' is deprecated and will be removed in a future release. Please use authentication_policy instead.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.214667Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.29-21.1) starting as process 1\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.215692Z 0 [Warning] [MY-013242] [Server] --character-set-server: 'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.217817Z 0 [Warning] [MY-010068] [Server] CA certificate /etc/mysql/ssl-internal/ca.crt is self signed.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.217854Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.217866Z 0 [Note] [MY-000000] [WSREP] New joining cluster node configured to use specified SSL artifacts\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.217899Z 0 [Note] [MY-000000] [Galera] Loading provider /usr/lib64/galera4/libgalera_smm.so initial position: 9d49aa78-10b4-11ee-accf-4b4f95de788c:50379049\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.217917Z 0 [Note] [MY-000000] [Galera] wsrep_load(): loading provider library '/usr/lib64/galera4/libgalera_smm.so'\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.218505Z 0 [Note] [MY-000000] [Galera] wsrep_load(): Galera 4.12(04bfb95) by Codership Oy <info@codership.com> (modified by Percona <https://percona.com/>) loaded successfully.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.218547Z 0 [Note] [MY-000000] [Galera] CRC-32C: using 64-bit x86 acceleration.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.219254Z 0 [Note] [MY-000000] [Galera] Found saved state: 9d49aa78-10b4-11ee-accf-4b4f95de788c:-1, safe_to_bootstrap: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.219766Z 0 [Note] [MY-000000] [Galera] GCache DEBUG: opened preamble:\nVersion: 2\nUUID: 9d49aa78-10b4-11ee-accf-4b4f95de788c\nSeqno: -1 - -1\nOffset: -1\nSynced: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.219789Z 0 [Note] [MY-000000] [Galera] Recovering GCache ring buffer: version: 2, UUID: 9d49aa78-10b4-11ee-accf-4b4f95de788c, offset: -1\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.219854Z 0 [Note] [MY-000000] [Galera] GCache::RingBuffer initial scan...  0.0% (        0/524288024 bytes) complete.\n","file":"/var/lib/mysql/mysqld-error.log"}

I’m guessing it’s trying to do a crash recovery but failing?

pxc config is set as follows:

    configuration: |
      [mysqld]
      wsrep_provider_options="gcache.size=500M; gcache.recover=yes"
      innodb_buffer_pool_chunk_size=1G
      innodb_buffer_pool_size=12G
      max_connections=180
      innodb_buffer_pool_instances=8
      max-binlog-size=100M
      binlog-expire-logs-seconds=259200
      character-set-server=utf8
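
As an aside, if I’ve read the InnoDB docs correctly, innodb_buffer_pool_size is rounded up to a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances, so 12G with 8 instances of 1G chunks actually gives a 16G buffer pool. A quick way to confirm from inside the pod (pod name is a placeholder; root credentials come from the cluster secrets):

    oc exec -it cluster1-pxc-1 -c pxc -- mysql -uroot -p \
      -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gib;"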

My pods do have requests and limits:

    resources:
      limits:
        cpu: 5
        memory: 25Gi
      requests:
        cpu: 5
        memory: 17Gi

I have tried setting the limit as high as 35Gi in order to allow it to start, but the OOM killer still kicks in soon after the pod starts. I’m reluctant to remove the memory limit entirely, as I have previously seen a pxc pod run away with all of the memory on the worker and cause an outage.
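
For reference, this is how I’ve been confirming that it’s the OOM killer and watching the container memory as it cycles (pod name is a placeholder):

    # Last termination reason should show OOMKilled
    oc describe pod cluster1-pxc-1 | grep -A 5 'Last State'

    # Per-container memory usage while the pod is starting up
    oc adm top pod cluster1-pxc-1 --containers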

I’m not hugely experienced in galera/xtradb. Any help gratefully received.

If the pod isn’t recoverable, what’s the recommended way to recreate it?

I don’t see any evidence of a crash in the logs you provided. Can you show more log details around the crashing?

Apologies, that was incorrect use of terminology on my part. I had assumed it was trying to recover and consuming memory in the process.

The pod gets into a cycle of starting up, consuming all the memory allowed to it, and then being killed by the OOM killer in OpenShift.

I have attached a file showing a few cycles of this happening
pxc-1.txt (112.7 KB)

After some further investigation, it looks like I’m hitting this bug: GCache::RingBuffer initial scan dies at 0.0% · Issue #624 · codership/galera · GitHub

Is the correct procedure here to delete the pod and PVC, which will force an SST?
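
If so, I assume it would look something like the following (resource names are placeholders; the PVC should sit in Terminating until the pod is gone, after which the statefulset recreates both and the new node rejoins via SST):

    oc delete pvc datadir-cluster1-pxc-1
    oc delete pod cluster1-pxc-1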

Thanks

The more elegant approach is to run a debug pod against the failing pxc pod and delete the galera.cache file. When the container restarts, an SST will be initiated.
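
A rough sketch of that approach, assuming the debug copy can mount the same PVC (all names are placeholders):

    # oc debug creates a copy of the pod, including its volumes, with a shell
    # as the entrypoint; with an RWO PVC the copy may need to land on the
    # same node as the original.
    oc debug pod/cluster1-pxc-1 -c pxc

    # Inside the debug shell, remove the GCache ring buffer so the next
    # start does not attempt gcache.recover on it.
    rm /var/lib/mysql/galera.cache
    exit

    # Delete the original pod so the statefulset restarts it; the node
    # should then rejoin the cluster via SST.
    oc delete pod cluster1-pxc-1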