One of the pxc pods is getting OOM-killed on startup (and, I assume, during recovery).

OpenShift 4.12 (Kubernetes 1.25)
Operator version is 1.12 (we’re about to upgrade to 1.13)

{"log":"2023-11-20T17:22:01.850635Z 0 [Note] [MY-012487] [InnoDB] DDL log recovery : begin\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:01.850973Z 0 [Note] [MY-012488] [InnoDB] DDL log recovery : end\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:01.854652Z 0 [Note] [MY-012922] [InnoDB] Waiting for purge to start\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:01.905220Z 0 [Note] [MY-000000] [WSREP] Recovered position: 9d49aa78-10b4-11ee-accf-4b4f95de788c:50379049\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:01.906678Z 0 [Note] [MY-012330] [InnoDB] FTS optimize thread exiting.\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:02.793637Z 0 [Note] [MY-010120] [Server] Binlog end\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}

It goes through shutting down a load of plugins, and then I get:

{"log":"2023-11-20T17:22:02.795135Z 0 [Note] [MY-013072] [InnoDB] Starting shutdown...\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:02.803293Z 0 [Note] [MY-013084] [InnoDB] Log background threads are being closed...\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640393Z 0 [Note] [MY-012980] [InnoDB] Shutdown completed; log sequence number 459786086280\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640599Z 0 [Note] [MY-012255] [InnoDB] Removed temporary tablespace data file: \"ibtmp1\"\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640632Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'MEMORY'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640649Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'CSV'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640663Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'PERFORMANCE_SCHEMA'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640724Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'wsrep'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640735Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'daemon_keyring_proxy_plugin'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640755Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'sha2_cache_cleaner'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640767Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'caching_sha2_password'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640782Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'sha256_password'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.640792Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'mysql_native_password'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.641581Z 0 [Note] [MY-010733] [Server] Shutting down plugin 'binlog'\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:03.641984Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.29-21.1)  Percona XtraDB Cluster (GPL), Release rel21, Revision 250bc93, WSREP version 26.4.3.\n","file":"/var/lib/mysql/wsrep_recovery_verbose.log"}
{"log":"2023-11-20T17:22:15.212552Z 0 [Warning] [MY-011068] [Server] The syntax 'wsrep_slave_threads' is deprecated and will be removed in a future release. Please use wsrep_applier_threads instead.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.214125Z 0 [Warning] [MY-010097] [Server] Insecure configuration for --secure-log-path: Current value does not restrict location of generated files. Consider setting it to a valid, non-empty path.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.214653Z 0 [Warning] [MY-010918] [Server] 'default_authentication_plugin' is deprecated and will be removed in a future release. Please use authentication_policy instead.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.214667Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.29-21.1) starting as process 1\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.215692Z 0 [Warning] [MY-013242] [Server] --character-set-server: 'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.217817Z 0 [Warning] [MY-010068] [Server] CA certificate /etc/mysql/ssl-internal/ca.crt is self signed.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.217854Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.217866Z 0 [Note] [MY-000000] [WSREP] New joining cluster node configured to use specified SSL artifacts\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.217899Z 0 [Note] [MY-000000] [Galera] Loading provider /usr/lib64/galera4/libgalera_smm.so initial position: 9d49aa78-10b4-11ee-accf-4b4f95de788c:50379049\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.217917Z 0 [Note] [MY-000000] [Galera] wsrep_load(): loading provider library '/usr/lib64/galera4/libgalera_smm.so'\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.218505Z 0 [Note] [MY-000000] [Galera] wsrep_load(): Galera 4.12(04bfb95) by Codership Oy <info@codership.com> (modified by Percona <https://percona.com/>) loaded successfully.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.218547Z 0 [Note] [MY-000000] [Galera] CRC-32C: using 64-bit x86 acceleration.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.219254Z 0 [Note] [MY-000000] [Galera] Found saved state: 9d49aa78-10b4-11ee-accf-4b4f95de788c:-1, safe_to_bootstrap: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.219766Z 0 [Note] [MY-000000] [Galera] GCache DEBUG: opened preamble:\nVersion: 2\nUUID: 9d49aa78-10b4-11ee-accf-4b4f95de788c\nSeqno: -1 - -1\nOffset: -1\nSynced: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.219789Z 0 [Note] [MY-000000] [Galera] Recovering GCache ring buffer: version: 2, UUID: 9d49aa78-10b4-11ee-accf-4b4f95de788c, offset: -1\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2023-11-20T17:22:15.219854Z 0 [Note] [MY-000000] [Galera] GCache::RingBuffer initial scan...  0.0% (        0/524288024 bytes) complete.\n","file":"/var/lib/mysql/mysqld-error.log"}

I’m guessing it’s trying to do a crash recovery but failing?

pxc config is set as follows:

    configuration: |
      [mysqld]
      wsrep_provider_options="gcache.size=500M; gcache.recover=yes"
      innodb_buffer_pool_chunk_size=1G
      innodb_buffer_pool_size=12G
      max_connections=180
      innodb_buffer_pool_instances=8
      max-binlog-size=100M
      binlog-expire-logs-seconds=259200
      character-set-server=utf8
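
As an aside, if I’ve read the InnoDB docs correctly, innodb_buffer_pool_size is rounded up to a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances, so 12G with 8 instances of 1G chunks actually gives a 16G buffer pool. A quick way to confirm from inside the pod (pod name is a placeholder; root credentials come from the cluster secrets):

    oc exec -it cluster1-pxc-1 -c pxc -- mysql -uroot -p \
      -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gib;"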

My pods do have requests and limits:

    resources:
      limits:
        cpu: 5
        memory: 25Gi
      requests:
        cpu: 5
        memory: 17Gi

I have tried setting the limit as high as 35Gi in order to allow it to start, but the OOM killer still kicks in soon after the pod starts. I’m reluctant to remove the memory limit entirely, as I have previously seen a pxc pod run away with all of the memory on the worker and cause an outage.
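
For reference, this is how I’ve been confirming that it’s the OOM killer and watching the container memory as it cycles (pod name is a placeholder):

    # Last termination reason should show OOMKilled
    oc describe pod cluster1-pxc-1 | grep -A 5 'Last State'

    # Per-container memory usage while the pod is starting up
    oc adm top pod cluster1-pxc-1 --containers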

I’m not hugely experienced in galera/xtradb. Any help gratefully received.

If the pod isn’t recoverable, what’s the recommended way to recreate it?

I don’t see any evidence of a crash in the logs you provided. Can you show more log details around the crashing?

Apologies, that was incorrect use of terminology on my part. I had assumed it was trying to recover and consuming memory in the process.

The pod gets into a cycle of starting up, consuming all the memory allowed to it, and then being killed by the OOM killer in OpenShift.

I have attached a file showing a few cycles of this happening
pxc-1.txt (112.7 KB)

After some further investigation, it looks like I’m hitting this bug: GCache::RingBuffer initial scan dies at 0.0% · Issue #624 · codership/galera · GitHub

Is the correct procedure here to delete the pod and PVC, which will force an SST?
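
If so, I assume it would look something like the following (resource names are placeholders; the PVC should sit in Terminating until the pod is gone, after which the statefulset recreates both and the new node rejoins via SST):

    oc delete pvc datadir-cluster1-pxc-1
    oc delete pod cluster1-pxc-1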

Thanks

The more elegant approach is to run a debug pod against the failing pxc pod and delete the galera.cache file. When the container restarts, an SST will be initiated.
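
A rough sketch of that approach, assuming the debug copy can mount the same PVC (all names are placeholders):

    # oc debug creates a copy of the pod, including its volumes, with a shell
    # as the entrypoint; with an RWO PVC the copy may need to land on the
    # same node as the original.
    oc debug pod/cluster1-pxc-1 -c pxc

    # Inside the debug shell, remove the GCache ring buffer so the next
    # start does not attempt gcache.recover on it.
    rm /var/lib/mysql/galera.cache
    exit

    # Delete the original pod so the statefulset restarts it; the node
    # should then rejoin the cluster via SST.
    oc delete pod cluster1-pxc-1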