Galera Arbitrator (garbd) uses 100% CPU

I’ve just upgraded PXC from 5.7 to 8.0. Galera Arbitrator version is

percona-xtradb-cluster-garbd/unknown,now 1:8.0.33-25-1.focal amd64

and it’s using 100% CPU.

There’s nothing suspicious in the log – apart from 0-nan% (0/0 events):

Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.171  INFO: Flow-control interval: [1048575, 1048575]
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.171  INFO: Shifting OPEN -> PRIMARY (TO: 257099)
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.171  INFO: Sending state transfer request: 'trivial', size: 7
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.203  INFO: Member 0.0 (garb) requested state transfer from '*any*'. Selected 1.0 (nyc1)(SYNCED) as donor.
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.203  INFO: Shifting PRIMARY -> JOINER (TO: 257099)
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.235  INFO: 0.0 (garb): State transfer from 1.0 (nyc1) complete.
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.236  INFO: SST leaving flow control
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.236  INFO: Shifting JOINER -> JOINED (TO: 257099)
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.237  INFO: Processing event queue:...0-nan% (0/0 events) complete.
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.237  INFO: 1.0 (nyc1): State transfer to 0.0 (garb) complete.
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.264  INFO: Member 0.0 (garb) synced with group.
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.265  INFO: Processing event queue:...100.0% (1/1 events) complete.
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.265  INFO: Shifting JOINED -> SYNCED (TO: 257099)
Aug 13 12:38:52 aux garb-systemd[2284]: 2023-08-13 12:38:52.265  INFO: Member 1.0 (nyc1) synced with group.
Aug 13 12:38:54 aux garb-systemd[2284]: 2023-08-13 12:38:54.612  INFO: (5b34fe14-b138, 'ssl://0.0.0.0:4567') turning message relay requesting off

Here is the head of the perf report:

Samples: 1K of event 'cpu-clock:pppH', Event count (approx.): 9770000000
  Children      Self  Command  Shared Object        Symbol
+  100.00%     0.00%  garbd    libc-2.31.so         [.] __clone
+  100.00%     0.00%  garbd    libpthread-2.31.so   [.] start_thread
+   99.69%     0.00%  garbd    garbd                [.] gcs_recv_thread
+   99.69%     0.00%  garbd    garbd                [.] gcs_core_recv
+   99.69%    78.76%  garbd    garbd                [.] gcomm_recv
+   20.01%    19.96%  garbd    garbd                [.] pfs_noop
+    0.92%     0.00%  garbd    [kernel.kallsyms]    [k] irq_exit_rcu
+    0.92%     0.20%  garbd    [kernel.kallsyms]    [k] __softirqentry_text_start
+    0.51%     0.00%  garbd    [kernel.kallsyms]    [k] asm_common_interrupt
+    0.51%     0.00%  garbd    [kernel.kallsyms]    [k] common_interrupt
     0.46%     0.46%  garbd    [kernel.kallsyms]    [k] __lock_text_start
     0.46%     0.00%  garbd    [kernel.kallsyms]    [k] asm_sysvec_apic_timer_interrupt
     0.46%     0.00%  garbd    [kernel.kallsyms]    [k] sysvec_apic_timer_interrupt
     0.46%     0.00%  garbd    [kernel.kallsyms]    [k] run_timer_softirq
     0.41%     0.00%  garbd    [kernel.kallsyms]    [k] call_timer_fn
     0.41%     0.00%  garbd    [kernel.kallsyms]    [k] rh_timer_func
     0.41%     0.00%  garbd    [kernel.kallsyms]    [k] usb_hcd_poll_rh_status
     0.41%     0.00%  garbd    [kernel.kallsyms]    [k] uhci_hub_status_data
     0.31%     0.00%  garbd    garbd                [.] run_fn
     0.31%     0.00%  garbd    garbd                [.] GCommConn::run
     0.31%     0.00%  garbd    garbd                [.] gcomm::AsioProtonet::event_loop
     0.31%     0.00%  garbd    garbd                [.] gu::AsioIoService::run
     0.31%     0.00%  garbd    garbd                [.] asio::detail::scheduler::run
     0.26%     0.00%  garbd    [kernel.kallsyms]    [k] net_rx_action
     0.26%     0.00%  garbd    [kernel.kallsyms]    [k] __napi_poll
     0.26%     0.00%  garbd    [kernel.kallsyms]    [k] virtnet_poll
     0.15%     0.05%  garbd    garbd                [.] asio::detail::epoll_reactor::descriptor_state::do_complete
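
In case it helps with reproducing this, the profile above was collected roughly as follows (attaching to the already running process; the 30-second duration is arbitrary):

# sample call graphs of the running garbd process for ~30 seconds
perf record -g --pid $(pidof garbd) -- sleep 30
# print the call tree; only its head is shown above
perf report --children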

This seems like a bug in garbd.
Where should I report it?

Hello @Big_Boss,
If you have a repeatable test case and/or more specific steps, including all commands executed, you can file a bug report at https://jira.percona.com/. You will need all of that information for our engineers to verify and reproduce the bug.

I just performed a fresh install, converting a 3-node XtraDB 8.0 cluster into a 2-node XtraDB 8.0 cluster plus garbd.

I’m also experiencing CPU saturation running the latest version percona-xtradb-cluster-garbd.x86_64 8.0.33-25.1 under Rocky Linux 8. There is nothing remarkable about my configuration apart from having SSL enabled.
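
For reference, the arbitrator is configured with something along these lines (file location and variable names as I understand the Rocky/RHEL packaging; the group name, peer addresses, and certificate paths are placeholders):

# /etc/sysconfig/garb – cluster name, peer list, and SSL provider options for garbd
GALERA_GROUP="my_pxc_cluster"
GALERA_NODES="10.0.0.1:4567 10.0.0.2:4567"
GALERA_OPTIONS="socket.ssl=YES;socket.ssl_key=/etc/ssl/garbd/server-key.pem;socket.ssl_cert=/etc/ssl/garbd/server-cert.pem;socket.ssl_ca=/etc/ssl/garbd/ca.pem"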

That is very odd. Can you clarify your reasoning for this? The recommendation and best practice is always to use actual nodes instead of garbd. I'm curious about your reasons here.

I’ve just filed the bug report – [PXC-4288] Galera Arbitrator (garbd) uses 100% CPU - Percona JIRA.

Hi @Big_Boss,

If you use Galera Arbitrator (garbd), we recommend that you do not upgrade to 8.0.33 because garbd-8.0.33 may cause synchronization issues and extensive usage of CPU resources.

Source: Percona XtraDB Cluster 8.0.33-25 Update (2023-08-25) - Percona XtraDB Cluster

The workaround is to downgrade garbd to 8.0.32 until 8.0.34, which fixes the issue, is released.
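
Assuming a Debian/Ubuntu install like the original poster's, the downgrade and pin could look roughly like this (replace <8.0.32-version> with the exact string that apt-cache policy reports for your repository):

# find the exact 8.0.32 package version available in the Percona repository
apt-cache policy percona-xtradb-cluster-garbd
# downgrade to it and hold it so it is not upgraded back to 8.0.33
sudo apt-get install percona-xtradb-cluster-garbd=<8.0.32-version>
sudo apt-mark hold percona-xtradb-cluster-garbd

On Rocky/RHEL the rough equivalent would be dnf downgrade 'percona-xtradb-cluster-garbd-8.0.32*' followed by dnf versionlock add percona-xtradb-cluster-garbd (the latter requires the versionlock plugin).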

Regards
