Occasional DB crashes in PXC 8.0.32 around "remote_endpoint: Transport endpoint is not connected"

Hi Folks,

A client of ours has been experiencing periodic crashes of nodes in a 3-node cluster running PXC 8.0.32.

Here’s a snippet of the sort of logging that precedes and encompasses the crash:

2023-09-11T19:04:44.589657Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-11T19:19:25.209610Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-11T19:49:14.431043Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-11T20:04:48.907048Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-11T20:19:25.345296Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-11T20:34:40.575594Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-11T21:04:40.747935Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-11T21:34:39.281373Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
terminate called after throwing an instance of 'std::system_error'
  what():  remote_endpoint: Transport endpoint is not connected
2023-09-11T22:04:55.185559Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation
2023-09-11T22:04:55Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=df9f6877fc91c9a71d439f27569eabdef408f622
Server Version: 8.0.32-24.2 Percona XtraDB Cluster (GPL), Release rel24, Revision 2119e75, WSREP version 26.1.4.3, wsrep_26.1.4.3

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x2253a31]
/usr/sbin/mysqld(print_fatal_signal(int)+0x39f) [0x1262d0f]
/usr/sbin/mysqld(handle_fatal_signal+0xd8) [0x1262df8]
/lib64/libpthread.so.0(+0x12cf0) [0x7f6c0ee8bcf0]
/lib64/libc.so.6(gsignal+0x10f) [0x7f6c0d23aacf]
/lib64/libc.so.6(abort+0x127) [0x7f6c0d20dea5]
/lib64/libstdc++.so.6(+0x9009b) [0x7f6c0dbdb09b]
/lib64/libstdc++.so.6(+0x9653c) [0x7f6c0dbe153c]
/lib64/libstdc++.so.6(+0x96597) [0x7f6c0dbe1597]
/lib64/libstdc++.so.6(+0x967f8) [0x7f6c0dbe17f8]
/usr/lib64/galera4/libgalera_smm.so(+0x922cf) [0x7f6c005e62cf]
/usr/lib64/galera4/libgalera_smm.so(+0x92d7c) [0x7f6c005e6d7c]
/usr/lib64/galera4/libgalera_smm.so(+0xa6885) [0x7f6c005fa885]
/usr/lib64/galera4/libgalera_smm.so(+0xb3c98) [0x7f6c00607c98]
/usr/lib64/galera4/libgalera_smm.so(+0x8e400) [0x7f6c005e2400]
/usr/lib64/galera4/libgalera_smm.so(+0x8e6b3) [0x7f6c005e26b3]
/usr/lib64/galera4/libgalera_smm.so(+0x1c15ae) [0x7f6c007155ae]
/usr/lib64/galera4/libgalera_smm.so(+0x1c16d6) [0x7f6c007156d6]
/lib64/libpthread.so.0(+0x81ca) [0x7f6c0ee811ca]
/lib64/libc.so.6(clone+0x43) [0x7f6c0d225e73]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.

Are the preceding warnings a clue to what’s going on, and does anyone have any ideas about what might be causing the crash? I believe the issue was also occurring in their previous PXC 8.0.29 environment.

many thanks,

Neil

I should add that I don’t believe the node was doing any SST activity at the time of the crash (either as joiner or donor) so I’m puzzled about the “Initiating SST cancellation” note.

Hey @NeilBillett,
Are you by chance doing any port scanning, intrusion detection, or otherwise blasting the node with UDP traffic? There is a JIRA ticket on this, [PXC-4167] Node crashes with Transport endpoint is not connected - Percona JIRA, but it says resolved in 8.0.29. There’s also a recent forum post from another user with a problem similar to yours. I’ve pinged the lead PXC developer to see if there’s something more. Please follow that JIRA ticket in case there are updates.
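
One way to check the port-scanning theory from the log alone is to look at the spacing of the warnings. In the first excerpt above they appear to land roughly 15 or 30 minutes apart, which would be consistent with (though not proof of) a scheduled scanner or monitoring probe. Here is a minimal Python sketch. The function name `warning_gaps_minutes` and the regular expression are illustrative assumptions, not anything that ships with PXC or Galera:

```python
import re
from datetime import datetime

# Matches the Galera warning quoted in this thread and captures the timestamp
# down to the second (fractional seconds and the zone suffix are ignored).
WARNING_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\S*\s.*Failed to accept: remote_endpoint"
)


def warning_gaps_minutes(log_lines):
    """Return the gaps, rounded to whole minutes, between consecutive warnings."""
    stamps = []
    for line in log_lines:
        match = WARNING_RE.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S"))
    return [
        round((later - earlier).total_seconds() / 60)
        for earlier, later in zip(stamps, stamps[1:])
    ]


# First four warnings from the log excerpt in the original post.
SAMPLE = """
2023-09-11T19:04:44.589657Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-11T19:19:25.209610Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-11T19:49:14.431043Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-11T20:04:48.907048Z 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
"""

if __name__ == "__main__":
    print(warning_gaps_minutes(SAMPLE.splitlines()))  # -> [15, 30, 16]
```

The function takes any iterable of lines, so an open file handle for the node's error log can be passed in directly. Regular gaps would point towards something scheduled on the network; irregular gaps would not rule out a scanner.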

Thank you as always Matthew.

That may well be happening - we’ve asked our client to confirm.

Would you mind posting the link for the other forum post you mentioned?

thanks

Neil

Hi Neil,
The other post: Percona xtradb cluster 8.0.32 nodes are getting crashed

Our lead PXC dev pushed a small patch: PXC-4167: Node crashes with Transport endpoint is not connected by kamil-holubicki · Pull Request #270 · percona/galera · GitHub. If you’re brave, feel free to apply the patch and compile PXC yourself to see if it fixes the issue. Our team does not have a reproducible test case, so this patch is a best-effort guess at where the problem could be.

Otherwise, you’ll have to wait for the next release of PXC which could be as far as 30 days away. If you need something sooner, a support contract can get you a quicker build.

Thank you Matthew - that’s brilliant - we’ll take a look.

I think just knowing there is a fix in the pipeline might be enough for our client, but if we can get something built I’ll report back with any findings.

best wishes,

Neil

Hi @matthewb

If it’s helpful to know, I’ve been able to replicate the crash with a two-node cluster on PXC 8.0.32 using nmap’s TCP connect scan against port 4567 on both nodes.

I’ve got our application connected to <node1> and if I leave this running from a third host:

while true; do nmap -T2 -sT <node1> -p4567; nmap -T2 -sT <node2> -p4567; done

…I see a lot of these in both node logs as the commands loop:

[Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected

…and after some time (usually minutes) <node1> falls over, e.g.:

2023-09-21T16:40:43.106276+01:00 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-21T16:40:44.811107+01:00 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
2023-09-21T16:40:46.527115+01:00 0 [Warning] [MY-000000] [Galera] Failed to accept: remote_endpoint: Transport endpoint is not connected
terminate called after throwing an instance of 'std::system_error'
  what():  remote_endpoint: Transport endpoint is not connected
2023-09-21T16:40:48.235578+01:00 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation
2023-09-21T15:40:48Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=df9f6877fc91c9a71d439f27569eabdef408f622
Server Version: 8.0.32-24.2 Percona XtraDB Cluster (GPL), Release rel24, Revision 2119e75, WSREP version 26.1.4.3, wsrep_26.1.4.3

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x80000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x2253a31]
/usr/sbin/mysqld(print_fatal_signal(int)+0x39f) [0x1262d0f]
/usr/sbin/mysqld(handle_fatal_signal+0xd8) [0x1262df8]
/lib64/libpthread.so.0(+0x12cf0) [0x7f7a1e6f9cf0]
/lib64/libc.so.6(gsignal+0x10f) [0x7f7a1caa7aff]
/lib64/libc.so.6(abort+0x127) [0x7f7a1ca7aea5]
/lib64/libstdc++.so.6(+0x9009b) [0x7f7a1d44909b]
/lib64/libstdc++.so.6(+0x9653c) [0x7f7a1d44f53c]
/lib64/libstdc++.so.6(+0x96597) [0x7f7a1d44f597]
/lib64/libstdc++.so.6(+0x967f8) [0x7f7a1d44f7f8]
/usr/lib64/galera4/libgalera_smm.so(+0x922cf) [0x7f7a0f5872cf]
/usr/lib64/galera4/libgalera_smm.so(+0x92d7c) [0x7f7a0f587d7c]
/usr/lib64/galera4/libgalera_smm.so(+0xa6885) [0x7f7a0f59b885]
/usr/lib64/galera4/libgalera_smm.so(+0xb3c98) [0x7f7a0f5a8c98]
/usr/lib64/galera4/libgalera_smm.so(+0x8e400) [0x7f7a0f583400]
/usr/lib64/galera4/libgalera_smm.so(+0x8e6b3) [0x7f7a0f5836b3]
/usr/lib64/galera4/libgalera_smm.so(+0x1c15ae) [0x7f7a0f6b65ae]
/usr/lib64/galera4/libgalera_smm.so(+0x1c16d6) [0x7f7a0f6b66d6]
/lib64/libpthread.so.0(+0x81ca) [0x7f7a1e6ef1ca]
/lib64/libc.so.6(clone+0x43) [0x7f7a1ca92e73]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.

Hopefully it’s helpful for your development team.

thanks,

Neil

Hi Neil,
Please add all of that to the JIRA ticket mentioned above. Excellent job on creating a working test case!
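
For anyone who wants to retry the reproduction above without nmap, the connect-and-close pattern of a TCP connect scan can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the helper names `connect_probe` and `probe_loop`, the `reset` option (an abortive close via `SO_LINGER`), and the delay value are all hypothetical choices, and nobody in this thread has confirmed that this script triggers the crash the way the nmap loop does. It should only ever be pointed at a disposable test cluster.

```python
import socket
import struct
import time


def connect_probe(host, port, timeout=2.0, reset=True):
    """Complete a TCP handshake, then drop the connection straight away.

    Returns True if the port accepted the connection, False otherwise.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False
    try:
        if reset:
            # l_onoff=1, l_linger=0 makes close() send RST instead of FIN.
            # Assumption: an abrupt disconnect is what upsets the listener.
            sock.setsockopt(
                socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0)
            )
    finally:
        sock.close()
    return True


def probe_loop(hosts, port=4567, rounds=10, delay=0.5):
    """Probe each host in turn and return how many connections were accepted."""
    accepted = 0
    for _ in range(rounds):
        for host in hosts:
            if connect_probe(host, port):
                accepted += 1
            time.sleep(delay)
    return accepted
```

A call such as `probe_loop(["<node1>", "<node2>"], rounds=1000)`, with the placeholders replaced by the test nodes' addresses, would stand in for the `while true` loop. Whether the warnings and the eventual abort appear should be checked in each node's error log.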