Got an error reading / writing communication packets

Hi,

I have a 5-node XtraDB Cluster (4 nodes + 1 arbitrator).
After upgrading to the latest version I see a lot of errors like these in the error log:

Aborted connection 1952 to db: 'unconnected' user: 'root' host: 'localhost' (Got an error writing communication packets)
Aborted connection 1980 to db: 'unconnected' user: 'root' host: 'localhost' (Got an error reading communication packets)
wsrep: failed to report last committed -110 (timeout)

I cannot find any network-related issue, and all nodes are on the same subnet with very low latency.

Any ideas on what to check?
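
For anyone who lands here with the same messages: "Got an error reading/writing communication packets" on an aborted connection usually points at client-side timeouts or packet-size limits rather than the wire itself. A hedged sketch of the server variables typically involved (the values shown are the MySQL 5.7 defaults, for illustration only, not recommendations; check the running values with SHOW GLOBAL VARIABLES first):

```ini
# my.cnf fragment: variables commonly behind
# "Aborted connection ... (Got an error reading|writing communication packets)".
# Values shown are the MySQL 5.7 defaults, for illustration only.
[mysqld]
max_allowed_packet = 4M      # packets larger than this abort the connection
net_read_timeout   = 30      # seconds to wait while reading from a client
net_write_timeout  = 60      # seconds to wait while writing to a client
wait_timeout       = 28800   # idle connection timeout (seconds)
```

Watching SHOW GLOBAL STATUS LIKE 'Aborted_c%' over time shows whether the counters are still climbing; a rising Aborted_clients with no application-side errors often just means clients disconnect without a clean COM_QUIT (e.g. monitoring agents or health checks).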

This is the pt-summary output:

Percona Toolkit System Summary Report

Date | 2019-05-31 14:08:45 UTC (local TZ: CEST +0200)
Hostname | DBEasyUnix
Uptime | 20 days, 3:38, 1 user, load average: 2,46, 3,06, 3,04
System | Microsoft Corporation; Virtual Machine; v7.0 (Desktop)
Service Tag | 1595-4263-5740-0369-2475-8856-28
Platform | Linux
Release | Debian GNU/Linux 8.11 (jessie) (jessie)
Kernel | 3.16.0-8-amd64
Architecture | CPU = 64-bit, OS = 64-bit
Threading | NPTL 2.19
SELinux | No SELinux detected
Virtualized | No virtualization detected

Processor

Processors | physical = 1, cores = 8, virtual = 8, hyperthreading = no
Speeds | 8x2099.978
Models | 8xIntel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
Caches | 8x15360 KB

Memory

Total | 14.7G
Free | 441.3M
Used | physical = 14.3G, swap allocated = 14.0G, swap used = 49.5M, virtual = 14.3G
Shared | 19.8M
Buffers | 300.3M
Caches | 8.6G
Dirty | 9052 kB
UsedRSS | 5.8G
Swappiness | 60
DirtyPolicy | 20, 10
DirtyStatus | 0, 0
Locator Size Speed Form Factor Type Type Detail
========= ======== ================= ============= ============= ===========
M1 11392 MB Unknown Unknown Other Unknown
M0 3968 MB Unknown Unknown Other Unknown

Mounted Filesystems

Filesystem Size Used Type Opts Mountpoint
/dev/sda1 178G 41% ext4 rw,relatime,errors=remount-ro,data=ordered /
/dev/sdc1 13T 66% ext4 rw,noatime,nodiratime,errors=remount-ro,data=ordered /var/lib/mysql
tmpfs 1,5G 0% tmpfs rw,nosuid,nodev,relatime,size=1542832k,mode=700 /run/user/0
tmpfs 3,0G 1% tmpfs rw,nosuid,relatime,size=3085664k,mode=755 /run
tmpfs 5,0M 0% tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k /run/lock
tmpfs 7,4G 0% tmpfs rw,nosuid,nodev /dev/shm
tmpfs 7,4G 0% tmpfs ro,nosuid,nodev,noexec,mode=755 /sys/fs/cgroup
udev 10M 0% devtmpfs rw,relatime,size=10240k,nr_inodes=1926397,mode=755 /dev

Disk Schedulers And Queue Size

sda | [cfq] 128
sdb | [cfq] 128
sdc | [cfq] 128

Disk Partitioning

Device Type Start End Size
============ ==== ========== ========== ==================
/dev/sda Disk 193273528320
/dev/sda1 Part 2048 377485311 0
/dev/sdb Disk 15032385536
/dev/sdb1 Part 2048 29358079 0
/dev/sdc Disk 14293651161088
/dev/sdc1 Part 2048 27917285375 0

Kernel Inode State

dentry-state | 31528 19480 45 0 0 0
file-nr | 1984 0 1540893
inode-nr | 86673 56832

LVM Volumes

Unable to collect information

LVM Volume Groups

Unable to collect information

RAID Controller

Controller | No RAID controller detected

Network Config

FIN Timeout | 60
Port Range | 61000

Interface Statistics

interface rx_bytes rx_packets rx_errors tx_bytes tx_packets tx_errors
========= ========= ========== ========== ========== ========== ==========
lo 1250000 10000 0 1250000 10000 0
eth0 500000000000 500000000 0 60000000000 250000000 0
eth1 300000000000 900000000 0 9000000000000 400000000 0

Network Connections

Connections from remote IP addresses
10.0.0.38 2
10.0.0.80 3
10.0.0.81 1
10.0.0.82 2
192.168.3.39 10
192.168.3.222 5
Connections to local IP addresses
10.0.0.37 8
192.168.3.37 15
Connections to top 10 local ports
3306 10
42002 1
4567 3
50430 1
50506 1
51597 1
57345 1
59512 1
59514 1
60933 1
States of connections
ESTABLISHED 25
LISTEN 20
TIME_WAIT 1

Top Processes

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8637 root 20 0 41196 13312 3476 S 37,2 0,1 567:50.08 node_expor+
8680 root 20 0 53968 29504 4024 S 37,2 0,2 947:07.50 mysqld_exp+
33 root 20 0 0 0 0 S 18,6 0,0 274:13.03 ksoftirqd/5
25143 mysql 20 0 66,682g 5,536g 832760 S 18,6 37,6 32:53.22 mysqld
3 root 20 0 0 0 0 S 12,4 0,0 339:51.83 ksoftirqd/0
43 root 20 0 0 0 0 S 12,4 0,0 326:40.74 ksoftirqd/7
41817 root 20 0 25616 2900 2468 R 6,2 0,0 0:00.01 top
1 root 20 0 28976 4968 2908 S 0,0 0,0 0:30.09 systemd
2 root 20 0 0 0 0 S 0,0 0,0 0:00.22 kthreadd

Notable Processes

PID OOM COMMAND
691 -17 sshd

Simplified and fuzzy rounded vmstat (wait please)

procs ---swap-- -----io---- ---system---- --------cpu--------
r b si so bi bo ir cs us sy il wa st
1 0 0 0 700 225 2 2 5 2 89 4 0
0 0 0 0 1500 2500 3000 7000 11 5 75 9 0
1 1 150 0 1000 3500 2000 4000 8 2 85 5 0
0 0 0 0 700 1500 2250 4500 10 2 87 1 0
3 0 0 0 0 2500 1500 3500 6 1 89 4 0

Memory management

Transparent huge pages are enabled.

The End

I have the same problem. It happened after the update to 5.7.25, and the latest version still has the issue.

Did you ever find a fix for this, inoma?

I have no idea what fixed this; I haven't made any config changes. What happened was that one node failed, so I made an image, terminated the EC2 instance, and restored the AMI to a new EC2 machine. The errors were gone on that node. Then I went ahead and killed another node to see if it showed the same behavior, and it did.

Long story short: I terminated and rebuilt all my nodes without making any Percona changes, and the error magically went away.

No idea what caused this and no idea what fixed it. I feel left in the dark.