Issue with /srv volume

Hi,

I’m trying to use PMM launched from the AMI, but I keep running into issues with it.

We started with an m4.2xlarge instance and added 50 MySQL instances (linux:metrics, mysql:metrics, mysql:queries via slow log) plus 4 ProxySQL instances for monitoring. Note: I added LimitNOFILE=65536 to the prometheus service to get rid of the “Too many open files” error.
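
In case it helps anyone, here is roughly how I applied that, as a systemd drop-in (a minimal sketch; I’m assuming the unit is named prometheus.service on the AMI, so adjust the name to whatever your instance actually uses):

# Raise the open-file limit for Prometheus via a systemd drop-in
# (unit name prometheus.service is an assumption; adjust to your setup)
mkdir -p /etc/systemd/system/prometheus.service.d
cat > /etc/systemd/system/prometheus.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
systemctl daemon-reload
systemctl restart prometheus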

The issue: on the second day I noticed the PMM web UI had become unresponsive. It appears the load average is too high:

#uptime
14:01:33 up 1 day, 5:23, 1 user, load average: 24.00, 23.94, 23.39

However, atop shows the CPUs are idle (see the attached screenshot).

dmesg shows:
[105465.363093] XFS (dm-4): metadata I/O error: block 0x23776f0 (“xfs_buf_iodone_callback_error”) error 5 numblks 8
[105466.633074] XFS: Failing async write: 2984 callbacks suppressed
[105466.635627] XFS (dm-4): Failing async write on buffer block 0x23776f0. Retrying async write.
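
To confirm which LV is behind dm-4 (presumably the DataVG-DataLV volume mounted on /srv, but worth checking), something like this should do it:

# Map the dm-4 name from dmesg back to an LVM volume
ls -l /dev/mapper     # the DataVG-DataLV symlink should point at a dm-N node
dmsetup info -c       # lists device-mapper targets with their minor numbers
lsblk                 # shows the block-device tree, including thin pool layers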

I see the disks are not full:

df -hT

Filesystem                Type      Size  Used Avail Use% Mounted on
/dev/xvda1                xfs       128G  2.7G  126G   3% /
devtmpfs                  devtmpfs   16G     0   16G   0% /dev
tmpfs                     tmpfs      16G     0   16G   0% /dev/shm
tmpfs                     tmpfs      16G  649M   15G   5% /run
tmpfs                     tmpfs      16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/DataVG-DataLV xfs       205G   28G  178G  14% /srv
tmpfs                     tmpfs     3.2G     0  3.2G   0% /run/user/0
tmpfs                     tmpfs     3.2G     0  3.2G   0% /run/user/1001

df -h -i

Filesystem                Inodes IUsed IFree IUse% Mounted on
/dev/xvda1                  128M   55K  128M    1% /
devtmpfs                    4.0M   338  4.0M    1% /dev
tmpfs                       4.0M     1  4.0M    1% /dev/shm
tmpfs                       4.0M   397  4.0M    1% /run
tmpfs                       4.0M    16  4.0M    1% /sys/fs/cgroup
/dev/mapper/DataVG-DataLV   205M  525K  205M    1% /srv
tmpfs                       4.0M     1  4.0M    1% /run/user/0

I tried to reboot the server, but it got stuck. My admins rebooted it via the AWS Console, but after the reboot the LVM volume backing /srv had disappeared:

df -hT

Filesystem Type      Size  Used Avail Use% Mounted on
/dev/xvda1 xfs       128G  2.9G  126G   3% /
devtmpfs   devtmpfs   16G     0   16G   0% /dev
tmpfs      tmpfs      16G   17M   16G   1% /run
tmpfs      tmpfs      16G     0   16G   0% /dev/shm
tmpfs      tmpfs      16G     0   16G   0% /sys/fs/cgroup
tmpfs      tmpfs     3.2G     0  3.2G   0% /run/user/1001

I must say this is the second time we’ve had the /srv volume disappear after a reboot under high load (I wasn’t around the first time, so I didn’t collect any data for this post). We also lost /srv when moving from a t2 to an m4 instance, and when moving to another VPC via the image clone feature.
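
For reference, this is roughly how I’d check whether the LV still exists after such a reboot and try to bring it back (just a sketch, using the VG/LV names from the df output above):

# Check whether LVM still knows about the volume group and logical volume
pvs
vgs
lvs -a
# If DataVG/DataLV is listed but inactive, activate it and remount /srv
vgchange -ay DataVG
mount /dev/mapper/DataVG-DataLV /srv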

My admins decided to launch a new instance with a single volume:

df -hT

Filesystem Type      Size  Used Avail Use% Mounted on
/dev/xvda1 xfs       512G   25G  488G   5% /
devtmpfs   devtmpfs   16G     0   16G   0% /dev
tmpfs      tmpfs      16G     0   16G   0% /dev/shm
tmpfs      tmpfs      16G  137M   16G   1% /run
tmpfs      tmpfs      16G     0   16G   0% /sys/fs/cgroup
tmpfs      tmpfs     3.2G     0  3.2G   0% /run/user/1001

But something went wrong; at the very least the MySQL setup is incomplete: the MySQL root password wasn’t set, so I found the temporary one in /var/log/mysql.log and set the root password to the one from /root/.my.cnf. Next, I found there is no orchestrator database or user, no ‘percona’@‘localhost’ user, and so on.
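
In case anyone else hits the same half-configured state, this is roughly what I did to recover root access (the log path and passwords are specific to my instance):

# Find the auto-generated temporary root password (MySQL 5.7 writes it to the log)
grep 'temporary password' /var/log/mysql.log
# Log in with the temporary password and set the one expected by /root/.my.cnf
mysql -u root -p
mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY '<password from /root/.my.cnf>';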

This has happened to my PMM instance on EC2 a few times now as well. Something seems pretty wrong when the instance can only run for about a week or two before it gets into this broken state.

I had the same issue with two PMM instances using the Marketplace AMI. After approximately a week to 10 days, each PMM instance would become unresponsive. I found that the /srv volume is thin-provisioned in LVM and very quickly runs out of metadata space, which effectively makes the /srv volume unwritable. You can check the metadata space usage with lvs -a.
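
To illustrate, something along these lines shows the pool’s data and metadata usage and, if the VG has free space, grows the metadata area (the pool LV name below is a placeholder; use whatever lvs -a reports on your instance):

# Show thin pool usage, including the Data% and Meta% columns
lvs -a -o lv_name,vg_name,lv_size,data_percent,metadata_percent
# Grow the pool's metadata area if the volume group has free extents
# (DataVG/ThinDataLV is a placeholder for the actual thin pool LV name)
lvextend --poolmetadatasize +256M DataVG/ThinDataLV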