Getting error - accept4: too many open files; retrying in x ms

I’m going to try my best to explain this situation:

I have 6 physical servers, and 3 of them were recently upgraded; they now run Ubuntu 18.04 (formerly 14.04). These servers (vhost-01, vhost-02, vhost-03, vhost-04, vhost-05, vhost-06) have VirtualBox installed, and the VMs do minor things (internal mail servers, NTP servers, nameservers, etc.). We recently had a network outage, so I decided to add them to our Grafana dashboards so we would be notified if something went wrong.
Everything was going fine until a couple of weeks after installing pmm-admin, when every VM on a single vhost started alerting for high CPU usage. It was odd for these servers to show high CPU usage, and even stranger that all of the alerting VMs belonged to a single physical host. This only happens on vhost-01, vhost-02, and vhost-03, the servers that received the OS upgrade; two of the other vhosts were already running Ubuntu 18.04, yet they are not affected by this problem.

After troubleshooting for a while, I noticed that on each individual VM on these particular physical hosts (vhost-01, vhost-02, vhost-03), the node_exporter process was maxing out the CPU usage. I checked the pmm metrics log and was presented with this, over and over and over:

time="2020-01-17T07:49:19-06:00" level=error msg="ERROR: filefd collector failed after 0.000559s: couldn't get file-nr: open /proc/sys/fs/file-nr: too many open files" source="node_exporter.go:97"
time="2020-01-17T07:49:19-06:00" level=error msg="ERROR: filesystem collector failed after 0.000357s: open /proc/mounts: too many open files" source="node_exporter.go:97"
time="2020-01-17T07:49:19-06:00" level=error msg="ERROR: loadavg collector failed after 0.000683s: couldn't get load: open /proc/loadavg: too many open files" source="node_exporter.go:97"
time="2020-01-17T07:49:19-06:00" level=error msg="ERROR: meminfo collector failed after 0.000398s: couldn't get meminfo: open /proc/meminfo: too many open files" source="node_exporter.go:97"
time="2020-01-17T07:49:19-06:00" level=error msg="ERROR: vmstat collector failed after 0.000319s: open /proc/vmstat: too many open files" source="node_exporter.go:97"
time="2020-01-17T07:49:19-06:00" level=error msg="ERROR: diskstats collector failed after 0.000597s: couldn't get diskstats: open /proc/diskstats: too many open files" source="node_exporter.go:97"
time="2020-01-17T07:49:19-06:00" level=error msg="ERROR: netdev collector failed after 0.000360s: couldn't get netstats: open /proc/net/dev: too many open files" source="node_exporter.go:97"
time="2020-01-17T07:49:19-06:00" level=error msg="ERROR: netstat collector failed after 0.000282s: couldn't get netstats: open /proc/net/netstat: too many open files" source="node_exporter.go:97"
time="2020-01-17T07:49:19-06:00" level=error msg="ERROR: stat collector failed after 0.000358s: open /proc/stat: too many open files" source="node_exporter.go:97"
time="2020-01-17T07:49:19-06:00" level=error msg="Error reading textfile collector directory /usr/local/percona/pmm-client/textfile_collector: open /usr/local/percona/pmm-client/textfile_collector: too many open files" source="textfile.go:81"
2020/01/17 07:49:20 http: Accept error: accept tcp 192.168.78.78:42000: accept4: too many open files; retrying in 5ms
2020/01/17 07:49:21 http: Accept error: accept tcp 192.168.78.78:42000: accept4: too many open files; retrying in 10ms
2020/01/17 07:49:21 http: Accept error: accept tcp 192.168.78.78:42000: accept4: too many open files; retrying in 20ms
2020/01/17 07:49:22 http: Accept error: accept tcp 192.168.78.78:42000: accept4: too many open files; retrying in 40ms
2020/01/17 07:49:22 http: Accept error: accept tcp 192.168.78.78:42000: accept4: too many open files; retrying in 80ms
2020/01/17 07:49:22 http: Accept error: accept tcp 192.168.78.78:42000: accept4: too many open files; retrying in 160ms
2020/01/17 07:49:23 http: Accept error: accept tcp 192.168.78.78:42000: accept4: too many open files; retrying in 320ms
2020/01/17 07:49:23 http: Accept error: accept tcp 192.168.78.78:42000: accept4: too many open files; retrying in 640ms
2020/01/17 07:49:24 http: Accept error: accept tcp 192.168.78.78:42000: accept4: too many open files; retrying in 1s
2020/01/17 07:49:26 http: Accept error: accept tcp 192.168.78.78:42000: accept4: too many open files; retrying in 1s

This has been happening off and on for weeks now, and I cannot seem to pinpoint what is causing it. It is always isolated to a single physical vhost at a time, but it happens at the same time on every virtual machine that is present on that vhost.

For example: if mail_server_01, nameserver_01, and NTP_server all live on the physical server vhost-01, then those three virtual machines will all show this error within minutes of each other. I then have to stop linux:metrics and pause the alert, because it does not stop on its own.

I thought raising the file limit for the node_exporter process might help, so I ran `prlimit --pid $(pidof node_exporter) --nofile=4096`, but that did not work. I’ve also tried restarting the metrics service (`pmm-admin restart linux:metrics`) to no avail. Rebooting the physical server only solves it for a few days to a week, then it happens again. I don’t understand why this only happens on vhost-01, vhost-02, and vhost-03; the other three vhost servers are unaffected and have never shown this error.
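For what it’s worth, I realize `prlimit` only changes the limit of the already-running process, so the change is lost whenever the exporter is restarted. If the limit itself turns out to be the problem, a systemd drop-in should make it stick; the unit name below is only a placeholder, since I’d have to look up what it is actually called on these hosts:

```
# Find the unit that actually runs node_exporter (name varies by PMM setup;
# "pmm-linux-metrics-42000.service" below is just a placeholder).
systemctl list-units --type=service | grep -i pmm

# Create a drop-in that raises the open-files limit for that unit.
sudo mkdir -p /etc/systemd/system/pmm-linux-metrics-42000.service.d
printf '[Service]\nLimitNOFILE=65536\n' | \
  sudo tee /etc/systemd/system/pmm-linux-metrics-42000.service.d/limits.conf

# Reload systemd and restart the exporter so the new limit takes effect.
sudo systemctl daemon-reload
sudo systemctl restart pmm-linux-metrics-42000.service
```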

Any help to troubleshoot would be greatly appreciated.

I’ve attached a screenshot of the spike that always occurs, to give a visual sense of what I’m dealing with. Stopping pmm-admin on the virtual machine is the only thing that stops it:

Hi,

You did not specify the PMM version here. What are the client and server versions?

If you look at the Prometheus Exporter Status dashboard, you should be able to see how many file descriptors the exporter is using. Typically it should not be more than about 20.

I wonder whether in your case that value is too high, or whether the limit of allowed open files for the process is extremely low.
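A quick way to check both on an affected VM is via the standard /proc interfaces, something along these lines:

```
# Soft and hard open-file limits the running exporter actually has.
grep 'open files' /proc/$(pidof node_exporter)/limits

# Number of file descriptors it currently holds
# (run as root if node_exporter runs under a different user).
sudo ls /proc/$(pidof node_exporter)/fd | wc -l
```

If the second number keeps climbing toward the limit, the process is leaking or holding descriptors; if the limit itself is tiny, raising it for the exporter’s service is the easier fix.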

PMM Server version: 5.1.3
PMM Client version: 1.17.2

I restarted pmm-admin on the problem virtual machines. The file descriptor counts are currently low, so I will have to wait and see what they show once it happens again.
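In the meantime I’m going to log the descriptor count periodically so I can catch the spike when it comes back. Just a rough sketch (the log path is arbitrary):

```
# Run as root so /proc/<pid>/fd is readable; appends a timestamped
# descriptor count for node_exporter every 30 seconds.
while true; do
  echo "$(date '+%F %T') $(ls /proc/$(pidof node_exporter)/fd 2>/dev/null | wc -l)" \
    >> /var/tmp/node_exporter_fds.log
  sleep 30
done
```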

The error is back. I checked how many files that process has open with `lsof -p $(pidof node_exporter) | wc | awk '{print $1}'`, which gives me 1028.

The ‘File Descriptors Used’ portion of the Prometheus Exporter Status dashboard says 7, then flashes 1k for a second, then goes back to 7.

What I don’t understand is that these limits are the same on every server. It’s only the virtual machines on these three servers (vhost-01, vhost-02, vhost-03) that are giving the “too many open files” error; it never occurs on the other vhosts (vhost-04, vhost-05, vhost-06).
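Next time it spikes I’ll also try to break down what those descriptors actually are, since the accept4 errors suggest the listener can no longer accept incoming scrape connections once the limit is hit. Something like this (just a sketch using lsof’s TYPE column) should show whether they are regular files or sockets:

```
# Count the exporter's open descriptors by type (REG, IPv4, sock, FIFO, ...).
sudo lsof -p $(pidof node_exporter) | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn

# If most of them are TCP sockets, list the peers to see who is holding
# connections open against port 42000 (-n/-P skip name resolution).
sudo lsof -p $(pidof node_exporter) -a -i TCP -n -P
```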