In our 3-node Percona XtraDB Cluster (one of the nodes is virtual), the master node crashed on Tuesday 2017-12-12 at 07:08:57. The crash is well logged in /var/log/mysqld.log. Luckily the failover worked: the other node took over, and the crashed node rejoined after SST and resumed its work as master node. We’re happy about that.
There was no maintenance going on at the time of the crash, no application updates modifying the traffic to the database cluster or anything else.
Since the crash, however, the rate of writes to Galera cache pages (/var/lib/mysql/gcache.page.xxxxxx) has increased. Before the crash, one Galera cache page was written per week (!); now the rate is steady at approximately 20 per day.
From what I’ve read, Galera cache pages are written when a writeset is too big to fit in the regular circular cache. That seems to match our case: our monitoring tools report spikes in writeset traffic at (almost all of) the times a Galera cache page is created.
One specific part of our system generates large transactions, and a Galera cache page is often written when one of these is committed. But the match is not perfect. These transactions are generated much more often than cache pages are created. They also generate only approximately 15-40 MB of writeset traffic per transaction, whereas a cache page can hold 128 MB.
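To compare page creation with the commit times of those transactions, the page files can be tallied per day. Below is a minimal Python sketch of that, not something taken from our monitoring setup. The function name gcache_pages_per_day is made up for illustration, and the script assumes that the data directory is /var/lib/mysql and that a file's modification time is a usable stand-in for when the page was created. It only sees pages that are still on disk, so pages Galera has already removed are not counted.

```python
import glob
import os
from collections import Counter
from datetime import datetime


def gcache_pages_per_day(data_dir="/var/lib/mysql"):
    """Count gcache.page.* files in data_dir, grouped by modification date.

    Assumption: mtime (local time) approximates when the page was written.
    Pages that have already been deleted are not visible here.
    """
    counts = Counter()
    for path in glob.glob(os.path.join(data_dir, "gcache.page.*")):
        day = datetime.fromtimestamp(os.path.getmtime(path)).date()
        counts[day] += 1
    return counts


if __name__ == "__main__":
    # Print one line per day, oldest first.
    for day, n in sorted(gcache_pages_per_day().items()):
        print(day, n)
```

Run as a user that can read the data directory. The per-day counts can then be lined up against the writeset spikes reported by monitoring.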
Question 1: Do you think that these cache page writes are a problem?
Question 2: How do we get back to the low level of a single cache page written per week?
As mentioned, the traffic is the same now as before the crash/downtime, so I guess the change has to do with some configuration variables.
Any help is appreciated. If logs or anything else are needed, I will supply whatever I can.