ProxySQL loses servers after restart of mysqld (Percona XtraDB Cluster)

Hello,
I have a test setup: 4 VMs, a 3-node Percona XtraDB Cluster with ProxySQL running on each node, and 1 VM as a client.
RHEL 6.9
Percona-XtraDB-Cluster-57-5.7.19-29.22.3.el6.x86_64
proxysql-1.4.3-1.1.el6.x86_64
all from the Percona repository.
ProxySQL is set up with 2 hostgroups, one for write and one for read traffic, but the following problem occurs before any client connects:


(admin@localhost) [(none)]> select * from mysql_servers order by hostgroup_id,hostname;
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+
| 500 | 192.168.0.51 | 3306 | ONLINE | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
| 500 | 192.168.0.52 | 3306 | OFFLINE_SOFT | 1000000 | 0 | 1000 | 0 | 0 | 0 | |
| 500 | 192.168.0.53 | 3306 | OFFLINE_SOFT | 100 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.51 | 3306 | OFFLINE_SOFT | 100 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.52 | 3306 | ONLINE | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.53 | 3306 | ONLINE | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+ 
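For reference, this manual two-hostgroup configuration was created on the ProxySQL admin interface roughly as follows (a sketch reconstructed from the table above; only the hostgroups, hosts and weights are from my setup, everything else is defaults):

```sql
-- Sketch: manual mysql_servers setup (ProxySQL admin interface, port 6032).
-- 500 = write hostgroup, 501 = read hostgroup; weights steer preference.
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (500, '192.168.0.51', 3306, 1000000000);
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (500, '192.168.0.52', 3306, 1000000);
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (500, '192.168.0.53', 3306, 100);
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (501, '192.168.0.51', 3306, 100);
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (501, '192.168.0.52', 3306, 1000000000);
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (501, '192.168.0.53', 3306, 1000000000);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```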

I restart mysqld on one node (/etc/init.d/mysql restart), as we would do after a yum update run, for example.
After MySQL is back up again:


(admin@localhost) [(none)]> select * from mysql_servers order by hostgroup_id,hostname;
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
| 500 | 192.168.0.51 | 3306 | ONLINE | 1000000 | 0 | 1000 | 0 | 0 | 0 | WRITE |
| 501 | 192.168.0.52 | 3306 | ONLINE | 1000 | 0 | 1000 | 0 | 0 | 0 | READ |
| 501 | 192.168.0.53 | 3306 | ONLINE | 1000 | 0 | 1000 | 0 | 0 | 0 | READ |
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec) 

Note that this is the output of “select * from mysql_servers”, but “select * from runtime_mysql_servers” shows the same. This also happens to the ProxySQL on the 1st node if I restart mysqld on the 2nd node. It does not always get corrupted, but often enough. We have a customer using ProxySQL on his appserver (1.3.9 installed there) with the same effect. It does not lose all servers; sometimes it looks like this:


+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+
| 500 | 192.168.0.51 | 3306 | ONLINE | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
| 500 | 192.168.0.53 | 3306 | OFFLINE_SOFT | 100 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.51 | 3306 | OFFLINE_SOFT | 100 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.52 | 3306 | ONLINE | 1000 | 0 | 1000 | 0 | 0 | 0 | READ |
| 501 | 192.168.0.53 | 3306 | ONLINE | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+ 

(1st node ProxySQL, restart of mysql on the second node.) ProxySQL keeps this broken state until we run “load mysql servers to memory;”. Depending on the state, the next shutdown of a node could lead to downtime of the application, as no failover is possible. Is there anything more I could provide to get this fixed?
Hubertus

/var/lib/proxysql/proxysql_galera_checker.log entries around a /etc/init.d/mysql stop && sleep 10 && /etc/init.d/mysql start
(restart node 1 proxysql on node 3)

see attached prox.galera.log.txt (too much text)

After restart
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+
| 500 | 192.168.0.52 | 3306 | ONLINE | 1000000 | 0 | 1000 | 0 | 0 | 0 | |
| 500 | 192.168.0.53 | 3306 | OFFLINE_SOFT | 100 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.51 | 3306 | ONLINE | 1000 | 0 | 1000 | 0 | 0 | 0 | READ |
| 501 | 192.168.0.52 | 3306 | OFFLINE_SOFT | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.53 | 3306 | ONLINE | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+

Why is the 1st node not coming back into the hostgroup? While the 1st node was down, it looked like this:

+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+
| 500 | 192.168.0.52 | 3306 | ONLINE | 1000000 | 0 | 1000 | 0 | 0 | 0 | |
| 500 | 192.168.0.53 | 3306 | OFFLINE_SOFT | 100 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.52 | 3306 | OFFLINE_SOFT | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.53 | 3306 | ONLINE | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+

I would expect “OFFLINE” / “OFFLINE_HARD” entries to be present all the time, not entries disappearing.
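For reference, this is how we recover from the broken state on the admin interface (a sketch; “load mysql servers to memory” is the command mentioned above, which is equivalent to loading from disk):

```sql
-- ProxySQL admin interface (port 6032):
-- pull the intact server list from disk into memory...
LOAD MYSQL SERVERS FROM DISK;
-- ...and push it into the runtime, replacing the broken runtime state.
LOAD MYSQL SERVERS TO RUNTIME;
```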

show global variables; attached prox.global.txt

prox.global.txt (10.8 KB)

prox.galera.log.txt (12.1 KB)


Hello
sad to see that no one has the same issue (or does no one use ProxySQL with PXC?).
I upgraded to
proxysql-1.4.4-1.1.el6.x86_64
and the problem persists.
(admin@localhost) [(none)]> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id,hostname;

+--------------+--------------+--------------+
| hostgroup_id | hostname | status |
+--------------+--------------+--------------+
| 500 | 192.168.0.51 | ONLINE |
| 500 | 192.168.0.53 | OFFLINE_SOFT |
| 501 | 192.168.0.51 | OFFLINE_SOFT |
| 501 | 192.168.0.52 | ONLINE | # was writer before restart, does not get back into writer hostgroup
| 501 | 192.168.0.53 | ONLINE |
+--------------+--------------+--------------+

Then I restart mysql on the 1st node:


+--------------+--------------+--------------+
| hostgroup_id | hostname | status |
+--------------+--------------+--------------+
| 500 | 192.168.0.52 | ONLINE | # now the 2nd node is back again as writer but not as a possible reader
| 500 | 192.168.0.53 | OFFLINE_SOFT |
| 501 | 192.168.0.51 | ONLINE |
| 501 | 192.168.0.53 | ONLINE |
+--------------+--------------+--------------+

mysql stop on node 3

+--------------+--------------+--------+
| hostgroup_id | hostname | status |
+--------------+--------------+--------+
| 500 | 192.168.0.52 | ONLINE |
| 501 | 192.168.0.51 | ONLINE |
+--------------+--------------+--------+

mysql start on node 3

+--------------+--------------+--------+
| hostgroup_id | hostname | status |
+--------------+--------------+--------+
| 500 | 192.168.0.52 | ONLINE |
| 501 | 192.168.0.51 | ONLINE |
| 501 | 192.168.0.53 | ONLINE |
+--------------+--------------+--------+

node 2 stop

+--------------+--------------+--------+
| hostgroup_id | hostname | status |
+--------------+--------------+--------+
| 500 | 192.168.0.51 | ONLINE |
| 501 | 192.168.0.53 | ONLINE |
+--------------+--------------+--------+

It seems that at least the service does not go down, but no load balancing of reads is done anymore, which is not the expected behaviour.

Hi HubertusKrogmann,

Sorry for the delayed response.

I think you have configured ProxySQL manually. The proxysql-admin tool checks cluster node membership and re-configures ProxySQL if the membership changes.

Please configure ProxySQL with the proxysql-admin tool:

https://www.percona.com/doc/percona-xtradb-cluster/LATEST/howtos/proxysql.html#automatic-configuration

Please let me know if you have any questions or concerns.

Hello
I configured ProxySQL manually, as I had a few more users to set up and wanted read/write splitting.
I configured /etc/proxysql-admin.cnf and created the users mentioned there.
The scheduler was set up (line with id=1):
+----+--------+-------------+----------------------------------+------+------+------+------+-----------------------------------------------+---------+
| id | active | interval_ms | filename | arg1 | arg2 | arg3 | arg4 | arg5 | comment |
+----+--------+-------------+----------------------------------+------+------+------+------+-----------------------------------------------+---------+
| 1 | 1 | 10000 | /usr/bin/proxysql_galera_checker | 500 | 501 | 1 | 0 | /var/lib/proxysql/proxysql_galera_checker.log | |
| 10 | 1 | 3000 | /usr/bin/proxysql_galera_checker | 500 | 501 | 1 | 1 | /var/lib/proxysql/proxysql_galera_check.log | |
+----+--------+-------------+----------------------------------+------+------+------+------+-----------------------------------------------+---------+
proxysql-admin added the line (id=10)
/var/lib/proxysql/proxysql_node_monitor.log shows
Thu Feb 8 14:46:34 CET 2018 Cluster node (501:192.168.0.52:3306) current status ‘ONLINE’ in ProxySQL database!
Thu Feb 8 14:46:34 CET 2018 Cluster node (501:192.168.0.53:3306) current status ‘ONLINE’ in ProxySQL database!
Thu Feb 8 14:46:34 CET 2018 Cluster node (500:192.168.0.51:3306) current status ‘ONLINE’ in ProxySQL database!
Thu Feb 8 14:46:34 CET 2018 ###### Loading mysql_servers config into runtime ######
every interval, whether configured manually or through proxysql-admin.

I deleted the row with id=1.
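Concretely, that deletion was done on the admin interface roughly like this (a sketch):

```sql
-- Remove the manually added checker row, keep the proxysql-admin
-- managed one (id=10), then activate and persist the change.
DELETE FROM scheduler WHERE id = 1;
LOAD SCHEDULER TO RUNTIME;
SAVE SCHEDULER TO DISK;
```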

Now, after calling “proxysql-admin --config-file=/etc/proxysql-admin.cnf --enable”
(admin@localhost) [(none)]> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id,hostname;
+--------------+--------------+--------+
| hostgroup_id | hostname | status |
+--------------+--------------+--------+
| 500 | 192.168.0.51 | ONLINE |
| 501 | 192.168.0.52 | ONLINE |
| 501 | 192.168.0.53 | ONLINE |
+--------------+--------------+--------+

The other members of the groups are not shown at all. (Wanted behaviour?)

shutting down node .51:

(admin@localhost) [(none)]> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id,hostname;
+--------------+--------------+--------------+
| hostgroup_id | hostname | status |
+--------------+--------------+--------------+
| 500 | 192.168.0.53 | ONLINE |
| 501 | 192.168.0.51 | OFFLINE_SOFT |
| 501 | 192.168.0.52 | ONLINE |
+--------------+--------------+--------------+
3 rows in set (0.00 sec)

(admin@localhost) [(none)]> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id,hostname;
+--------------+--------------+--------+
| hostgroup_id | hostname | status |
+--------------+--------------+--------+
| 500 | 192.168.0.53 | ONLINE |
| 501 | 192.168.0.52 | ONLINE |
+--------------+--------------+--------+
2 rows in set (0.00 sec)

starting it again:

(admin@localhost) [(none)]> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id,hostname;
+--------------+--------------+--------+
| hostgroup_id | hostname | status |
+--------------+--------------+--------+
| 500 | 192.168.0.53 | ONLINE |
| 501 | 192.168.0.51 | ONLINE |
| 501 | 192.168.0.52 | ONLINE |
+--------------+--------------+--------+

So it does come back now (at least in this one test), but it no longer shows
“unused” nodes as OFFLINE_SOFT. That may be the right behaviour
when configured with proxysql-admin.

What does proxysql-admin do exactly?
Sometimes we configure PXC with an extra interface for Galera traffic, so the gcomm:// IPs are
different from the ones that ProxySQL can reach (if it is not running on a PXC node). Then
export PROXYSQL_HOSTNAME=“host-ip” does not match the gcomm:// IPs and proxysql-admin
will not be able to configure the servers correctly.
Because of that I would like to know what proxysql-admin does to get a correctly
working proxysql_galera_checker and proxysql_node_monitor.
Also, for configuring ProxySQL with Ansible and the like, the exact settings are important.

https://www.percona.com/doc/percona-xtradb-cluster/LATEST/howtos/proxysql.html
has a section “Manual Configuration” so “you did not use proxysql-admin” can’t be the
answer to that behaviour, as proxysql-admin only does some configuration into proxysql.
I would like to see what is not correct, and where proxysql_galera_checker and proxysql_node_monitor
do not work correctly. I would like to see errors or timeouts because of the manual
configuration.

Every additional setting (query rules, more users, …) will alter what proxysql-admin
has done, so we will be back to a manual configuration.

Hmm, proxysql-admin configured:

(admin@localhost) [(none)]> select * from mysql_servers order by hostgroup_id,hostname;
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
| 500 | 192.168.0.53 | 3306 | ONLINE | 1000000 | 0 | 1000 | 0 | 0 | 0 | WRITE |
| 501 | 192.168.0.51 | 3306 | ONLINE | 1000 | 0 | 1000 | 0 | 0 | 0 | READ |
| 501 | 192.168.0.52 | 3306 | ONLINE | 1000 | 0 | 1000 | 0 | 0 | 0 | READ |
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+

Manually configured:

(admin@localhost) [(none)]> select * from mysql_servers order by hostgroup_id,hostname;
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+
| 500 | 192.168.0.51 | 3306 | ONLINE | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
| 500 | 192.168.0.52 | 3306 | OFFLINE_SOFT | 1000000 | 0 | 1000 | 0 | 0 | 0 | |
| 500 | 192.168.0.53 | 3306 | OFFLINE_SOFT | 100 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.51 | 3306 | OFFLINE_SOFT | 100 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.52 | 3306 | ONLINE | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
| 501 | 192.168.0.53 | 3306 | ONLINE | 1000000000 | 0 | 1000 | 0 | 0 | 0 | |
+--------------+--------------+------+--------------+------------+-------------+-----------------+---------------------+---------+----------------+---------+

Of course I wanted all servers in both groups, for availability to fail over.
Am I right that proxysql_node_monitor / proxysql_galera_checker use the “comment” column to determine which state the nodes have, and
that those scripts do not function properly because these comments are missing?
In that case an exact documentation of what proxysql-admin does, and of how to build a configuration manually, would be needed even more.
Regards

Reading those scripts, proxysql-admin configures:
/etc/proxysql-admin.cnf

export WRITE_HOSTGROUP_ID=“500”
export READ_HOSTGROUP_ID=“501”
export MODE=“singlewrite|loadbal”

singlewrite -> 1 Write 2 Read and write != read
loadbal -> 3 Write 3 Read

while proxysql_galera_checker has an option “writers are readers”
Usage: $0 <hostgroup_id write> [hostgroup_id read] [number writers] [writers are readers 0|1] [log_file]

which proxysql_node_monitor is not able to use?
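Going by that usage line, “writers are readers” for the checker itself should just be arg4 of the scheduler row (a sketch, assuming the argument order from the usage string above):

```sql
-- arg1 = write hostgroup, arg2 = read hostgroup, arg3 = number of
-- writers, arg4 = writers-are-readers flag, arg5 = log file.
UPDATE scheduler SET arg4 = '1'
WHERE filename = '/usr/bin/proxysql_galera_checker';
LOAD SCHEDULER TO RUNTIME;
SAVE SCHEDULER TO DISK;
```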

That would make a good feature request for proxysql on PXC to have
singlewriteallread : 1 Write 3 Read (aka writers are readers)

because

  • more readers are available in the typical 3 node setup
  • in situations where you take 2 nodes down (maintenance …) the one who gets writes also can serve reads
    otherwise it looks like this and any query rules on “SELECT.*” to use a read hostgroup will get problems.

(admin@localhost) [(none)]> select * from runtime_mysql_servers order by hostgroup_id,hostname;
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
| 500 | 192.168.0.51 | 3306 | ONLINE | 1000000 | 0 | 1000 | 0 | 0 | 0 | WRITE |
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+

Hi HubertusKrogmann,

Replies inline

so it does come back now (at least in this one test) but it does not show “unused” nodes as OFFLINE_SOFT anymore. That may be the right behaviour when configured with proxysql-admin.

Yes, it is expected behavior. An OFFLINE_SOFT node will be removed from the proxysql db and added back when it is ONLINE. From proxysql-admin 1.4.5 onwards, an OFFLINE node will remain in the proxysql db with status OFFLINE_HARD.

What does proxysql-admin do exactly?
Sometimes we configure PXC with an extra interface for Galera traffic, so the gcomm:// IPs are different from the ones that ProxySQL can reach (if it is not running on a PXC node). Then export PROXYSQL_HOSTNAME=“host-ip” does not match the gcomm:// IPs and proxysql-admin will not be able to configure the servers correctly. Because of that I would like to know what proxysql-admin does to get a correctly working proxysql_galera_checker and proxysql_node_monitor. Also, for configuring ProxySQL with Ansible and the like, the exact settings are important.

The proxysql-admin tool will help to configure your cluster nodes based on loadbal/singlewrite mode. The IPs are taken from the wsrep_incoming_address status variable. PROXYSQL_HOSTNAME is only used for login purposes. If PROXYSQL_HOSTNAME does not match the wsrep_incoming_address status variable, proxysql-admin will terminate with an appropriate error message. The proxysql_galera_checker and proxysql_node_monitor scripts will work properly if you configure PXC with the proxysql-admin tool.
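To illustrate the check itself: the galera checker classifies each node mainly from its wsrep_local_state status variable (4 = Synced, 2 = Donor/Desynced). A minimal shell sketch of that classification, simplified and not the script’s exact code:

```shell
# Map a node's wsrep_local_state to the status the checker would set.
# 4 (Synced) -> ONLINE; 2 (Donor/Desynced) -> ONLINE only if donors
# are allowed to serve traffic; everything else -> OFFLINE_SOFT.
classify_node() {
    wsrep_state="$1"   # value of wsrep_local_state on the node
    allow_donor="$2"   # 1 if a Donor node may keep serving traffic
    case "$wsrep_state" in
        4) echo "ONLINE" ;;
        2) if [ "$allow_donor" = "1" ]; then
               echo "ONLINE"
           else
               echo "OFFLINE_SOFT"
           fi ;;
        *) echo "OFFLINE_SOFT" ;;
    esac
}
```

In the real script the state is read with a mysql client call against each node; a node that cannot be reached at all is marked down as well.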

https://www.percona.com/doc/percona-xtradb-cluster/LATEST/howtos/proxysql.html
has a section “Manual Configuration” so “you did not use proxysql-admin” can’t be the
answer to that behaviour, as proxysql-admin only does some configuration into proxysql.
I would like to see what is not correct, and where proxysql_galera_checker and proxysql_node_monitor
do not work correctly. I would like to see errors or timeouts because of the manual
configuration.

Even if you configure ProxySQL manually, proxysql_galera_checker and proxysql_node_monitor will monitor the PXC status properly. Make sure to specify the login credentials and custom variables properly in /etc/proxysql-admin.cnf.

Am I right that proxysql_node_monitor / proxysql_galera_checker use the “comment” column to determine which state the nodes have, and
that those scripts do not function properly because these comments are missing?
In that case an exact documentation of what proxysql-admin does, and of how to build a configuration manually, would be needed even more.
Regards

The “comment” column helps the user to identify the node state. The scripts do not use this “comment” to determine the node state.

while proxysql_galera_checker has an option “writers are readers”
Usage: $0 <hostgroup_id write> [hostgroup_id read] [number writers] [writers are readers 0|1] [log_file]

which proxysql_node_monitor is not able to use?

That would make a good feature request for proxysql on PXC to have
singlewriteallread : 1 Write 3 Read (aka writers are readers)

because

  • more readers are available in the typical 3 node setup
  • in situations where you take 2 nodes down (maintenance …) the one who gets writes also can serve reads
    otherwise it looks like this and any query rules on “SELECT.*” to use a read hostgroup will get problems.

Yes, you are right. We already have one ticket to change the current behavior https://jira.percona.com/browse/PSQLADM-12