Hello Everyone,
This post is a little investigation of some weird behavior I've run into with my ProxySQL installation. I'm not sure about the conclusion: is it a feature or a bug?
Anyway,
We are using a pretty simple MySQL + ProxySQL installation: 3 MySQL servers (a master and 2 replicas) and a couple of ProxySQL instances on the application servers. Simple as it is.
On the MySQL side we are using pt-heartbeat to measure replication lag.
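For readers unfamiliar with it: pt-heartbeat runs an `--update` daemon on the master that refreshes a timestamp row every second, and the lag on a replica is derived by comparing that replicated timestamp with the current time. Roughly, as a sketch using the column names from our test.heartbeat table (the real tool generates its own statements and filters by the master's server_id):

```sql
-- On the master: pt-heartbeat --update -D test refreshes this row every second
UPDATE test.heartbeat SET ts = NOW() WHERE server_id = @@server_id;

-- On a replica: the lag is the age of the replicated timestamp
SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) AS lag_seconds
FROM test.heartbeat;
```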
On ProxySQL's side everything is also simple:
mysql> select * from mysql_replication_hostgroups;
+------------------+------------------+------------+--------------------+
| writer_hostgroup | reader_hostgroup | check_type | comment            |
+------------------+------------------+------------+--------------------+
| 10               | 11               | read_only  | Common replication |
+------------------+------------------+------------+--------------------+
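For reference, this configuration was created through the ProxySQL admin interface with something like the following (the hostgroup IDs are the ones from our setup):

```sql
INSERT INTO mysql_replication_hostgroups
       (writer_hostgroup, reader_hostgroup, check_type, comment)
VALUES (10, 11, 'read_only', 'Common replication');
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```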
This setup worked fine for years, until ProxySQL 2.3.2 was released. When I installed the recent 2.3.2 for testing, all servers became SHUNNED for some reason.
mysql> select hostgroup_id,hostname,status from runtime_mysql_servers where hostgroup_id=11 or hostgroup_id=10;
+--------------+------------+---------+
| hostgroup_id | hostname   | status  |
+--------------+------------+---------+
| 10           | 10.10.0.55 | SHUNNED |
| 11           | 10.10.0.53 | SHUNNED |
| 11           | 10.10.0.54 | SHUNNED |
| 11           | 10.10.0.55 | SHUNNED |
+--------------+------------+---------+
I've tried many ways to find the root of this issue. It's pretty clear it is all about replication, but what is the main cause?
After hours of digging I found some interesting things. It turns out to be a rare combination of circumstances.
- Our project set up replication and heartbeat about 10 years ago. Over those years many servers were replaced, leaving a lot of records about old servers in the test.heartbeat table (we use the test database for the heartbeat daemon):
+----------------------------+-----------+----------------+-----------+-----------------------+---------------------+
| ts                         | server_id | file           | position  | relay_master_log_file | exec_master_log_pos |
+----------------------------+-----------+----------------+-----------+-----------------------+---------------------+
| 2012-05-24T00:50:24.001090 | 3         | log_bin.000041 | 377519349 | NULL                  | NULL                |
| 2017-07-06T09:59:45.000520 | 106       | log_bin.000223 | 432800495 | log_bin.000790        | 310894215           |
| 2017-08-29T03:17:29.000600 | 112       | log_bin.000418 | 53579985  | log_bin.000193        | 107686777           |
| 2018-07-29T10:51:46.000440 | 28        | log_bin.002854 | 31313746  | NULL                  | NULL                |
| 2018-08-08T08:09:52.000600 | 27        | log_bin.001472 | 206351451 |                       | 0                   |
| 2019-04-08T18:39:49.000800 | 36        | log_bin.001529 | 385530635 | log_bin.002187        | 74147128            |
| 2019-08-23T05:24:33.000590 | 37        | log_bin.005040 | 9683159   |                       | 0                   |
| 2020-08-05T06:29:29.000890 | 44        | log_bin.013246 | 241877    |                       | 0                   |
| 2021-08-18T06:19:41.005030 | 49        | log_bin.015948 | 209660214 |                       | 0                   |
| 2021-12-18 11:54:43        | 53        | NULL           | NULL      | NULL                  | NULL                |
| 2021-12-18 11:54:48        | 54        | NULL           | NULL      | NULL                  | NULL                |
| 2022-01-22T16:03:36.003960 | 55        | log_bin.006671 | 81197987  |                       | 0                   |
+----------------------------+-----------+----------------+-----------+-----------------------+---------------------+
12 rows in set (0.00 sec)
- Recently @renecannao made this commit, which broke our ProxySQL setup:
https://github.com/sysown/proxysql/commit/2a5121e52f98cee7b61302d26f46aa0ef8e10809
In previous versions ProxySQL asked for the minimal time difference, which worked great with our setup: the old records made no difference to that query's result.
Now it selects the maximal time difference, and as a result the reported replication lag is about 10 years.
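The effect is easy to reproduce outside of MySQL. Here is a small Python sketch (not ProxySQL's actual code) using a few ts values from our heartbeat table above, rounded to whole seconds, and a hypothetical "now" one second after the freshest row:

```python
from datetime import datetime

# ts values from test.heartbeat; only server_id 55 is still being updated
heartbeat_ts = {
    3: datetime(2012, 5, 24, 0, 50, 24),    # decommissioned in 2012
    49: datetime(2021, 8, 18, 6, 19, 41),   # decommissioned in 2021
    55: datetime(2022, 1, 22, 16, 3, 36),   # current master
}

now = datetime(2022, 1, 22, 16, 3, 37)  # one second after the freshest row

# Old behavior: smallest difference -> stale rows are ignored
min_lag = min((now - ts).total_seconds() for ts in heartbeat_ts.values())

# New behavior (after the commit): largest difference -> the 2012 row dominates
max_lag = max((now - ts).total_seconds() for ts in heartbeat_ts.values())

print(min_lag)  # 1.0
print(max_lag)  # roughly ten years, expressed in seconds
```

With the old minimal-difference query the lag is 1 second; with the maximal-difference query any forgotten row from a long-dead master becomes the reported lag, which blows past any sane max_replication_lag and shuns every server.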
This change is important for multi-master environments, as far as I understand. But what about single-master? Now I have to clean up the records about all of the old masters. Maybe a cleanup flag or something should be added to pt-heartbeat?
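In the meantime, the manual fix is simply deleting the stale rows. In our case only server_ids 53, 54 and 55 are still alive, so (adjust the list for your own topology):

```sql
-- Drop heartbeat rows left behind by decommissioned servers
DELETE FROM test.heartbeat
WHERE server_id NOT IN (53, 54, 55);
```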
I'm afraid that someday our project will replace the MySQL servers once again, or we will change the master due to maintenance or something, and ProxySQL will fail.
I've written a doc in our wiki about this issue. But…
How can we mitigate it without relying on docs? Could someone give me some advice?
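One idea for mitigating it without relying on docs (my own suggestion, not something pt-heartbeat or ProxySQL provides): a scheduled event on the master that purges heartbeat rows which have stopped updating, so a forgotten master can never poison the lag query again. The event name and the retention window below are arbitrary choices, and this assumes the ts column compares cleanly against NOW(); pt-heartbeat may store it as a string, in which case the WHERE clause needs a cast:

```sql
-- Requires event_scheduler = ON; run on the master
CREATE EVENT purge_stale_heartbeat
  ON SCHEDULE EVERY 1 DAY
  DO
    DELETE FROM test.heartbeat
    WHERE ts < NOW() - INTERVAL 1 HOUR;
```

Only the row that is actively being refreshed survives, which is exactly the one ProxySQL should be looking at.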
Thank you in advance
BTW, is the new query for measuring replication lag a bug or a feature? Let's discuss.