ProxySQL dropping client connection in case of node failure

Hi there! I'm trying to build a POC showing that, in case of a node failure during a query, ProxySQL will redirect the connection to a live backend node. I'm running sysbench against ProxySQL with a basic 3-node PXC cluster configured with just one active writer node (max_writers=1).
Problem: when I kill mysqld on the writer node, ProxySQL drops the client connection and sysbench fails with: 'HY000': Lost connection to MySQL server during query.
Expected behavior: killing mysqld doesn't affect the client connection, which gets redirected to a live backend node from backup_writer_hostgroup. If the SQL statement isn't part of a transaction, I'm expecting ProxySQL to rerun it on the live node.
Versions: ProxySQL 2.0.12-38-g58a909a, PXC 8.0.19-10, sysbench 1.0.20, CentOS 7, kernel 3.10.0-1127.13.1.el7.x86_64.
On the ProxySQL node I run:

sysbench /usr/share/sysbench/oltp_write_only.lua --db-driver=mysql --mysql-host=127.0.0.1 --mysql-user='sbuser' --mysql-password='sbpass' --mysql-port=6033 --mysql-db=sbtest --tables=1 --table_size=1000000 --db-ps-mode=disable --db-debug=on --threads=16 --report-interval=5 --time=300 --skip-trx=on --mysql-ignore-errors=1062 run

My ProxySQL tables:



Before killing mysqld on the writer node I have:
After killing mysqld:
So the backup writer has been promoted to the writer hostgroup, but the connection to sysbench was dropped and failed with 'HY000': Lost connection to MySQL server during query. What can I do to preserve the client connection and reroute it to the backup writer node? My proxysql.log from the time of killing mysqld is attached.





proxysql.log (33.7 KB)

Hi,
Before going ahead and starting to debug the issue, would it be possible for you to provide the following information:
mysql_user
global variables
Runtime version of the tables shown above.
And finally, if you can, align your version with the latest tag v2.0.13:
commit 91737e08d6df6f6346e383eec939266b064a95bc (HEAD -> v2.0.13, tag: v2.0.13, origin/v2.0.13)
thanks

Hi Marco, now I have ProxySQL version 2.0.13-107-g91737e0. The requested data is provided in the attached text file.

proxysql_config.txt (21.9 KB)

I'm now running sysbench with the '--mysql-ignore-errors=all' parameter and observing its report. From the moment the writer node is killed until a new writer is elected, there is a window a couple of seconds long during which transactions are not processed. My intention is to close this window and make ProxySQL reroute queries during a node failure in real time. Can it be done this way?

[ 386s ] thds: 16 tps: 422.02 qps: 8438.42 (r/w/o: 5896.30/1659.08/883.04) lat (ms,95%): 59.99 err/s: 0.00 reconn/s: 0.00
[ 387s ] thds: 16 tps: 486.78 qps: 9653.68 (r/w/o: 6766.97/1843.17/1043.53) lat (ms,95%): 44.98 err/s: 0.00 reconn/s: 0.00
[ 388s ] thds: 16 tps: 471.07 qps: 9510.36 (r/w/o: 6659.95/1829.26/1021.15) lat (ms,95%): 44.17 err/s: 0.00 reconn/s: 0.00
[ 389s ] thds: 16 tps: 452.18 qps: 9072.52 (r/w/o: 6343.46/1767.69/961.37) lat (ms,95%): 47.47 err/s: 0.00 reconn/s: 0.00
[ 390s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 16.00
[ 391s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 392s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 393s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 394s ] thds: 16 tps: 0.00 qps: 196.99 (r/w/o: 180.99/0.00/16.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 395s ] thds: 16 tps: 0.00 qps: 107.01 (r/w/o: 43.00/63.00/1.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 396s ] thds: 16 tps: 295.99 qps: 5796.76 (r/w/o: 4069.83/1096.96/629.97) lat (ms,95%): 6135.91 err/s: 0.00 reconn/s: 0.00
[ 397s ] thds: 16 tps: 455.00 qps: 9167.94 (r/w/o: 6423.96/1778.99/964.99) lat (ms,95%): 52.89 err/s: 0.00 reconn/s: 0.00
[ 398s ] thds: 16 tps: 432.84 qps: 8544.75 (r/w/o: 5972.73/1651.37/920.65) lat (ms,95%): 58.92 err/s: 0.00 reconn/s: 0.00
[ 399s ] thds: 16 tps: 474.22 qps: 9628.50 (r/w/o: 6732.14/1883.88/1012.47) lat (ms,95%): 48.34 err/s: 0.00 reconn/s: 0.00


Hi,
As you have also identified, and my tests confirm, there is a very brief service interruption when doing writes only.

[ 18s ] thds: 16 tps: 9.96 qps: 42.84 (r/w/o: 0.00/25.90/16.94) lat (ms,95%): 1973.38 err/s 0.00 reconn/s: 0.00
[ 19s ] thds: 16 tps: 17.97 qps: 68.88 (r/w/o: 0.00/35.94/32.94) lat (ms,95%): 2120.76 err/s 0.00 reconn/s: 0.00
[ 20s ] thds: 16 tps: 25.35 qps: 95.33 (r/w/o: 0.00/53.75/41.58) lat (ms,95%): 1376.60 err/s 0.00 reconn/s: 0.00 <---------------------- last good one
[ 21s ] thds: 16 tps: 0.93 qps: 13.05 (r/w/o: 0.00/4.66/8.39) lat (ms,95%): 1050.76 err/s 0.00 reconn/s: 0.00
[ 22s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 16.70       <---------------------- reconnect to new writer
[ 23s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00
[ 24s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00
[ 25s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00
[ 26s ] thds: 16 tps: 0.00 qps: 28.48 (r/w/o: 0.00/0.00/28.48) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00
[ 27s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00

[ 28s ] thds: 16 tps: 63.58 qps: 254.32 (r/w/o: 0.00/143.42/110.89) lat (ms,95%): 8484.79 err/s 0.00 reconn/s: 0.00

If you check ProxySQL, you will see it is very fast in detecting that the writer node went down, and also very fast in replacing it with a new writer. Most of the time is spent by sysbench itself, which needs to re-initialise its threads' connections.

That said, in the case of a write where the node has fully crashed, you cannot expect ProxySQL to be able to recover the write that is in flight. This needs to be covered by the application with a proper error-management approach.

If you perform tests with reads, you will see that ProxySQL is able to shift them from one node to another.

If you perform a proper shutdown or maintenance, you will see that ProxySQL shifts the writer without terminating connections.

But I do not think it is a fair expectation to have ProxySQL cover something that should be implemented in the application code, especially under the given conditions: no use of transactions (autocommit=1).

In all my code I implement a try/catch block that performs a retry in case of an issue, so that a simple retry lets the application minimize the impact of the small gap that can exist after a node crash or similar event.
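As an illustration only, here is a minimal retry sketch in Python (using PyMySQL against ProxySQL on 127.0.0.1:6033 with the sysbench credentials from above; the helper name, retry count, and delay are assumptions, not part of the original setup):

import time
import pymysql

def run_with_retry(sql, params=None, retries=3, delay=2.0):
    # Execute one statement through ProxySQL; if the backend connection is
    # lost (e.g. the writer was killed), wait for ProxySQL to promote a new
    # writer and retry on a fresh connection.
    last_err = None
    for _ in range(retries):
        conn = None
        try:
            conn = pymysql.connect(host="127.0.0.1", port=6033,
                                   user="sbuser", password="sbpass",
                                   database="sbtest", autocommit=True)
            with conn.cursor() as cur:
                cur.execute(sql, params)
            return
        except pymysql.err.OperationalError as err:
            # e.g. 2013 "Lost connection to MySQL server during query"
            last_err = err
            time.sleep(delay)
        finally:
            if conn is not None:
                try:
                    conn.close()
                except pymysql.MySQLError:
                    pass
    raise last_err

# Only retry blindly when the statement is idempotent or the application can
# verify whether it was applied; re-issuing an arbitrary INSERT may double-apply it.
run_with_retry("UPDATE sbtest1 SET c = %s WHERE id = %s", ("retried-value", 1))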

Also note that your settings are valid only for this specific write test and not for a real production setup.


Hi Marco,

thank you for your detailed answer. I understand that, regarding the database writes, we have to implement some kind of error handling in our application code, so that we can verify and retry any unsuccessful transactions.

To be honest, I somehow believed that ProxySQL could preserve the client connection in case of a backend node failure and hand it over to the new writer, and even that, when the client is not using transactions, it would automatically retry the SQL statement on the new writer.

Now I'm wondering how to measure the delay between the former writer receiving SIGKILL and the newly promoted writer starting to accept new connections. In other words, how can I check how long the service interruption lasts? Is there some script for that?
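For reference, one way to approximate this from the client side is a small probe loop that keeps issuing a cheap write through ProxySQL and reports the gap between the last success before the failure and the first success afterwards. A sketch in Python (assuming PyMySQL and a hypothetical probe table, e.g. CREATE TABLE sbtest.failover_probe (id INT PRIMARY KEY, ts DATETIME(6)), created beforehand):

import time
import pymysql

def probe_once():
    # One short-lived connection and one cheap, idempotent write through ProxySQL.
    conn = pymysql.connect(host="127.0.0.1", port=6033, user="sbuser",
                           password="sbpass", database="sbtest",
                           autocommit=True, connect_timeout=1, read_timeout=5)
    try:
        with conn.cursor() as cur:
            cur.execute("REPLACE INTO failover_probe (id, ts) VALUES (1, NOW(6))")
    finally:
        try:
            conn.close()
        except pymysql.MySQLError:
            pass

last_ok = None     # time of the last successful probe before the outage
first_fail = None  # time of the first failed probe
while True:
    now = time.monotonic()
    try:
        probe_once()
        if first_fail is not None:
            print("first error %.2fs after last success; writes recovered after %.2fs"
                  % (first_fail - last_ok, now - last_ok))
            break
        last_ok = now
    except pymysql.MySQLError:
        if first_fail is None and last_ok is not None:
            first_fail = now
    time.sleep(0.05)

Note that this measures the client-visible write outage through ProxySQL rather than the exact SIGKILL-to-promotion interval; recording the wall-clock time at which you issue the SIGKILL on the writer lets you split the gap into its kill-to-first-error and error-to-recovery parts.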

Many thanks