MySQL Freezing, but shell is showing only sleeping queries

We have an MySQL master-master replication setup with mysql-mmm in Ubuntu 20.04. We recently upgraded from CentOS 6 and MySQL 5.6 and kept most of the settings.
Both of the machines have 48GB of memory, 10 CPU’s and contains 250GB in of data in >100.000 tables spread over >400 databases.

But after the upgrade we are experiencing a couple of freezes per day, on the active and passive server.
Tcpdump is showing me that the queries (from mysql-mmm control vps) are coming and going, but it just takes a longer time and then the timeout of 10 seconds just kicks in.

slow log starts with a lot of “SHOW SLAVE STATUS” and after that a INSERT’s and a few DELETE calls, but not a single ‘normal’ SELECT query.

How can we debug this? Any help is greatly appreciated

mysql ini file

[mysqld]
auto_increment_increment = 3
auto_increment_offset = 2
basedir = /usr
bind-address = 0.0.0.0
binlog_format = ROW
binlog_gtid_simple_recovery = false
datadir = /var/lib/mysql
expire_logs_days = 10
innodb_buffer_pool_size = 32G
innodb_file_per_table
innodb_log_file_size = 8G
innodb_read_io_threads = 8
innodb_stats_on_metadata = 0
innodb_write_io_threads = 8
join_buffer_size = 2M
key_buffer_size = 32M
log-error = /var/log/mysql/error.log
log_bin = /var/log/mysql/mysql-bin.log
log_slave_updates
master_info_repository = table
max_allowed_packet = 64M
max_binlog_size = 2G
max_connections = 1000
open_files_limit = 5000
pid-file = /var/run/mysqld/mysqld.pid
port = 3306
query_cache_size = 0
query_cache_type = 0
read_buffer_size = 1M
read_rnd_buffer_size = 1M
relay-log = /var/log/mysql/mysql-relay-bin.log
relay_log_info_repository = table
server-id = 101
skip-external-locking
skip-name-resolve
slave_net_timeout = 3600
slow_launch_time = 2
slow_query_log
socket = /var/run/mysqld/mysqld.sock
sort_buffer_size = 1M
sql_mode = ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
ssl = false
ssl-ca = /etc/mysql/cacert.pem
ssl-cert = /etc/mysql/server-cert.pem
ssl-key = /etc/mysql/server-key.pem
table_open_cache = 1995
table_open_cache_instances = 6
thread_cache_size = 64
thread_stack = 256K
tmpdir = /tmp
user = mysql
wait_timeout = 360

slow log:

# Time: 2021-01-13T20:17:16.178603Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33069778
# Query_time: 113.954901  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;
# Time: 2021-01-13T20:17:16.178651Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33070537
# Query_time: 105.950972  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;
# Time: 2021-01-13T20:17:16.178754Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33070555
# Query_time: 105.389985  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;
# Time: 2021-01-13T20:17:16.178895Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33071147
# Query_time: 97.946772  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;
# Time: 2021-01-13T20:17:16.179039Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33069831
# Query_time: 113.394795  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;
# Time: 2021-01-13T20:17:16.179071Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33071896
# Query_time: 89.943642  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;
# Time: 2021-01-13T20:17:16.179229Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33072722
# Query_time: 81.936055  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;
# Time: 2021-01-13T20:17:16.179250Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33071925
# Query_time: 89.382786  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;
# Time: 2021-01-13T20:17:16.179324Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33071181
# Query_time: 97.386881  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;
# Time: 2021-01-13T20:17:16.179366Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33072771
# Query_time: 81.378021  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;
# Time: 2021-01-13T20:17:16.179433Z
# User@Host: mmm_monitor[mmm_monitor] @  [10.0.0.2]  Id: 33073331
# Query_time: 73.928755  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1610569036;
SHOW SLAVE STATUS;

error file:

2021-01-13T20:17:16.182671Z 33075782 [Note] Got an error reading communication packets
2021-01-13T20:17:16.182910Z 33075785 [Note] Got an error reading communication packets
2021-01-13T20:17:16.182946Z 33075788 [Note] Got an error reading communication packets
2021-01-13T20:17:16.183181Z 33075793 [Note] Got an error reading communication packets
2021-01-13T20:17:16.184134Z 33075805 [Note] Got an error reading communication packets
2021-01-13T20:17:16.184501Z 33075812 [Note] Got an error reading communication packets
2021-01-13T20:17:16.184793Z 33075817 [Note] Got an error reading communication packets
2021-01-13T20:17:16.185001Z 33075820 [Note] Got an error reading communication packets
2021-01-13T20:17:16.185073Z 33075822 [Note] Got an error reading communication packets
2021-01-13T20:17:16.185194Z 33075824 [Note] Got an error reading communication packets
2021-01-13T20:17:16.185269Z 33075823 [Note] Got an error reading communication packets
2021-01-13T20:17:16.185639Z 33075832 [Note] Got an error reading communication packets
2021-01-13T20:17:16.185684Z 33075835 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186003Z 33075841 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186025Z 33075842 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186073Z 33075843 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186135Z 33075844 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186224Z 33075845 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186296Z 33075846 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186412Z 33075847 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186460Z 33075848 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186583Z 33075849 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186631Z 33075850 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186717Z 33075851 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186805Z 33075852 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186834Z 33075853 [Note] Got an error reading communication packets
2021-01-13T20:17:16.186995Z 33075854 [Note] Got an error reading communication packets
2021-01-13T20:17:16.187008Z 33075855 [Note] Got an error reading communication packets
2021-01-13T20:17:16.203678Z 33075753 [Note] Aborted connection 33075753 to db: ‘unconnected’ user: ‘zabbix’ host: ‘127.0.0.1’ (Got an error writing communication packets)
2021-01-13T20:17:16.187373Z 33075857 [Note] Got an error reading communication packets
2021-01-13T20:17:16.187477Z 33075858 [Note] Got an error reading communication packets
2021-01-13T20:17:16.187589Z 33075860 [Note] Got an error reading communication packets
2021-01-13T20:17:16.187554Z 33075859 [Note] Got an error reading communication packets
2021-01-13T20:17:16.187613Z 33075861 [Note] Got an error reading communication packets
2021-01-13T20:17:16.187787Z 33075862 [Note] Got an error reading communication packets
2021-01-13T20:17:16.187835Z 33075863 [Note] Got an error reading communication packets
2021-01-13T20:17:16.187994Z 33075864 [Note] Got an error reading communication packets
2021-01-13T20:17:16.188012Z 33075865 [Note] Got an error reading communication packets
2021-01-13T20:17:16.188055Z 33075866 [Note] Got an error reading communication packets
2021-01-13T20:17:16.188150Z 33075867 [Note] Got an error reading communication packets
2021-01-13T20:17:16.188186Z 33075868 [Note] Got an error reading communication packets
2021-01-13T20:17:16.188369Z 33075869 [Note] Got an error reading communication packets
2021-01-13T20:17:16.188387Z 33075870 [Note] Got an error reading communication packets
2021-01-13T20:17:16.188402Z 33075871 [Note] Got an error reading communication packets
2021-01-13T20:17:16.187198Z 33075856 [Note] Got an error reading communication packets
2021-01-13T20:17:26.186038Z 33075791 [Note] Got timeout reading communication packets
2021-01-13T20:17:26.186468Z 33075794 [Note] Got timeout reading communication packets
2021-01-13T20:17:26.191292Z 33075755 [Note] Got timeout reading communication packets
2021-01-13T20:17:26.191335Z 33075829 [Note] Got timeout reading communication packets
2021-01-13T20:17:26.191391Z 33075757 [Note] Got timeout reading communication packets
2021-01-13T20:17:26.191489Z 33075756 [Note] Got timeout reading communication packets
2021-01-13T20:17:26.191530Z 33075761 [Note] Got timeout reading communication packets

This is wrong, but not your problem. This should be ‘2’ on both servers. and auto-increment-offset should be 1 on server A and 2 on server B.

mysql-mmm is OLD, like almost 10 years OLD and not supported by anyone anymore. You should look into either using Percona XtraDB Cluster + ProxySQL, or if you want to keep the master/master, check out Orchestrator.

You said you upgraded from MySQL 5.6. To what version? 5.7? 8.0?

What does ‘SHOW PROCESSLIST’ give you during these events?

1 Like

This is wrong, but not your problem. This should be ‘2’ on both servers. and auto-increment-offset should be 1 on server A and 2 on server B.

That is just a historical value: from when it had a replication loop with 3 servers.

mysql-mmm is OLD, like almost 10 years OLD and not supported by anyone anymore. You should look into either using Percona XtraDB Cluster + ProxySQL, or if you want to keep the master/master, check out Orchestrator.

I know, the setup was made 5 years ago by the (managed) hosting company. I would rather now go with a Galera cluster, but the proposal has been shot down because there is no support by the (managed) hosting company (and indirect by the boss).

You said you upgraded from MySQL 5.6. To what version? 5.7? 8.0?

i was hoping this would be clear, because it was posted in the MySQL 5.7 thread, so yes it was upgraded to 5.7.

What does ‘SHOW PROCESSLIST’ give you during these events?

it does what the title of this post says: every query is sleeping

1 Like

Percona XtraDB Cluster is Galera cluster :slight_smile: “Galera” is just the name of the underlying communications library.

because it was posted in the MySQL 5.7 thread,

Haha. So many posts are posted in the wrong places, I never assume and always ask.

‘Sleep’ means ‘idle’. means no activity happening. If you repeat show processlist and those numbers in the ‘Time’ column increase, then mysql isn’t frozen.

I would enable all of the debugging logs possible for mysql-mmm and trace what is happening. Check ‘read_only’ on mysql and see if mmm is setting this incorrectly.

1 Like

oh i learned something =)

I would enable all of the debugging logs possible for mysql-mmm and trace what is happening.

yeah i did this as well, it’s giving me a timeout in the logs:

2021/01/15 06:31:21 ERROR Check ‘rep_backlog’ on ‘db02’ has failed for 16 seconds! Message: ERROR: Timeout
2021/01/15 06:31:21 ERROR Check ‘rep_threads’ on ‘db02’ has failed for 16 seconds! Message: ERROR: Timeout
2021/01/15 06:32:16 WARN Check ‘rep_threads’ on ‘db02’ is in unknown state! Message: UNKNOWN: Connect error (host = 10.0.0.12:3306, user = mmm_monitor)! Lost connection to MySQL server at ‘waiting for initial communication packet’, system error: 110
2021/01/15 06:32:16 WARN Check ‘rep_backlog’ on ‘db02’ is in unknown state! Message: UNKNOWN: Connect error (host = 10.0.0.12:3306, user = mmm_monitor)! Lost connection to MySQL server at ‘waiting for initial communication packet’, system error: 110
2021/01/15 06:32:27 ERROR Check ‘mysql’ on ‘db02’ has failed for 14 seconds! Message: ERROR: Connect error (host = 10.0.0.12:3306, user = mmm_monitor)! Lost connection to MySQL server at ‘waiting for initial communication packet’, system error: 110

I would enable all of the debugging logs possible for mysql-mmm and trace what is happenin

yes, i did this but i didn’t see more information then the general log would provide
and i checked this with tcpdump in the mysql-mmm control vps and it shows me that the query is going through the network, but it just takes a lot of time.

unless i interpreted this wrong (it’s my first time working with tcpdump):

Check ‘read_only’ on mysql and see if mmm is setting this incorrectly.

but then i would see this in the MySQL error log? What would i be able to find if i search in this direction?

1 Like

request:

1 Like

tcp ack/seq update

1 Like

If I understand this correctly, mmm is requesting ‘show slave status’, but mysql is taking a long time to reply, which causes mmm to timeout waiting for said reply?

What happens when you run ‘show slave status’ manually? Does it return right away?

1 Like

Yes, this is correct. But it only happens during the “freeze”. Normally the query returns instantly.

I tried running the deadlock logger, but it didn’t log anything.

1 Like

The deadlock logger is only for InnoDB and has nothing to do with “mysql”. Please explain “freeze”. I guess I’m not entirely understanding what is not working. During one of these “freeze” periods, can you still access the mysql CLI? If so, can you run SELECT queries directly? What about SHOW SLAVE STATUS? During a freeze, what does SELECT @@read_only return for both nodes?

1 Like

I have to write a script to catch this. Since it only “freezes” for 1 or 2 times a day. I’ll get those answers hopefully within a few days.

1 Like

So i could catch it on the passive server.
I created a script which would question every second the database with a “SHOW PROCESSLIST” query. If the script would catch a long running “SHOW SLAVE STATUS”, then another background process will try to execute another ‘SHOW SLAVE STATUS’. But that last query just stalls.
I removed a lot of lines of this log, since a lot of entries are redundant.

The “SHOW PROCESSLIST” output is in JSON.

2021-01-18 20:01:41
[{“Id”:“54908570”,“User”:“system user”,“Host”:"",“db”:null,“Command”:“Connect”,“Time”:“299649”,“State”:“Queueing master event to the relay log”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“54908571”,“User”:“system user”,“Host”:"",“db”:null,“Command”:“Connect”,“Time”:“6”,“State”:“System lock”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“54908618”,“User”:“replication”,“Host”:“10.0.0.12:28894”,“db”:null,“Command”:“Binlog Dump”,“Time”:“299637”,“State”:“Master has sent all binlog to slave; waiting for more updates”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73314445”,“User”:“replication”,“Host”:“127.0.0.1:54910”,“db”:null,“Command”:“Binlog Dump”,“Time”:“21232”,“State”:“Master has sent all binlog to slave; waiting for more updates”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73317611”,“User”:“admin”,“Host”:“localhost”,“db”:“banen”,“Command”:“Query”,“Time”:“0”,“State”:“starting”,“Info”:“SHOW PROCESSLIST”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345150”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54634”,“db”:null,“Command”:“Query”,“Time”:“2”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345151”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54636”,“db”:null,“Command”:“Query”,“Time”:“2”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”}]

2021-01-18 20:01:42
[{“Id”:“54908570”,“User”:“system user”,“Host”:"",“db”:null,“Command”:“Connect”,“Time”:“299650”,“State”:“Queueing master event to the relay log”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“54908571”,“User”:“system user”,“Host”:"",“db”:null,“Command”:“Connect”,“Time”:“7”,“State”:“System lock”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“54908618”,“User”:“replication”,“Host”:“10.0.0.12:28894”,“db”:null,“Command”:“Binlog Dump”,“Time”:“299638”,“State”:“Master has sent all binlog to slave; waiting for more updates”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73314445”,“User”:“replication”,“Host”:“127.0.0.1:54910”,“db”:null,“Command”:“Binlog Dump”,“Time”:“21233”,“State”:“Master has sent all binlog to slave; waiting for more updates”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73317611”,“User”:“admin”,“Host”:“localhost”,“db”:“banen”,“Command”:“Query”,“Time”:“0”,“State”:“starting”,“Info”:“SHOW PROCESSLIST”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345150”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54634”,“db”:null,“Command”:“Query”,“Time”:“3”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345151”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54636”,“db”:null,“Command”:“Query”,“Time”:“3”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345152”,“User”:“admin”,“Host”:“localhost”,“db”:“banen”,“Command”:“Query”,“Time”:“1”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”}]

2021-01-18 20:01:48
[{“Id”:“54908570”,“User”:“system user”,“Host”:"",“db”:null,“Command”:“Connect”,“Time”:“299656”,“State”:“Queueing master event to the relay log”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“54908571”,“User”:“system user”,“Host”:"",“db”:null,“Command”:“Connect”,“Time”:“13”,“State”:“System lock”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“54908618”,“User”:“replication”,“Host”:“10.0.0.12:28894”,“db”:null,“Command”:“Binlog Dump”,“Time”:“299644”,“State”:“Master has sent all binlog to slave; waiting for more updates”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73314445”,“User”:“replication”,“Host”:“127.0.0.1:54910”,“db”:null,“Command”:“Binlog Dump”,“Time”:“21239”,“State”:“Master has sent all binlog to slave; waiting for more updates”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73317611”,“User”:“admin”,“Host”:“localhost”,“db”:“banen”,“Command”:“Query”,“Time”:“0”,“State”:“starting”,“Info”:“SHOW PROCESSLIST”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345150”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54634”,“db”:null,“Command”:“Query”,“Time”:“9”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345151”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54636”,“db”:null,“Command”:“Query”,“Time”:“9”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345152”,“User”:“admin”,“Host”:“localhost”,“db”:“banen”,“Command”:“Query”,“Time”:“7”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345155”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54664”,“db”:null,“Command”:“Query”,“Time”:“1”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345156”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54666”,“db”:null,“Command”:“Query”,“Time”:“1”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”}]

2021-01-18 20:01:56
[{“Id”:“54908570”,“User”:“system user”,“Host”:"",“db”:null,“Command”:“Connect”,“Time”:“299664”,“State”:“Queueing master event to the relay log”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“54908571”,“User”:“system user”,“Host”:"",“db”:null,“Command”:“Connect”,“Time”:“21”,“State”:“System lock”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“54908618”,“User”:“replication”,“Host”:“10.0.0.12:28894”,“db”:null,“Command”:“Binlog Dump”,“Time”:“299652”,“State”:“Master has sent all binlog to slave; waiting for more updates”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73314445”,“User”:“replication”,“Host”:“127.0.0.1:54910”,“db”:null,“Command”:“Binlog Dump”,“Time”:“21247”,“State”:“Master has sent all binlog to slave; waiting for more updates”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73317611”,“User”:“admin”,“Host”:“localhost”,“db”:“banen”,“Command”:“Query”,“Time”:“0”,“State”:“starting”,“Info”:“SHOW PROCESSLIST”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345150”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54634”,“db”:null,“Command”:“Query”,“Time”:“17”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345151”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54636”,“db”:null,“Command”:“Query”,“Time”:“17”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345152”,“User”:“admin”,“Host”:“localhost”,“db”:“banen”,“Command”:“Query”,“Time”:“15”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345155”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54664”,“db”:null,“Command”:“Query”,“Time”:“9”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345156”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54666”,“db”:null,“Command”:“Query”,“Time”:“9”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345158”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54688”,“db”:null,“Command”:“Query”,“Time”:“1”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345159”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54690”,“db”:null,“Command”:“Query”,“Time”:“1”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”}]

2021-01-18 20:03:47
[{“Id”:“54908570”,“User”:“system user”,“Host”:"",“db”:null,“Command”:“Connect”,“Time”:“299775”,“State”:“Queueing master event to the relay log”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“54908571”,“User”:“system user”,“Host”:"",“db”:null,“Command”:“Connect”,“Time”:“132”,“State”:“System lock”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“54908618”,“User”:“replication”,“Host”:“10.0.0.12:28894”,“db”:null,“Command”:“Binlog Dump”,“Time”:“299763”,“State”:“Master has sent all binlog to slave; waiting for more updates”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73314445”,“User”:“replication”,“Host”:“127.0.0.1:54910”,“db”:null,“Command”:“Binlog Dump”,“Time”:“21358”,“State”:“Master has sent all binlog to slave; waiting for more updates”,“Info”:null,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73317611”,“User”:“admin”,“Host”:“localhost”,“db”:“banen”,“Command”:“Query”,“Time”:“0”,“State”:“starting”,“Info”:“SHOW PROCESSLIST”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345150”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54634”,“db”:null,“Command”:“Query”,“Time”:“128”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345151”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54636”,“db”:null,“Command”:“Query”,“Time”:“128”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345152”,“User”:“admin”,“Host”:“localhost”,“db”:“banen”,“Command”:“Query”,“Time”:“126”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345155”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54664”,“db”:null,“Command”:“Query”,“Time”:“120”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345156”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54666”,“db”:null,“Command”:“Query”,“Time”:“120”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345158”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54688”,“db”:null,“Command”:“Query”,“Time”:“112”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345159”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54690”,“db”:null,“Command”:“Query”,“Time”:“112”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345162”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54720”,“db”:null,“Command”:“Query”,“Time”:“104”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345163”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54722”,“db”:null,“Command”:“Query”,“Time”:“104”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345165”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54750”,“db”:null,“Command”:“Query”,“Time”:“96”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345166”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54752”,“db”:null,“Command”:“Query”,“Time”:“96”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345169”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54778”,“db”:null,“Command”:“Query”,“Time”:“88”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345170”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54780”,“db”:null,“Command”:“Query”,“Time”:“88”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345173”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54808”,“db”:null,“Command”:“Query”,“Time”:“80”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345174”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54810”,“db”:null,“Command”:“Query”,“Time”:“80”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345176”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54838”,“db”:null,“Command”:“Query”,“Time”:“72”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345177”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54840”,“db”:null,“Command”:“Query”,“Time”:“72”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345180”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54866”,“db”:null,“Command”:“Query”,“Time”:“64”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345181”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54868”,“db”:null,“Command”:“Query”,“Time”:“64”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345183”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54892”,“db”:null,“Command”:“Query”,“Time”:“56”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345184”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54894”,“db”:null,“Command”:“Query”,“Time”:“56”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345187”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54924”,“db”:null,“Command”:“Query”,“Time”:“48”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345188”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54926”,“db”:null,“Command”:“Query”,“Time”:“48”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345191”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54950”,“db”:null,“Command”:“Query”,“Time”:“40”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345192”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54952”,“db”:null,“Command”:“Query”,“Time”:“40”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345194”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54978”,“db”:null,“Command”:“Query”,“Time”:“32”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345195”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:54980”,“db”:null,“Command”:“Query”,“Time”:“32”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345198”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:55010”,“db”:null,“Command”:“Query”,“Time”:“24”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345199”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:55012”,“db”:null,“Command”:“Query”,“Time”:“24”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345201”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:55034”,“db”:null,“Command”:“Query”,“Time”:“16”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345202”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:55036”,“db”:null,“Command”:“Query”,“Time”:“16”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345205”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:55066”,“db”:null,“Command”:“Query”,“Time”:“8”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”},{“Id”:“73345206”,“User”:“mmm_monitor”,“Host”:“10.0.0.2:55068”,“db”:null,“Command”:“Query”,“Time”:“8”,“State”:“starting”,“Info”:“SHOW SLAVE STATUS”,“Rows_sent”:“0”,“Rows_examined”:“0”}]

2021-01-18 20:03:47
[{“Slave_IO_State”:“Queueing master event to the relay log”,“Master_Host”:“10.0.0.12”,“Master_User”:“replication”,“Master_Port”:“3306”,“Connect_Retry”:“60”,“Master_Log_File”:“mysql-bin.000212”,“Read_Master_Log_Pos”:“199323503”,“Relay_Log_File”:“mysql-relay-bin.000594”,“Relay_Log_Pos”:“199314741”,“Relay_Master_Log_File”:“mysql-bin.000212”,“Slave_IO_Running”:“Yes”,“Slave_SQL_Running”:“Yes”,“Replicate_Do_DB”:"",“Replicate_Ignore_DB”:"",“Replicate_Do_Table”:"",“Replicate_Ignore_Table”:"",“Replicate_Wild_Do_Table”:"",“Replicate_Wild_Ignore_Table”:"",“Last_Errno”:“0”,“Last_Error”:"",“Skip_Counter”:“0”,“Exec_Master_Log_Pos”:“199314528”,“Relay_Log_Space”:“199323970”,“Until_Condition”:“None”,“Until_Log_File”:"",“Until_Log_Pos”:“0”,“Master_SSL_Allowed”:“No”,“Master_SSL_CA_File”:"",“Master_SSL_CA_Path”:"",“Master_SSL_Cert”:"",“Master_SSL_Cipher”:"",“Master_SSL_Key”:"",“Seconds_Behind_Master”:“132”,“Master_SSL_Verify_Server_Cert”:“No”,“Last_IO_Errno”:“0”,“Last_IO_Error”:"",“Last_SQL_Errno”:“0”,“Last_SQL_Error”:"",“Replicate_Ignore_Server_Ids”:"",“Master_Server_Id”:“101”,“Master_UUID”:“561e20ee-2f13-11eb-a8c0-001a4a6fbf77”,“Master_Info_File”:“mysql.slave_master_info”,“SQL_Delay”:“0”,“SQL_Remaining_Delay”:null,“Slave_SQL_Running_State”:“System lock”,“Master_Retry_Count”:“86400”,“Master_Bind”:"",“Last_IO_Error_Timestamp”:"",“Last_SQL_Error_Timestamp”:"",“Master_SSL_Crl”:"",“Master_SSL_Crlpath”:"",“Retrieved_Gtid_Set”:"",“Executed_Gtid_Set”:"",“Auto_Position”:“0”,“Replicate_Rewrite_DB”:"",“Channel_Name”:"",“Master_TLS_Version”:""}]

there is also nothing in the error log:

You clearly have some issue with replication status. I can only make educated guesses here. Each time SSS is ran, internal locks must acquired. Running SSS so frequently may be causing internal deadlocks (note I said internal deadlocks and not SQL/InnoDB deadlocks, huge difference, as internal’s can’t be tracked) which is eventually crashing the server.

I notice you have binlog_gtid_simple_recovery=false, which is non-default. Any specific reason for this? I also do not see any GTID related options.

As a final recommendation, I would switch to GTID-based replication, put the above parameter back to default (just remove it from my.cnf), and relax the frequency that which mmm polls the server. More long-term, switch off mmm and move to Orchestrator.

Sorry for not having a silver bullet solution. You can always get in touch with our support or professional services team for a more hands-on analysis.

1 Like

I did enable the GTID-based replication and removed that flag.
Now both the MySQL servers are running error free for 7 days, so i think that was it.

Thanks for your help!

1 Like