Master hanging with slave stuck on state init, info SAVEPOINT

We have a setup with one master running PXC and one slave. Everything works great except that occasionally one of the replication threads on the slave will hang on something like this.

| 39 | system user | | db | Sleep | 285 | init | SAVEPOINT s140016047687424_x45095 |

These can take up to 20 minutes to finish, and while they are running the master isn’t running any more commands and all of the threads on the master are waiting in wsrep_pre_commit. The only solution other than waiting is to restart mysql on the slave which drops the cluster to 1 node and the master starts working again, and the slave does an IST.

Using savepoints does not cause any problems for me normally. Can you share more information, i.e. what is the output of
show status like ‘ws%’;
from both master node and slave during such savepoint stalls?

We had it happen again and I’ve got these this time.

Main DB:

While down -

mysql> show status like '%wsrep%';
+------------------------------+--------------------------------------+
| Variable_name | Value |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid | 2f59a082-a179-11e2-0800-b01a353c36fc |
| wsrep_protocol_version | 5 |
| wsrep_last_committed | 231400086 |
| wsrep_replicated | 6140774 |
| wsrep_replicated_bytes | 15650843998 |
| wsrep_repl_keys | 58701660 |
| wsrep_repl_keys_bytes | 611089291 |
| wsrep_repl_data_bytes | 10476034924 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 49991 |
| wsrep_received_bytes | 400270 |
| wsrep_local_commits | 6140774 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 190 |
| wsrep_local_send_queue_avg | 0.009599 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.005021 |
| wsrep_local_cached_downto | 231367756 |
| wsrep_flow_control_paused_ns | 601363003898 |
| wsrep_flow_control_paused | 0.001168 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 33 |
| wsrep_cert_deps_distance | 33.923281 |
| wsrep_apply_oooe | 0.009462 |
| wsrep_apply_oool | 0.000005 |
| wsrep_apply_window | 1.013666 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.004106 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 184182 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.025854 |
| wsrep_incoming_addresses | ,10.0.11.32:3306 |
| wsrep_cluster_conf_id | 2 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | 2f59a082-a179-11e2-0800-b01a353c36fc |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info&#64;codership.com> |
| wsrep_provider_version | 3.5(r178) |
| wsrep_ready | ON |
+------------------------------+--------------------------------------+
47 rows in set (0.00 sec)

Some of the 200+ queries stuck in this state -

| 9476801 | dbuser | 10.0.11.24:56432 | dbname | Query | 179 | wsrep in pre-commit stage | commit
| 0 | 0 |
| 9484122 | dbuser | 10.0.11.27:35341 | dbname | Query | 179 | wsrep in pre-commit stage | commit
| 0 | 0 |
| 9484123 | dbuser | 10.0.11.27:35344 | dbname | Query | 179 | wsrep in pre-commit stage | commit
| 0 | 0 |
| 9484124 | dbuser | 10.0.11.26:49171 | dbname | Query | 179 | wsrep in pre-commit stage | commit
| 0 | 0 |
| 9484125 | dbuser | 10.0.11.25:39560 | dbname | Query | 179 | wsrep in pre-commit stage | commit
| 0 | 0 |
| 9484131 | dbuser | 10.0.11.24:33790 | dbname | Query | 178 | wsrep in pre-commit stage | INSERT INTO tb1 (g_id, q, r, r_short) VALUES (28592235, 'xe0
t | 0 | 0 |
| 9484134 | dbuser | 10.0.11.24:33799 | dbname | Query | 178 | wsrep in pre-commit stage | commit
| 0 | 0 |
| 9484140 | dbuser | 10.0.11.26:49184 | dbname | Query | 178 | wsrep in pre-commit stage | INSERT INTO tb1 (g_id, q, r, r_short) VALUES
(28598937, 'ik9v | 0 | 0 |
| 9484143 | dbuser | 10.0.11.24:33806 | dbname | Query | 178 | wsrep in pre-commit stage | commit
| 0 | 0 |
| 9484148 | dbuser | 10.0.11.26:49189 | dbname | Query | 178 | wsrep in pre-commit stage | commit
| 0 | 0 |
| 9484149 | dbuser | 10.0.11.27:35363 | dbname | Query | 178 | wsrep in pre-commit stage | commit
| 0 | 0 |
| 9484151 | dbuser | 10.0.11.25:39572 | dbname | Query | 178 | wsrep in pre-commit stage | commit
| 0 | 0 |

Once back up -

mysql> show status like '%wsrep%';
+------------------------------+--------------------------------------+
| Variable_name | Value |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid | 2f59a082-a179-11e2-0800-b01a353c36fc |
| wsrep_protocol_version | 5 |
| wsrep_last_committed | 231403297 |
| wsrep_replicated | 6143985 |
| wsrep_replicated_bytes | 15655416221 |
| wsrep_repl_keys | 58733820 |
| wsrep_repl_keys_bytes | 611420652 |
| wsrep_repl_data_bytes | 10480070282 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 50023 |
| wsrep_received_bytes | 400636 |
| wsrep_local_commits | 6143985 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.009632 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.005118 |
| wsrep_local_cached_downto | 231367901 |
| wsrep_flow_control_paused_ns | 611924764021 |
| wsrep_flow_control_paused | 0.001188 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 33 |
| wsrep_cert_deps_distance | 33.932388 |
| wsrep_apply_oooe | 0.009558 |
| wsrep_apply_oool | 0.000006 |
| wsrep_apply_window | 1.017967 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.008289 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 353 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.030719 |
| wsrep_incoming_addresses | |
| wsrep_cluster_conf_id | 3 |
| wsrep_cluster_size | 1 |
| wsrep_cluster_state_uuid | 2f59a082-a179-11e2-0800-b01a353c36fc |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info&#64;codership.com> |
| wsrep_provider_version | 3.5(r178) |
| wsrep_ready | ON |
+------------------------------+--------------------------------------+
47 rows in set (0.00 sec)

Slave Only DB:

mysql> show status like '%wsrep%';
+------------------------------+--------------------------------------+
| Variable_name | Value |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid | 2f59a082-a179-11e2-0800-b01a353c36fc |
| wsrep_protocol_version | 5 |
| wsrep_last_committed | 231399533 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 6189064 |
| wsrep_received_bytes | 15648774742 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 513 |
| wsrep_local_recv_queue_avg | 448.466853 |
| wsrep_local_cached_downto | 231367756 |
| wsrep_flow_control_paused_ns | 588557767118 |
| wsrep_flow_control_paused | 0.001143 |
| wsrep_flow_control_sent | 33 |
| wsrep_flow_control_recv | 33 |
| wsrep_cert_deps_distance | 60.411327 |
| wsrep_apply_oooe | 0.015168 |
| wsrep_apply_oool | 0.000021 |
| wsrep_apply_window | 1.174553 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.155497 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 182827 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.025856 |
| wsrep_incoming_addresses | ,10.0.11.32:3306 |
| wsrep_cluster_conf_id | 2 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | 2f59a082-a179-11e2-0800-b01a353c36fc |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info&#64;codership.com> |
| wsrep_provider_version | 3.5(r178) |
| wsrep_ready | ON |
+------------------------------+--------------------------------------+
47 rows in set (0.00 sec)

Back up and resynced -

+------------------------------+--------------------------------------+
| Variable_name | Value |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid | 2f59a082-a179-11e2-0800-b01a353c36fc |
| wsrep_protocol_version | 5 |
| wsrep_last_committed | 231414410 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 4365 |
| wsrep_received_bytes | 8413649 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 71.833219 |
| wsrep_local_cached_downto | 231410091 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 64.214583 |
| wsrep_apply_oooe | 0.618891 |
| wsrep_apply_oool | 0.000070 |
| wsrep_apply_window | 29.812273 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 24.058852 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 34 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.041435 |
| wsrep_incoming_addresses | 10.0.11.32:3306, |
| wsrep_cluster_conf_id | 4 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | 2f59a082-a179-11e2-0800-b01a353c36fc |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info&#64;codership.com> |
| wsrep_provider_version | 3.5(r178) |
| wsrep_ready | ON |
+------------------------------+--------------------------------------+
47 rows in set (0.00 sec)

Slave Only Continued -

Processes -

mysql> show processlist;
+------+-------------+-----------------+------------+---------+--------+-----------------------------+-------------------------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+-------------+-----------------+------------+---------+--------+-----------------------------+-------------------------------------+-----------+---------------+
| 1 | system user | | NULL | Sleep | 514749 | wsrep aborter idle | NULL | 0 | 0 |
| 2 | system user | | NULL | Sleep | 237 | applied write set 231399545 | NULL | 0 | 0 |
| 3 | system user | | NULL | Sleep | 238 | applied write set 231399540 | NULL | 0 | 0 |
| 4 | system user | | NULL | Sleep | 239 | committed 231399527 | NULL | 0 | 0 |
| 5 | system user | | NULL | Sleep | 238 | applied write set 231399536 | NULL | 0 | 0 |
| 6 | system user | | NULL | Sleep | 239 | committed 231399532 | NULL | 0 | 0 |
| 7 | system user | | dbname| Sleep | 439 | init | SAVEPOINT `s140320508479232_x53660` | 0 | 0 |
| 8 | system user | | NULL | Sleep | 238 | applied write set 231399539 | NULL | 0 | 0 |
| 9 | system user | | NULL | Sleep | 238 | applied write set 231399541 | NULL | 0 | 0 |
| 10 | system user | | NULL | Sleep | 240 | committed 231399516 | NULL | 0 | 0 |
| 11 | system user | | NULL | Sleep | 237 | applied write set 231399547 | NULL | 0 | 0 |
| 12 | system user | | NULL | Sleep | 240 | committed 231399517 | NULL | 0 | 0 |
| 13 | system user | | NULL | Sleep | 236 | applied write set 231399565 | NULL | 0 | 0 |
| 14 | system user | | NULL | Sleep | 238 | applied write set 231399538 | NULL | 0 | 0 |
| 15 | system user | | NULL | Sleep | 235 | applied write set 231399571 | NULL | 0 | 0 |
| 16 | system user | | NULL | Sleep | 237 | applied write set 231399551 | NULL | 0 | 0 |
| 17 | system user | | NULL | Sleep | 240 | committed 231399515 | NULL | 0 | 0 |
| 18 | system user | | NULL | Sleep | 240 | committed 231399509 | NULL | 0 | 0 |
| 19 | system user | | NULL | Sleep | 241 | committed 231399502 | NULL | 0 | 0 |
| 20 | system user | | NULL | Sleep | 236 | applied write set 231399562 | NULL | 0 | 0 |
| 21 | system user | | NULL | Sleep | 235 | applied write set 231399566 | NULL | 0 | 0 |
| 22 | system user | | NULL | Sleep | 237 | applied write set 231399544 | NULL | 0 | 0 |
| 23 | system user | | NULL | Sleep | 237 | applied write set 231399553 | NULL | 0 | 0 |
| 24 | system user | | NULL | Sleep | 240 | committed 231399524 | NULL | 0 | 0 |
| 25 | system user | | NULL | Sleep | 235 | applied write set 231399568 | NULL | 0 | 0 |
| 26 | system user | | NULL | Sleep | 239 | applied write set 231399535 | NULL | 0 | 0 |
| 27 | system user | | NULL | Sleep | 240 | committed 231399523 | NULL | 0 | 0 |
| 28 | system user | | NULL | Sleep | 240 | committed 231399508 | NULL | 0 | 0 |
| 29 | system user | | NULL | Sleep | 236 | applied write set 231399559 | NULL | 0 | 0 |
| 30 | system user | | NULL | Sleep | 235 | applied write set 231399573 | NULL | 0 | 0 |
| 31 | system user | | NULL | Sleep | 236 | applied write set 231399558 | NULL | 0 | 0 |
| 32 | system user | | NULL | Sleep | 238 | applied write set 231399537 | NULL | 0 | 0 |
| 33 | system user | | NULL | Sleep | 239 | committed 231399529 | NULL | 0 | 0 |
| 34 | system user | | NULL | Sleep | 237 | applied write set 231399550 | NULL | 0 | 0 |
| 35 | system user | | NULL | Sleep | 237 | applied write set 231399546 | NULL | 0 | 0 |
| 36 | system user | | NULL | Sleep | 241 | committed 231399503 | NULL | 0 | 0 |
| 37 | system user | | NULL | Sleep | 239 | committed 231399530 | NULL | 0 | 0 |
| 38 | system user | | NULL | Sleep | 236 | applied write set 231399560 | NULL | 0 | 0 |
| 39 | system user | | NULL | Sleep | 240 | committed 231399514 | NULL | 0 | 0 |
| 40 | system user | | NULL | Sleep | 237 | applied write set 231399552 | NULL | 0 | 0 |
| 41 | system user | | NULL | Sleep | 236 | applied write set 231399561 | NULL | 0 | 0 |
| 42 | newrelic | 127.0.0.1:56357 | NULL | Sleep | 7 | | NULL | 450 | 450 |
| 8299 | root | localhost | NULL | Query | 0 | init | show processlist | 0 | 0 |
+------+-------------+-----------------+------------+---------+--------+-----------------------------+-------------------------------------+-----------+---------------+
43 rows in set (0.00 sec)