Insert takes 13 seconds

From the slow log analysis:

Attribute pct total min max avg 95% stddev median

============ === ======= ======= ======= ======= ======= ======= =======

Count 2 1

Exec time 18 13s 13s 13s 13s 13s 0 13s

Lock time 1 57us 57us 57us 57us 57us 0 57us

Rows sent 0 0 0 0 0 0 0 0

Rows examine 0 0 0 0 0 0 0 0

Rows affecte 4 1 1 1 1 1 0 1

Rows read 0 0 0 0 0 0 0 0

Bytes sent 0 13 13 13 13 13 0 13

Query size 0 321 321 321 321 321 0 321

#---------------------------------------------------------------------------------------------

SHOW TABLE STATUS FROM ***** LIKE ‘chat_messages’\G

SHOW CREATE TABLE *****.chat_messages\G

INSERT INTO *****.chat_messages (username, profiles_profile_id, show_character_id, chat_room_id, body, system, prefix, to_character_id, to_character_name, created) VALUES (‘sdsdfg’, NULL, 49, 3, 'sdfgsdfgxcfgsdfg ', ‘1’, ‘thinks’, NULL, NULL, ‘2013-10-02 08:29:02’)\G

These are all the indexes on the table: [TABLE=“class: wrapper”]
[TR]
[TD] [TABLE=“class: ui_table sortable ui_columns, width: 1253”]
[TR=“class: mainbody row1, bgcolor: #F8F8FA”]
[TD]chat_messages[/TD]
[TD]0[/TD]
[TD]PRIMARY[/TD]
[TD]1[/TD]
[TD]id[/TD]
[TD]A[/TD]
[TD]28718[/TD]
[TD] [/TD]
[TD] [/TD]
[TD] [/TD]
[TD]BTREE[/TD]
[TD] [/TD]
[TD] [/TD]
[/TR]
[TR=“class: mainbody row0, bgcolor: #E8E8EA”]
[TD]chat_messages[/TD]
[TD]1[/TD]
[TD]IND_ChatMessages_Created[/TD]
[TD]1[/TD]
[TD]created[/TD]
[TD]A[/TD]
[TD]28718[/TD]
[TD] [/TD]
[TD] [/TD]
[TD] [/TD]
[TD]BTREE[/TD]
[TD] [/TD]
[TD] [/TD]
[/TR]
[TR=“class: mainhigh, bgcolor: #FFFFBB”]
[TD]chat_messages[/TD]
[TD]1[/TD]
[TD]IND_ChatMessages_Username[/TD]
[TD]1[/TD]
[TD]username[/TD]
[TD]A[/TD]
[TD]9572[/TD]
[TD] [/TD]
[TD] [/TD]
[TD]YES[/TD]
[TD]BTREE[/TD]
[TD] [/TD]
[TD] [/TD]
[/TR]
[TR=“class: mainbody row0, bgcolor: #E8E8EA”]
[TD]chat_messages[/TD]
[TD]1[/TD]
[TD]FK_ChatMessages_ChatRoomId[/TD]
[TD]1[/TD]
[TD]chat_room_id[/TD]
[TD]A[/TD]
[TD]1[/TD]
[TD] [/TD]
[TD] [/TD]
[TD] [/TD]
[TD]BTREE[/TD]
[TD] [/TD]
[TD] [/TD]
[/TR]
[TR=“class: mainbody row1, bgcolor: #F8F8FA”]
[TD]chat_messages[/TD]
[TD]1[/TD]
[TD]IND_ChatMessages_Broadcast[/TD]
[TD]1[/TD]
[TD]broadcast[/TD]
[TD]A[/TD]
[TD]1[/TD]
[TD] [/TD]
[TD] [/TD]
[TD] [/TD]
[TD]BTREE[/TD]
[TD] [/TD]
[TD] [/TD]
[/TR]
[TR=“class: mainbody row0, bgcolor: #E8E8EA”]
[TD]chat_messages[/TD]
[TD]1[/TD]
[TD]FK_ChatMessages_ProfliesProfileId[/TD]
[TD]1[/TD]
[TD]profiles_profile_id[/TD]
[TD]A[/TD]
[TD]9572[/TD]
[TD] [/TD]
[TD] [/TD]
[TD]YES[/TD]
[TD]BTREE[/TD]
[TD] [/TD]
[TD] [/TD]
[/TR]
[TR=“class: mainbody row1, bgcolor: #F8F8FA”]
[TD]chat_messages[/TD]
[TD]1[/TD]
[TD]FK_ChatMessages_ShowCharacterId[/TD]
[TD]1[/TD]
[TD]show_character_id[/TD]
[TD]A[/TD]
[TD]106[/TD]
[TD] [/TD]
[TD] [/TD]
[TD]YES[/TD]
[TD]BTREE[/TD]
[TD] [/TD]
[TD] [/TD]
[/TR]
[/TABLE]
[/TD]
[/TR]
[/TABLE]

The system was part of a three way cluster which froze up last Saturday. I had to restart Percona in bootstrap mode to get it up and am still running on one system only, in bootstrap mode. I want to get these slow queries sorted out before getting the cluster back up, because I think it was the main cause for the cluster failure.
The table has 28718 rows. Any ideas why inserting a new row takes so long?

I’ve moved this thread to PXC Cluster category since apparently we are dealing with XtraDB Cluster here, correct?
This is InnoDB table, right?
By “froze up” you mean no query could execute on any node during this time?

For that incident the slow log isn’t very much helpful. If you had log_slow_verbosity=profiling then maybe we could see more. In this case original slow log entry would be needed.

But this incident should have some corresponding info in the MySQL error logs. Can you check on all 3 nodes if you can see anything meaningful? Also the processlist snapshot from that moment would be helpful. You can capture it automatically if this kind of issue appears with help of pt-stalk tool btw.

All these tables are innoBD tables.

The server crashed completely last night. Froze and then it got stuck on GRUB bootloader on reboot and thus the whole site was off for 6 hours. Fixed at 2.30am by on-site engineer. It is an enterprize-class vm with 32GB RAM and 12 cores. Load average: 0.27 and ram usage 1.8GB out of 32GB…

Here is the original slow log of that incident; there were many similar ones:

SET timestamp=1380695355;
INSERT INTO ####2.chat_messages (username, profiles_profile_id, show_character_id, chat_room_id, body, system, prefix, to_character_id, to_character_name, created) VALUES (‘asdfasdfasdf’, NULL, 49, 3, ‘asdfasdfasdfasdf’, ‘1’, ‘thinks’, NULL, NULL, ‘2013-10-02 08:29:02’);

Time: 131002 8:29:16

User@Host: [###] @ [10.0.0.3]

Thread_id: 1141351 Schema: ### Last_errno: 0 Killed: 0

Query_time: 10.851373 Lock_time: 0.000059 Rows_sent: 0 Rows_examined: 1 Rows_affected: 1 Rows_read: 1

Bytes_sent: 52

Another example: here is a simple update that took 5 seconds directly after the previous etry:
SET timestamp=1380695356;
UPDATE ####2.profiles_profiles SET latest_activity_type = ‘chat_room_joined’, modified = ‘2013-10-02 08:29:05’ WHERE ###2.profiles_profiles.id = ‘64’;

User@Host: [###r] @ [10.0.0.7]

Thread_id: 1141366 Schema: ### Last_errno: 0 Killed: 0

Query_time: 5.636290 Lock_time: 0.000044 Rows_sent: 0 Rows_examined: 1 Rows_affected: 1 Rows_read: 1

Bytes_sent: 52

I did a new slow log capture after the reboot, and some inserts still took 1.4 seconds:

SET timestamp=1380781348;
INSERT INTO ###2.profiles_profiles (gender, modified, created) VALUES (NULL, ‘2013-10-03 08:22:26’, ‘2013-10-03 08:22:26’);

User@Host: [###] @ [10.0.0.10]

Thread_id: 50302 Schema: ### Last_errno: 1364 Killed: 0

Query_time: 1.423764 Lock_time: 0.000117 Rows_sent: 0 Rows_examined: 0 Rows_affected: 1 Rows_read: 0

Bytes_sent: 14

SET timestamp=1380781348;
INSERT INTO ###2.profiles_profiles (gender, modified, created) VALUES (NULL, ‘2013-10-03 08:22:26’, ‘2013-10-03 08:22:26’);

User@Host: [###r] @ [10.0.0.3]

Thread_id: 50295 Schema: ###2 Last_errno: 1048 Killed: 0

Query_time: 0.245315 Lock_time: 0.000055 Rows_sent: 0 Rows_examined: 1 Rows_affected: 1 Rows_read: 1

Bytes_sent: 52

I was running a cluster, but the other two systems are not part of it anymore - does running in bootstrap mode impact performance?

Current global status:
±-----------------------------------------±-------------------------------------+
| Variable_name | Value |
±-----------------------------------------±-------------------------------------+
| Aborted_clients | 2 |
| Aborted_connects | 5 |
| Binlog_cache_disk_use | 0 |
| Binlog_cache_use | 598 |
| Binlog_stmt_cache_disk_use | 0 |
| Binlog_stmt_cache_use | 0 |
| Bytes_received | 72063396 |
| Bytes_sent | 490881621 |
| Com_admin_commands | 0 |
| Com_assign_to_keycache | 0 |
| Com_alter_db | 0 |
| Com_alter_db_upgrade | 0 |
| Com_alter_event | 0 |
| Com_alter_function | 0 |
| Com_alter_procedure | 0 |
| Com_alter_server | 0 |
| Com_alter_table | 0 |
| Com_alter_tablespace | 0 |
| Com_analyze | 0 |
| Com_begin | 60 |
| Com_binlog | 0 |
| Com_call_procedure | 0 |
| Com_change_db | 15 |
| Com_change_master | 0 |
| Com_check | 0 |
| Com_checksum | 0 |
| Com_commit | 54 |
| Com_create_db | 0 |
| Com_create_event | 0 |
| Com_create_function | 0 |
| Com_create_index | 12 |
| Com_create_procedure | 0 |
| Com_create_server | 0 |
| Com_create_table | 6 |
| Com_create_trigger | 0 |
| Com_create_udf | 0 |
| Com_create_user | 0 |
| Com_create_view | 0 |
| Com_dealloc_sql | 24 |
| Com_delete | 102 |
| Com_delete_multi | 8 |
| Com_do | 0 |
| Com_drop_db | 0 |
| Com_drop_event | 0 |
| Com_drop_function | 0 |
| Com_drop_index | 0 |
| Com_drop_procedure | 0 |
| Com_drop_server | 0 |
| Com_drop_table | 0 |
| Com_drop_trigger | 0 |
| Com_drop_user | 0 |
| Com_drop_view | 0 |
| Com_empty_query | 0 |
| Com_execute_sql | 24 |
| Com_flush | 0 |
| Com_grant | 0 |
| Com_ha_close | 0 |
| Com_ha_open | 0 |
| Com_ha_read | 0 |
| Com_help | 0 |
| Com_insert | 401 |
| Com_insert_select | 19 |
| Com_install_plugin | 0 |
| Com_kill | 0 |
| Com_load | 6 |
| Com_lock_tables | 0 |
| Com_optimize | 0 |
| Com_preload_keys | 0 |
| Com_prepare_sql | 24 |
| Com_purge | 0 |
| Com_purge_before_date | 0 |
| Com_release_savepoint | 0 |
| Com_rename_table | 0 |
| Com_rename_user | 0 |
| Com_repair | 0 |
| Com_replace | 0 |
| Com_replace_select | 0 |
| Com_reset | 0 |
| Com_resignal | 0 |
| Com_revoke | 0 |
| Com_revoke_all | 0 |
| Com_rollback | 6 |
| Com_rollback_to_savepoint | 0 |
| Com_savepoint | 0 |
| Com_select | 83630 |
| Com_set_option | 14246 |
| Com_signal | 0 |
| Com_show_authors | 0 |
| Com_show_binlog_events | 0 |
| Com_show_binlogs | 0 |
| Com_show_charsets | 0 |
| Com_show_client_statistics | 0 |
| Com_show_collations | 0 |
| Com_show_contributors | 0 |
| Com_show_create_db | 0 |
| Com_show_create_event | 0 |
| Com_show_create_func | 0 |
| Com_show_create_proc | 0 |
| Com_show_create_table | 0 |
| Com_show_create_trigger | 0 |
| Com_show_databases | 1 |
| Com_show_engine_logs | 0 |
| Com_show_engine_mutex | 0 |
| Com_show_engine_status | 0 |
| Com_show_events | 0 |
| Com_show_errors | 0 |
| Com_show_fields | 3196 |
| Com_show_function_status | 0 |
| Com_show_grants | 0 |
| Com_show_index_statistics | 0 |
| Com_show_keys | 114 |
| Com_show_master_status | 0 |
| Com_show_open_tables | 0 |
| Com_show_plugins | 0 |
| Com_show_privileges | 0 |
| Com_show_procedure_status | 0 |
| Com_show_processlist | 0 |
| Com_show_profile | 0 |
| Com_show_profiles | 0 |
| Com_show_relaylog_events | 0 |
| Com_show_slave_hosts | 0 |
| Com_show_slave_status | 0 |
| Com_show_slave_status_nolock | 0 |
| Com_show_status | 40097 |
| Com_show_storage_engines | 0 |
| Com_show_table_statistics | 0 |
| Com_show_table_status | 1 |
| Com_show_tables | 888 |
| Com_show_temporary_tables | 0 |
| Com_show_thread_statistics | 0 |
| Com_show_triggers | 0 |
| Com_show_user_statistics | 0 |
| Com_show_variables | 6 |
| Com_show_warnings | 0 |
| Com_slave_start | 0 |
| Com_slave_stop | 0 |
| Com_stmt_close | 24 |
| Com_stmt_execute | 24 |
| Com_stmt_fetch | 0 |
| Com_stmt_prepare | 24 |
| Com_stmt_reprepare | 0 |
| Com_stmt_reset | 0 |
| Com_stmt_send_long_data | 0 |
| Com_truncate | 6 |
| Com_uninstall_plugin | 0 |
| Com_unlock_tables | 0 |
| Com_update | 227 |
| Com_update_multi | 45 |
| Com_xa_commit | 0 |
| Com_xa_end | 0 |
| Com_xa_prepare | 0 |
| Com_xa_recover | 0 |
| Com_xa_rollback | 0 |
| Com_xa_start | 0 |
| Compression | OFF |
| Connections | 54296 |
| Created_tmp_disk_tables | 3418 |
| Created_tmp_files | 31 |
| Created_tmp_tables | 46206 |
| Delayed_errors | 0 |
| Delayed_insert_threads | 0 |
| Delayed_writes | 0 |
| Flashcache_enabled | OFF |
| Flush_commands | 1 |
| Handler_commit | 86387 |
| Handler_delete | 32 |
| Handler_discover | 0 |
| Handler_prepare | 2057 |
| Handler_read_first | 5743 |
| Handler_read_key | 1248365 |
| Handler_read_last | 246 |
| Handler_read_next | 3579643 |
| Handler_read_prev | 12283 |
| Handler_read_rnd | 69947 |
| Handler_read_rnd_next | 59906106 |
| Handler_rollback | 26 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_update | 449840 |
| Handler_write | 624521 |
| Innodb_adaptive_hash_cells | 276671 |
| Innodb_adaptive_hash_heap_buffers | 204 |
| Innodb_adaptive_hash_hash_searches | 1457179 |
| Innodb_adaptive_hash_non_hash_searches | 257712 |
| Innodb_background_log_sync | 17765 |
| Innodb_buffer_pool_pages_data | 7276 |
| Innodb_buffer_pool_bytes_data | 238419968 |
| Innodb_buffer_pool_pages_dirty | 0 |
| Innodb_buffer_pool_bytes_dirty | 0 |
| Innodb_buffer_pool_pages_flushed | 3403 |
| Innodb_buffer_pool_pages_LRU_flushed | 0 |
| Innodb_buffer_pool_pages_free | 710 |
| Innodb_buffer_pool_pages_made_not_young | 0 |
| Innodb_buffer_pool_pages_made_young | 6 |
| Innodb_buffer_pool_pages_misc | 205 |
| Innodb_buffer_pool_pages_old | 2705 |
| Innodb_buffer_pool_pages_total | 8191 |
| Innodb_buffer_pool_read_ahead_rnd | 0 |
| Innodb_buffer_pool_read_ahead | 3599 |
| Innodb_buffer_pool_read_ahead_evicted | 0 |
| Innodb_buffer_pool_read_requests | 15151289 |
| Innodb_buffer_pool_reads | 3597 |
| Innodb_buffer_pool_wait_free | 0 |
| Innodb_buffer_pool_write_requests | 11135 |
| Innodb_checkpoint_age | 0 |
| Innodb_checkpoint_max_age | 7782360 |
| Innodb_checkpoint_target_age | 7539162 |
| Innodb_data_fsyncs | 2703 |
| Innodb_data_pending_fsyncs | 0 |
| Innodb_data_pending_reads | 0 |
| Innodb_data_pending_writes | 0 |
| Innodb_data_read | 122834944 |
| Innodb_data_reads | 7403 |
| Innodb_data_writes | 5741 |
| Innodb_data_written | 113518080 |
| Innodb_dblwr_pages_written | 3403 |
| Innodb_dblwr_writes | 367 |
| Innodb_deadlocks | 0 |
| Innodb_dict_tables | 91 |
| Innodb_have_atomic_builtins | ON |
| Innodb_history_list_length | 906 |
| Innodb_ibuf_discarded_delete_marks | 0 |
| Innodb_ibuf_discarded_deletes | 0 |
| Innodb_ibuf_discarded_inserts | 0 |
| Innodb_ibuf_free_list | 3 |
| Innodb_ibuf_merged_delete_marks | 1 |
| Innodb_ibuf_merged_deletes | 0 |
| Innodb_ibuf_merged_inserts | 40 |
| Innodb_ibuf_merges | 41 |
| Innodb_ibuf_segment_size | 5 |
| Innodb_ibuf_size | 1 |
| Innodb_log_waits | 0 |
| Innodb_log_write_requests | 2177 |
| Innodb_log_writes | 1602 |
| Innodb_lsn_current | 25002719406 |
| Innodb_lsn_flushed | 25002719406 |
| Innodb_lsn_last_checkpoint | 25002719406 |
| Innodb_master_thread_1_second_loops | 15231 |
| Innodb_master_thread_10_second_loops | 1407 |
| Innodb_master_thread_background_loops | 1159 |
| Innodb_master_thread_main_flush_loops | 1159 |
| Innodb_master_thread_sleeps | 15224 |
| Innodb_max_trx_id | 23454578 |
| Innodb_mem_adaptive_hash | 5576304 |
| Innodb_mem_dictionary | 1187505 |
| Innodb_mem_total | 137822208 |
| Innodb_mutex_os_waits | 50 |
| Innodb_mutex_spin_rounds | 20268 |
| Innodb_mutex_spin_waits | 27322 |
| Innodb_oldest_view_low_limit_trx_id | 23454578 |
| Innodb_os_log_fsyncs | 1970 |
| Innodb_os_log_pending_fsyncs | 0 |
| Innodb_os_log_pending_writes | 0 |
| Innodb_os_log_written | 1820672 |
| Innodb_page_size | 16384 |
| Innodb_pages_created | 41 |
| Innodb_pages_read | 7235 |
| Innodb_pages_written | 3403 |
| Innodb_purge_trx_id | 23454397 |
| Innodb_purge_undo_no | 0 |
| Innodb_row_lock_current_waits | 0 |
| Innodb_current_row_locks | 0 |
| Innodb_row_lock_time | 1 |
| Innodb_row_lock_time_avg | 1 |
| Innodb_row_lock_time_max | 1 |
| Innodb_row_lock_waits | 1 |
| Innodb_rows_deleted | 32 |
| Innodb_rows_inserted | 598 |
| Innodb_rows_read | 63568506 |
| Innodb_rows_updated | 434 |
| Innodb_read_views_memory | 112 |
| Innodb_descriptors_memory | 8000 |
| Innodb_s_lock_os_waits | 377 |
| Innodb_s_lock_spin_rounds | 12629 |
| Innodb_s_lock_spin_waits | 956 |
| Innodb_truncated_status_writes | 0 |
| Innodb_x_lock_os_waits | 14 |
| Innodb_x_lock_spin_rounds | 1427 |
| Innodb_x_lock_spin_waits | 82 |
| Key_blocks_not_flushed | 0 |
| Key_blocks_unused | 13396 |
| Key_blocks_used | 1 |
| Key_read_requests | 204 |
| Key_reads | 0 |
| Key_write_requests | 342 |
| Key_writes | 0 |
| Last_query_cost | 0.000000 |
| Max_used_connections | 13 |
| Not_flushed_delayed_rows | 0 |
| Open_files | 20 |
| Open_streams | 0 |
| Open_table_definitions | 111 |
| Open_tables | 123 |
| Opened_files | 14312 |
| Opened_table_definitions | 147 |
| Opened_tables | 166 |
| Performance_schema_cond_classes_lost | 0 |
| Performance_schema_cond_instances_lost | 0 |
| Performance_schema_file_classes_lost | 0 |
| Performance_schema_file_handles_lost | 0 |
| Performance_schema_file_instances_lost | 0 |
| Performance_schema_locker_lost | 0 |
| Performance_schema_mutex_classes_lost | 0 |
| Performance_schema_mutex_instances_lost | 0 |
| Performance_schema_rwlock_classes_lost | 0 |
| Performance_schema_rwlock_instances_lost | 0 |
| Performance_schema_table_handles_lost | 0 |
| Performance_schema_table_instances_lost | 0 |
| Performance_schema_thread_classes_lost | 0 |
| Performance_schema_thread_instances_lost | 0 |
| Prepared_stmt_count | 0 |
| Qcache_free_blocks | 1532 |
| Qcache_free_memory | 3449472 |
| Qcache_hits | 207252 |
| Qcache_inserts | 38014 |
| Qcache_lowmem_prunes | 723 |
| Qcache_not_cached | 45459 |
| Qcache_queries_in_cache | 3102 |
| Qcache_total_blocks | 7825 |
| Queries | 404763 |
| Questions | 404760 |
| Rpl_status | AUTH_MASTER |
| Select_full_join | 220 |
| Select_full_range_join | 0 |
| Select_range | 4587 |
| Select_range_check | 0 |
| Select_scan | 50886 |
| Slave_heartbeat_period | 0.000 |
| Slave_open_temp_tables | 0 |
| Slave_received_heartbeats | 0 |
| Slave_retried_transactions | 0 |
| Slave_running | OFF |
| Slow_launch_threads | 0 |
| Slow_queries | 177 |
| Sort_merge_passes | 25 |
| Sort_range | 3040 |
| Sort_rows | 114929 |
| Sort_scan | 3730 |
| Ssl_accept_renegotiates | 0 |
| Ssl_accepts | 0 |
| Ssl_callback_cache_hits | 0 |
| Ssl_cipher | |
| Ssl_cipher_list | |
| Ssl_client_connects | 0 |
| Ssl_connect_renegotiates | 0 |
| Ssl_ctx_verify_depth | 0 |
| Ssl_ctx_verify_mode | 0 |
| Ssl_default_timeout | 0 |
| Ssl_finished_accepts | 0 |
| Ssl_finished_connects | 0 |
| Ssl_session_cache_hits | 0 |
| Ssl_session_cache_misses | 0 |
| Ssl_session_cache_mode | NONE |
| Ssl_session_cache_overflows | 0 |
| Ssl_session_cache_size | 0 |
| Ssl_session_cache_timeouts | 0 |
| Ssl_sessions_reused | 0 |
| Ssl_used_session_cache_entries | 0 |
| Ssl_verify_depth | 0 |
| Ssl_verify_mode | 0 |
| Ssl_version | |
| Table_locks_immediate | 52485 |
| Table_locks_waited | 0 |
| Tc_log_max_pages_used | 0 |
| Tc_log_page_size | 0 |
| Tc_log_page_waits | 0 |
| Threadpool_idle_threads | 0 |
| Threadpool_threads | 0 |
| Threads_cached | 8 |
| Threads_connected | 3 |
| Threads_created | 1280 |
| Threads_running | 1 |
| Uptime | 22490 |
| Uptime_since_flush_status | 22490 |
| binlog_commits | 0 |
| binlog_group_commits | 0 |
| wsrep_local_state_uuid | b0228e2d-2bc2-11e3-859e-0xfcxfghdfh |
| wsrep_protocol_version | 4 |
| wsrep_last_committed | 616 |
| wsrep_replicated | 616 |
| wsrep_replicated_bytes | 659786 |
| wsrep_received | 3 |
| wsrep_received_bytes | 138 |
| wsrep_local_commits | 598 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 45.366667 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 233 |
| wsrep_causal_reads | 0 |
| wsrep_incoming_addresses | 10.0.0.8:3306 |
| wsrep_cluster_conf_id | 1 |
| wsrep_cluster_size | 1 |
| wsrep_cluster_state_uuid | b0228e2d-2bc2-11e3-859e-0dzfgsdfg |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 2.6(r152) |
| wsrep_ready | ON |
±-----------------------------------------±-------------------------------------+

My.cnf is stock standard except for setting up node names, addresses and SST auth.

Just had a look at the mysql error log.
I have a lot of these before the cluster crashed and I went over to bootstrap mode:

130926 20:43:18 [Warning] Too many connections
----I have since substantially improved front-end caching and should send less than half the traffic to the databases now.

Saturday morning the cluster became unstable and I had a lot of these:

130928 7:07:40 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping

130928 7:07:40 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://10.0.0.5:4567
130928 7:07:41 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) reconnecting to 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567), attempt 0
130928 7:07:42 [Note] WSREP: evs::proto(4864b4ca-1969-11e3-91f6-efdb7950b797, GATHER, view_id(REG,240fee80-1969-11e3-b3f3-0b9e4ca22bc1,260)) suspecting node: 36f066e7-1969-11e3-b3e4-bfffdefe859f
130928 7:07:43 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:43 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:43 [Note] WSREP: declaring 240fee80-1969-11e3-b3f3-0b9e4ca22bc1 stable
130928 7:07:43 [Note] WSREP: Node 240fee80-1969-11e3-b3f3-0b9e4ca22bc1 state prim
130928 7:07:43 [Note] WSREP: view(view_id(PRIM,240fee80-1969-11e3-b3f3-0b9e4ca22bc1,261) memb {
240fee80-1969-11e3-b3f3-0b9e4ca22bc1,
4864b4ca-1969-11e3-91f6-efdb7950b797,
} joined {
} left {
} partitioned {
36f066e7-1969-11e3-b3e4-bfffdefe859f,
})
130928 7:07:43 [Note] WSREP: forgetting 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567)
130928 7:07:43 [Note] WSREP: deleting entry tcp://10.0.0.5:4567
130928 7:07:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
130928 7:07:43 [Note] WSREP: forgetting 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://197.242.148.230:4567)
130928 7:07:43 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130928 7:07:43 [Note] WSREP: deleting entry tcp://197.242.148.230:4567
130928 7:07:43 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:43 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
130928 7:07:43 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:43 [Note] WSREP: STATE EXCHANGE: sent state msg: e838de4f-27fb-11e3-8c27-7ea8d538500b
130928 7:07:43 [Note] WSREP: STATE EXCHANGE: got state msg: e838de4f-27fb-11e3-8c27-7ea8d538500b from 0 (###5)
130928 7:07:43 [Note] WSREP: STATE EXCHANGE: got state msg: e838de4f-27fb-11e3-8c27-7ea8d538500b from 1 (###7)
130928 7:07:43 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 219,
members = 2/2 (joined/total),
act_id = 5954217,
last_appl. = 5953977,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = 73cc2dce-0189-11e3-be43-e2d7eeaef85e
130928 7:07:43 [Note] WSREP: Flow-control interval: [23, 23]
130928 7:07:43 [Note] WSREP: New cluster view: global state: 73cc2dce-0189-11e3-be43-e2d7eeaef85e:5954217, view# 220: Primary, number of nodes: 2, my index: 1, protocol version 2
130928 7:07:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130928 7:07:43 [Note] WSREP: Assign initial position for certification: 5954217, protocol version: 2
130928 7:07:45 [Warning] WSREP: discarding established (time wait) 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567)
130928 7:07:45 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:46 [Warning] WSREP: discarding established (time wait) 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567)
130928 7:07:46 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:48 [Warning] WSREP: discarding established (time wait) 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567)
130928 7:07:48 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:49 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:49 [Note] WSREP: cleaning up 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567)
130928 7:07:49 [Note] WSREP: cleaning up 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://197.242.148.230:4567)
130928 7:07:49 [Warning] WSREP: evs::proto(4864b4ca-1969-11e3-91f6-efdb7950b797, GATHER, view_id(REG,240fee80-1969-11e3-b3f3-0b9e4ca22bc1,261)) source 36f066e7-1969-11e3-b3e4-bfffdefe859f is not supposed to be representative
130928 7:07:50 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping

…etc…

---- and a bit later this:
130928 7:21:17 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130928 7:21:17 [Note] WSREP: STATE EXCHANGE: sent state msg: cd389623-27fd-11e3-8df6-6327f157c0f2
130928 7:21:17 [Note] WSREP: STATE EXCHANGE: got state msg: cd389623-27fd-11e3-8df6-6327f157c0f2 from 0 (###5)
130928 7:21:17 [Note] WSREP: STATE EXCHANGE: got state msg: cd389623-27fd-11e3-8df6-6327f157c0f2 from 1 (###4)
130928 7:21:17 [Note] WSREP: STATE EXCHANGE: got state msg: cd389623-27fd-11e3-8df6-6327f157c0f2 from 2 (###7)
130928 7:21:17 [Warning] WSREP: Quorum: No node with complete state:

Version : 2
Flags : 3
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : b9d06527-27fd-11e3-a1dd-36ebc5d2f591
Prim seqno : 223
Last seqno : 5954716
Prim JOINED : 3
State UUID : cd389623-27fd-11e3-8df6-6327f157c0f2
Group UUID : 73cc2dce-0189-11e3-be43-e2d7eeaef85e
Name : ‘###5
Incoming addr: ‘10.0.0.6:3306’

Version : 2
Flags : 2
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : b9d06527-27fd-11e3-a1dd-36ebc5d2f591
Prim seqno : 223
Last seqno : 5954716
Prim JOINED : 3
State UUID : cd389623-27fd-11e3-8df6-6327f157c0f2
Group UUID : 73cc2dce-0189-11e3-be43-e2d7eeaef85e
Name : ‘###4
Incoming addr: ‘10.0.0.5:3306’

Version : 2
Flags : 2
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : b9d06527-27fd-11e3-a1dd-36ebc5d2f591
Prim seqno : 223
Last seqno : 5954716
Prim JOINED : 3
State UUID : cd389623-27fd-11e3-8df6-6327f157c0f2
Group UUID : 73cc2dce-0189-11e3-be43-e2d7eeaef85e
Name : ‘###7
Incoming addr: ‘10.0.0.8:3306’

130928 7:21:17 [Note] WSREP: Full re-merge of primary b9d06527-27fd-11e3-a1dd-36ebc5d2f591 found: 3 of 3.
130928 7:21:17 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 223,
members = 3/3 (joined/total),
act_id = 5954716,

—and eventually the cluster froze. I had to restart one system in bootstrap mode and that is how it is still running.

I have installed pt-stalk, hopefully it will send me some emails during the night.

In the slow log entries it’s the other way around, first is the header with information about query, then there is the query itself.
Anyways:

Query_time: 10.851373 Lock_time: 0.000059 Rows_sent: 0 Rows_examined: 1 Rows_affected: 1 Rows_read: 1

means some serious slow down, I suppose that was for the insert shown in previous example.

pt-stalk snapshot would tell also OS related statistics, like how the IO layer was loaded.

You did not confirm the involved table is InnoDB? Also by “My.cnf is stock standard” you mean also the InnoDB redo logs are default 5MB each and InnoDB buffer pool size default? If yes, then it’s surely not right settings for production purpose.

pt-stalk did not log anything yet, even with threshold set to 10. However, I saw my innodb buffer pool size was set to the default 128M, on a system with 32GB RAM! My active database is 2GB in size. This could be the root cause of all my problems. I added the following lines to my.cnf:

innodb_buffer_pool_size = 10G
innodb_thread_concurrency = 8
innodb_flush_method = O_DIRECT
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit = 2
innodb_log_buffer_size = 4M

After restarting mysql I did a new slow_log, and the analysis show that none of my queries take more than 10ms now. That’s better.

Next step: analyse the two dead cluster systems’ error logs and set up their innodb_buffer_pool_size. Then try to join the cluster again.

The innodb_buffer_pool_size is the most important setting from InnoDB performance point of view, but note also the innodb_log_file_size is crucial for write performance! Usually setting it to 1GB+ makes sense. (btw: [url]http://www.mysqlperformanceblog.com/2011/07/09/how-to-change-innodb_log_file_size-safely/[/url])