Aborted connections, Apache crashing & general mayhem

Well, I have been trying to find the source of a major bottleneck for the past few days.

What happens is apache connections spike and cause the server to grind to a halt as each apache connection tends to run a script which also opens a mysql connection. So things spiral into chaos as apache connections build up. These spikes occur intermittently and occur much more during peak time. The question is what is causing the spikes and the spiral into chaos as the server runs fine under normal heavy load between each spike.

In the past days I have switched from the Apache 2 prefork MPM to the worker MPM - what happens now is instead of the server grinding to a halt but still working, apache actually crashes and the error_log message is usually that MaxClients was reached. The crashing doesn’t matter though as Apache still needed to be restarted when I was running the prefork MPM anyway. This indicates to me that the underlying problem is still there and not an Apache problem.

The next thing I did was disable mysql query cache caching by default because I had a theory that mass invalidation of frequently modified tables might be causing serious problems and was often wasting memory because I have an APC caching system on the application level which supersedes the mysql query cache in many cases.

Now, I have looked extensively at my mysql slow queries log, mysql error_log and unix messages log and at the apache error_log around each crash. My slow query time is set to 1 second and I have caught most of the major slow queries I could find. I’ve even enabled and inspected the general query log. The two things I can say are that:

  • the apache error_log usually registers a “MaxClients reached. Consider raising MaxClients” error before the crash. MaxClients is set to 400 which is well above normal load. I’ve set it to 1000 and the crashes still occur, they just take longer and are more painful as the system starts thrashing.

  • the mysql error_log is registering hundreds of aborted connections around each crash. In some cases over 400 aborted connections in a second.

My server is a dual harpertown (8 cpus) with 12GB RAM and a 100mbit port. The database is on a dedicated SCSI hard drive and all static content is served by separate servers.

mysql> SHOW STATUS;±----------------------------------±-----------+| Variable_name | Value |±----------------------------------±-----------+| Aborted_clients | 44521 || Aborted_connects | 1118 || Binlog_cache_disk_use | 0 || Binlog_cache_use | 0 || Bytes_received | 115 || Bytes_sent | 181 || Com_admin_commands | 0 || Com_alter_db | 0 || Com_alter_table | 0 || Com_analyze | 0 || Com_backup_table | 0 || Com_begin | 0 || Com_call_procedure | 0 || Com_change_db | 0 || Com_change_master | 0 || Com_check | 0 || Com_checksum | 0 || Com_commit | 0 || Com_create_db | 0 || Com_create_function | 0 || Com_create_index | 0 || Com_create_table | 0 || Com_create_user | 0 || Com_dealloc_sql | 0 || Com_delete | 0 || Com_delete_multi | 0 || Com_do | 0 || Com_drop_db | 0 || Com_drop_function | 0 || Com_drop_index | 0 || Com_drop_table | 0 || Com_drop_user | 0 || Com_execute_sql | 0 || Com_flush | 0 || Com_grant | 0 || Com_ha_close | 0 || Com_ha_open | 0 || Com_ha_read | 0 || Com_help | 0 || Com_insert | 0 || Com_insert_select | 0 || Com_kill | 0 || Com_load | 0 || Com_load_master_data | 0 || Com_load_master_table | 0 || Com_lock_tables | 0 || Com_optimize | 0 || Com_preload_keys | 0 || Com_prepare_sql | 0 || Com_purge | 0 || Com_purge_before_date | 0 || Com_rename_table | 0 || Com_repair | 0 || Com_replace | 0 || Com_replace_select | 0 || Com_reset | 0 || Com_restore_table | 0 || Com_revoke | 0 || Com_revoke_all | 0 || Com_rollback | 0 || Com_savepoint | 0 || Com_select | 1 || Com_set_option | 0 || Com_show_binlog_events | 0 || Com_show_binlogs | 0 || Com_show_charsets | 0 || Com_show_collations | 0 || Com_show_column_types | 0 || Com_show_create_db | 0 || Com_show_create_table | 0 || Com_show_databases | 0 || Com_show_errors | 0 || Com_show_fields | 0 || Com_show_grants | 0 || Com_show_innodb_status | 0 || Com_show_keys | 0 || Com_show_logs | 0 || Com_show_master_status | 0 || Com_show_ndb_status | 0 || Com_show_new_master | 0 || Com_show_open_tables | 0 || Com_show_privileges | 0 || Com_show_processlist | 0 || Com_show_slave_hosts | 0 || Com_show_slave_status | 0 || Com_show_status | 1 || Com_show_storage_engines | 0 || Com_show_tables | 0 || Com_show_triggers | 0 || Com_show_variables | 0 || Com_show_warnings | 0 || Com_slave_start | 0 || Com_slave_stop | 0 || Com_stmt_close | 0 || Com_stmt_execute | 0 || Com_stmt_fetch | 0 || Com_stmt_prepare | 0 || Com_stmt_reset | 0 || Com_stmt_send_long_data | 0 || Com_truncate | 0 || Com_unlock_tables | 0 || Com_update | 0 || Com_update_multi | 0 || Com_xa_commit | 0 || Com_xa_end | 0 || Com_xa_prepare | 0 || Com_xa_recover | 0 || Com_xa_rollback | 0 || Com_xa_start | 0 || Compression | OFF || Connections | 3729736 || Created_tmp_disk_tables | 0 || Created_tmp_files | 543 || Created_tmp_tables | 1 || Delayed_errors | 2132890 || Delayed_insert_threads | 11 || Delayed_writes | 5390758 || Flush_commands | 1 || Handler_commit | 0 || Handler_delete | 0 || Handler_discover | 0 || Handler_prepare | 0 || Handler_read_first | 0 || Handler_read_key | 0 || Handler_read_next | 0 || Handler_read_prev | 0 || Handler_read_rnd | 0 || Handler_read_rnd_next | 0 || Handler_rollback | 0 || Handler_savepoint | 0 || Handler_savepoint_rollback | 0 || Handler_update | 0 || Handler_write | 132 || Innodb_buffer_pool_pages_data | 124849 || Innodb_buffer_pool_pages_dirty | 51 || Innodb_buffer_pool_pages_flushed | 560603 || Innodb_buffer_pool_pages_free | 2 || Innodb_buffer_pool_pages_latched | 1 || Innodb_buffer_pool_pages_misc | 6221 || Innodb_buffer_pool_pages_total | 131072 || Innodb_buffer_pool_read_ahead_rnd | 249 || Innodb_buffer_pool_read_ahead_seq | 1146 || Innodb_buffer_pool_read_requests | 2613650366 || Innodb_buffer_pool_reads | 101110 || Innodb_buffer_pool_wait_free | 0 || Innodb_buffer_pool_write_requests | 5355760 || Innodb_data_fsyncs | 99895 || Innodb_data_pending_fsyncs | 0 || Innodb_data_pending_reads | 0 || Innodb_data_pending_writes | 0 || Innodb_data_read | 2121207808 || Innodb_data_reads | 110176 || Innodb_data_writes | 1271653 || Innodb_data_written | 2431035904 || Innodb_dblwr_pages_written | 560603 || Innodb_dblwr_writes | 11367 || Innodb_log_waits | 0 || Innodb_log_write_requests | 1914050 || Innodb_log_writes | 786942 || Innodb_os_log_fsyncs | 76904 || Innodb_os_log_pending_fsyncs | 0 || Innodb_os_log_pending_writes | 0 || Innodb_os_log_written | 1238141440 || Innodb_page_size | 16384 || Innodb_pages_created | 4899 || Innodb_pages_read | 129335 || Innodb_pages_written | 560603 || Innodb_row_lock_current_waits | 0 || Innodb_row_lock_time | 1016 || Innodb_row_lock_time_avg | 7 || Innodb_row_lock_time_max | 72 || Innodb_row_lock_waits | 138 || Innodb_rows_deleted | 40039 || Innodb_rows_inserted | 54930 || Innodb_rows_read | 1092053930 || Innodb_rows_updated | 658682 || Key_blocks_not_flushed | 0 || Key_blocks_unused | 84660 || Key_blocks_used | 135516 || Key_read_requests | 958821347 || Key_reads | 140233 || Key_write_requests | 3120784 || Key_writes | 2058025 || Last_query_cost | 0.000000 || Max_used_connections | 335 || Ndb_cluster_node_id | 0 || Ndb_config_from_host | || Ndb_config_from_port | 0 || Ndb_number_of_data_nodes | 0 || Not_flushed_delayed_rows | 0 || Open_files | 1158 || Open_streams | 0 || Open_tables | 1798 || Opened_tables | 0 || Prepared_stmt_count | 0 || Qcache_free_blocks | 10088 || Qcache_free_memory | 69483352 || Qcache_hits | 5381877 || Qcache_inserts | 1202186 || Qcache_lowmem_prunes | 7319 || Qcache_not_cached | 27700147 || Qcache_queries_in_cache | 21196 || Qcache_total_blocks | 53751 || Questions | 199318311 || Rpl_status | NULL || Select_full_join | 0 || Select_full_range_join | 0 || Select_range | 0 || Select_range_check | 0 || Select_scan | 1 || Slave_open_temp_tables | 0 || Slave_retried_transactions | 0 || Slave_running | OFF || Slow_launch_threads | 0 || Slow_queries | 0 || Sort_merge_passes | 0 || Sort_range | 0 || Sort_rows | 0 || Sort_scan | 0 || Table_locks_immediate | 49034075 || Table_locks_waited | 730 || Tc_log_max_pages_used | 0 || Tc_log_page_size | 0 || Tc_log_page_waits | 0 || Threads_cached | 33 || Threads_connected | 18 || Threads_created | 17038 || Threads_running | 4 || Uptime | 67595 || Uptime_since_flush_status | 67595 |±----------------------------------±-----------+230 rows in set (0.04 sec)mysql> SHOW VARIABLES;±--------------------------------±------------------------------+| Variable_name | Value |±--------------------------------±------------------------------+| auto_increment_increment | 1 || auto_increment_offset | 1 || automatic_sp_privileges | ON || back_log | 50 || basedir | / || binlog_cache_size | 32768 || bulk_insert_buffer_size | 8388608 || character_set_client | latin1 || character_set_connection | latin1 || character_set_database | latin1 || character_set_filesystem | binary || character_set_results | latin1 || character_set_server | latin1 || character_set_system | utf8 || character_sets_dir | /usr/share/mysql/charsets/ || collation_connection | latin1_swedish_ci || collation_database | latin1_swedish_ci || collation_server | latin1_swedish_ci || completion_type | 0 || concurrent_insert | 1 || connect_timeout | 10 || datadir | /database/mysql/ || date_format | %Y-%m-%d || datetime_format | %Y-%m-%d %H:%i:%s || default_week_format | 0 || delay_key_write | ON || delayed_insert_limit | 100 || delayed_insert_timeout | 300 || delayed_queue_size | 1000 || div_precision_increment | 4 || keep_files_on_create | OFF || engine_condition_pushdown | OFF || expire_logs_days | 0 || flush | OFF || flush_time | 0 || ft_boolean_syntax | + -><()~*:“”&| || ft_max_word_len | 84 || ft_min_word_len | 4 || ft_query_expansion_limit | 20 || ft_stopword_file | (built-in) || group_concat_max_len | 1024 || have_archive | YES || have_bdb | NO || have_blackhole_engine | YES || have_compress | YES || have_crypt | YES || have_csv | YES || have_dynamic_loading | YES || have_example_engine | YES || have_federated_engine | YES || have_geometry | YES || have_innodb | YES || have_isam | NO || have_merge_engine | YES || have_ndbcluster | DISABLED || have_openssl | NO || have_ssl | NO || have_query_cache | YES || have_raid | NO || have_rtree_keys | YES || have_symlink | YES || hostname | fpsbanana.arkerion.com || init_connect | || init_file | || init_slave | || innodb_additional_mem_pool_size | 1048576 || innodb_autoextend_increment | 8 || innodb_buffer_pool_awe_mem_mb | 0 || innodb_buffer_pool_size | 2147483648 || innodb_checksums | ON || innodb_commit_concurrency | 0 || innodb_concurrency_tickets | 500 || innodb_data_file_path | ibdata1:10M:autoextend || innodb_data_home_dir | /database/mysql || innodb_adaptive_hash_index | ON || innodb_doublewrite | ON || innodb_fast_shutdown | 1 || innodb_file_io_threads | 4 || innodb_file_per_table | OFF || innodb_flush_log_at_trx_commit | 2 || innodb_flush_method | || innodb_force_recovery | 0 || innodb_lock_wait_timeout | 50 || innodb_locks_unsafe_for_binlog | OFF || innodb_log_arch_dir | || innodb_log_archive | OFF || innodb_log_buffer_size | 1048576 || innodb_log_file_size | 5242880 || innodb_log_files_in_group | 2 || innodb_log_group_home_dir | ./ || innodb_max_dirty_pages_pct | 90 || innodb_max_purge_lag | 0 || innodb_mirrored_log_groups | 1 || innodb_open_files | 300 || innodb_rollback_on_timeout | OFF || innodb_support_xa | ON || innodb_sync_spin_loops | 20 || innodb_table_locks | ON || innodb_thread_concurrency | 8 || innodb_thread_sleep_delay | 10000 || interactive_timeout | 30 || join_buffer_size | 1048576 || key_buffer_size | 1073741824 || key_cache_age_threshold | 300 || key_cache_block_size | 1024 || key_cache_division_limit | 100 || language | /usr/share/mysql/english/ || large_files_support | ON || large_page_size | 0 || large_pages | OFF || lc_time_names | en_US || license | GPL || local_infile | ON || locked_in_memory | OFF || log | OFF || log_bin | OFF || log_bin_trust_function_creators | OFF || log_error | ./mysql_errors.log || log_queries_not_using_indexes | ON || log_slave_updates | OFF || log_slow_queries | ON || log_warnings | 2 || long_query_time | 1 || low_priority_updates | ON || lower_case_file_system | OFF || lower_case_table_names | 0 || max_allowed_packet | 33554432 || max_binlog_cache_size | 4294963200 || max_binlog_size | 1073741824 || max_connect_errors | 10 || max_connections | 400 || max_delayed_threads | 20 || max_error_count | 64 || max_heap_table_size | 16777216 || max_insert_delayed_threads | 20 || max_join_size | 18446744073709551615 || max_length_for_sort_data | 1024 || max_prepared_stmt_count | 16382 || max_relay_log_size | 0 || max_seeks_for_key | 4294967295 || max_sort_length | 1024 || max_sp_recursion_depth | 0 || max_tmp_tables | 32 || max_user_connections | 400 || max_write_lock_count | 4294967295 || multi_range_count | 256 || myisam_data_pointer_size | 6 || myisam_max_sort_file_size | 2146435072 || myisam_recover_options | OFF || myisam_repair_threads | 1 || myisam_sort_buffer_size | 67108864 || myisam_stats_method | nulls_unequal || ndb_autoincrement_prefetch_sz | 1 || ndb_force_send | ON || ndb_use_exact_count | ON || ndb_use_transactions | ON || ndb_cache_check_time | 0 || ndb_connectstring | || net_buffer_length | 16384 || net_read_timeout | 30 || net_retry_count | 10 || net_write_timeout | 60 || new | OFF || old_passwords | ON || open_files_limit | 4506 || optimizer_prune_level | 1 || optimizer_search_depth | 62 || pid_file | /database/mysql/mysqld.pid || plugin_dir | || port | 3306 || preload_buffer_size | 32768 || profiling | OFF || profiling_history_size | 15 || protocol_version | 10 || query_alloc_block_size | 8192 || query_cache_limit | 8388608 || query_cache_min_res_unit | 4096 || query_cache_size | 134217728 || query_cache_type | DEMAND || query_cache_wlock_invalidate | OFF || query_prealloc_size | 8192 || range_alloc_block_size | 4096 || read_buffer_size | 4194304 || read_only | OFF || read_rnd_buffer_size | 4194304 || relay_log | || relay_log_index | || relay_log_info_file | relay-log.info || relay_log_purge | ON || relay_log_space_limit | 0 || rpl_recovery_rank | 0 || secure_auth | OFF || secure_file_priv | || server_id | 1 || skip_external_locking | ON || skip_networking | OFF || skip_show_database | OFF || slave_compressed_protocol | OFF || slave_load_tmpdir | /tmp/ || slave_net_timeout | 3600 || slave_skip_errors | OFF || slave_transaction_retries | 10 || slow_launch_time | 2 || socket | /var/lib/mysql/mysql.sock || sort_buffer_size | 8388608 || sql_big_selects | ON || sql_mode | || sql_notes | ON || sql_warnings | OFF || ssl_ca | || ssl_capath | || ssl_cert | || ssl_cipher | || ssl_key | || storage_engine | MyISAM || sync_binlog | 0 || sync_frm | ON || system_time_zone | UTC || table_cache | 2048 || table_lock_wait_timeout | 50 || table_type | MyISAM || thread_cache_size | 40 || thread_stack | 196608 || time_format | %H:%i:%s || time_zone | SYSTEM || timed_mutexes | OFF || tmp_table_size | 134217728 || tmpdir | /tmp || transaction_alloc_block_size | 8192 || transaction_prealloc_size | 4096 || tx_isolation | REPEATABLE-READ || updatable_views_with_limit | YES || version | 5.0.67-community-log || version_comment | MySQL Community Edition (GPL) || version_compile_machine | i686 || version_compile_os | redhat-linux-gnu || wait_timeout | 30 |±--------------------------------±------------------------------+237 rows in set (0.01 sec)

I have also attached an example of the aborted connections. You can see from the time that they start to alleviate at the end of the log. What I noticed is that the error changes from “Got an error reading communication packets” to “Got timeout reading communication packets” some way through the log.

Can anyone help me? (

Thanks

I can’t see anything directly wrong in your configuration and status variables.

Although this looks a bit odd:

| Delayed_errors | 2132890 || Delayed_writes | 5390758 |

This means that roughly 45% of your delayed inserts fail.
So the question is if your application thinks it’s OK that all these queries fail.

But the real question is:
Wwhat is the server actually doing when this happens?

Do you have high CPU load? Is it Apache or MySQL that is consuming CPU?
Or do you have high IO wait?
Is the system swapping a lot?
Is Apache using a lot of RAM?
etc.

You will have to setup something that monitors these things and unfortunately wait for the next crash (or if you can simulate it in a test environment) to see what actually happens directly before this happens.

You will have to look at some of those things

Thanks for responding.

It is hard to track down. From looking at top CPU is usually over 50% idle and there is spare RAM. No swap is being used. CPU IO hovers around 2-5%.

I suspect the problem may be Apache CPU usage, but again it is difficult to make sense of the fluctuations and apache extended status CPU usage.

What confuses me more than anything is why Apache is crashing and not just reaching MaxClients and slowing to a halt like it should.

top - 17:01:11 up 24 days, 13:46, 3 users, load average: 4.30, 4.68, 5.44Tasks: 156 total, 1 running, 154 sleeping, 0 stopped, 1 zombieCpu(s): 24.5% us, 12.4% sy, 0.0% ni, 57.6% id, 5.6% wa, 0.0% hi, 0.0% siMem: 12471636k total, 11343472k used, 1128164k free, 434684k buffersSwap: 4192924k total, 20172k used, 4172752k free, 7527300k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 9898 mysql 15 0 2948m 2.7g 5392 S 69.9 22.5 2917:09 mysqld 9651 nobody 16 0 987m 64m 30m S 66.9 0.5 0:02.78 httpd 9506 nobody 16 0 994m 83m 32m S 66.5 0.7 0:04.14 httpd 9463 nobody 16 0 983m 73m 41m S 51.9 0.6 0:05.96 httpd 9723 nobody 16 0 983m 45m 14m S 37.9 0.4 0:01.14 httpd 5803 root 15 0 0 0 0 S 0.7 0.0 4:21.47 kjournald 2737 root 15 0 0 0 0 S 0.3 0.0 51:43.42 kjournald22978 root 16 0 7664 2348 1864 S 0.3 0.0 0:00.28 sshd11911 root 16 0 3884 1048 772 R 0.3 0.0 0:03.79 top

Do you have any logging on the amount of connections that your apache is getting or incoming packages?
Or how many page requests?

One theory could be that you are experience DDOS attacks.
Which could just overload the apache server and give very strange results.

Because from a MySQL point of view it looks just like the clients are stopping to send/receive data and then the connection timouts according to your wait_timeout and interactive_timeout values.

It is hard for me to monitor the apache processes over time but I think that would reveal a lot. I have asked the techs helping me to create a script to write the apache extended status to file every second to see what happens during the crashes. If it were an attack it is unlike anything I have seen because there are none of the usual signs of a DOS like many of the same IP and many clients hitting a single page. During peak Apache is processing about 150 requests per second, and these are all for dynamic content as static content is hosted on other servers.

Actually yesterday before peak a decided to make an emergency caching solution which is now caching pages as HTML with a TTL of 10 minutes for guest users, which account for 80% of traffic (I can’t use this rudimentary cache system for logged-in members as they all see different things). I am not sure if a disk cache like this is any better - there were still a lot of crashes during peak while I was asleep.

Anyway, I am continuing with application-level optimizations such as more APC usage (though I wonder if this will lead to more Apache CPU usage and be making the problem worse! So confusing () - I’m trying to get the queries per page down below 10 - currently at about 20.