Hi again community.
I’m getting disk I/O overload alerts from Zabbix because our xtrabackup process is taking longer and longer, and we want to know whether our settings are OK or whether there is some tuning we can do to avoid this issue.
Here is what we have:
1 Percona master server - 4 CPUs, 6 GB RAM, RAID5 SSD disk over SAN
1 Percona slave - 8 CPUs, 10 GB RAM, RAID5 SSD disk over SAN
Our slave RAM right now is:

              total        used        free      shared  buff/cache   available
Mem:           9866        5481        2604         392        1780        3831
Swap:          3967         109        3858
- Our application runs only against the Percona master server.
- The slave is, of course, for replication and for creating backups.
The Percona slave configuration is:
-------
# Percona Custom server configuration
#
# Ansible managed

[client]
socket=/mysql/mysql.sock

[mysqld]
innodb_undo_directory=/mysql/innodb/ibundologs
innodb_log_files_in_group=5
innodb_log_group_home_dir=/mysql/innodb/ibredologs
collation-server=utf8mb4_unicode_520_ci
innodb_write_io_threads=8
innodb_fast_shutdown=0
character_set_server=utf8mb4
max_connections=1500
innodb_buffer_pool_size=3G
innodb_flush_method=O_DIRECT
lower_case_table_names=1
max_allowed_packet=1G
log_error_verbosity=3
slow_query_log_file=/mysql/log/slow_query_log.txt
innodb_data_home_dir=/mysql/innodb/ibdata
slow-query-log=0
explicit_defaults_for_timestamp=OFF
innodb_log_file_size=250M
log_timestamps=SYSTEM
innodb_autoextend_increment=50
sql_mode=ANSI,NO_AUTO_CREATE_USER,STRICT_TRANS_TABLES
general-log=0
wait_timeout=300
innodb_flush_log_at_trx_commit=1
innodb_data_file_path=data1:100M:autoextend
innodb_buffer_pool_instances=8
general-log-file=/mysql/log/general_query_log.txt
innodb_read_io_threads=8
innodb_file_per_table=ON

# Percona 5x specific config
innodb_temp_data_file_path=…/ibgtemp/gtemp:50M:autoextend:max:500M

[mysqldump]
max_allowed_packet=1G

[mysql]
max_allowed_packet=1G
default-character-set=utf8mb4
no-auto-rehash
socket=/mysql/mysql.sock
-----
Our backup script runs this way, with these values: 80% of the CPUs, i.e. 7 CPUs, for the parallel, compression and encryption processes.
Logs from the script using xtrabackup:
----
xtrabackup: recognized server arguments: --log_bin=/mysql/binlogs/binlog_node2 --server-id=2 --innodb_undo_directory=/mysql/innodb/ibundologs --innodb_log_files_in_group=5 --innodb_log_group_home_dir=/mysql/innodb/ibredologs --innodb_write_io_threads=8 --innodb_buffer_pool_size=3G --innodb_flush_method=O_DIRECT --innodb_data_home_dir=/mysql/innodb/ibdata --innodb_log_file_size=250M --innodb_autoextend_increment=50 --innodb_flush_log_at_trx_commit=1 --innodb_data_file_path=data1:100M:autoextend --innodb_read_io_threads=8 --innodb_file_per_table=1 --datadir=/mysql/data --parallel=4
xtrabackup: recognized client arguments: --socket=/mysql/mysql.sock --backup=1 --user=xbuser --password=* --stream=xbstream --lock-ddl=1 --slave-info=1 --compress --compress-threads=4 --encrypt=AES256 --encrypt-key=* --encrypt-threads=4
----
Step 1 - calculate the LSN to start from. We take a backup every 30 minutes: one full backup at midnight, then differential backups every 30 minutes so we can recover more recent data.
This step generates disk reads of 40-60 MB/s with 90-99% IOwait for about 3 minutes.
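The LSN lookup in step 1 presumably boils down to something like the following (a hypothetical sketch: the post doesn't include the script, and the path and LSN value here are made-up examples; what is certain is that xtrabackup records the LSN a backup reached in its xtrabackup_checkpoints file):

```shell
# Hypothetical sketch; /tmp/last_backup and the LSN value are examples only.
# A previous backup's xtrabackup_checkpoints file records the LSN it reached:
ckpt=/tmp/last_backup/xtrabackup_checkpoints
mkdir -p "$(dirname "$ckpt")"
printf 'backup_type = full-backuped\nfrom_lsn = 0\nto_lsn = 1666625\n' > "$ckpt"

# "Calculate the LSN to get from" then amounts to reading that value back:
last_lsn=$(awk -F' = ' '/^to_lsn/ {print $2}' "$ckpt")
echo "$last_lsn"    # -> 1666625

# and the next differential/incremental run would start from it, e.g.:
#   xtrabackup --backup --incremental-lsn="$last_lsn" ...
```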
Step 2 - run xtrabackup with xbstream on the Percona slave, encrypting and compressing the stream, and send it to the repository server over SSH.
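Putting the logged arguments together, step 2 presumably amounts to something like this (a sketch reassembled from the recognized arguments above; the repository host, target directory and the $XB_PASSWORD / $XB_KEY variables are placeholders, not values from our setup):

```shell
# Reassembled from the logged arguments; host, target dir and the
# credential variables are placeholders.
xtrabackup --backup \
    --socket=/mysql/mysql.sock \
    --user=xbuser --password="$XB_PASSWORD" \
    --parallel=4 \
    --compress --compress-threads=4 \
    --encrypt=AES256 --encrypt-key="$XB_KEY" --encrypt-threads=4 \
    --lock-ddl --slave-info \
    --stream=xbstream \
  | ssh backupuser@repository "xbstream -x -C /backups/incoming"
```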
Watching these steps with iotop, we see 7 processes reading from 2 to 20 MB/s each; sometimes disk reads sit at 1.2-2 MB/s, and sometimes they reach 80 MB/s. We think this is the reading for encryption and compression.
So, with the 7 CPUs reading at ~1.5 MB/s and above per thread, the threads together produce 103-190 MB/s of disk reads, with IOwait between 5 and 50%, sometimes spiking to 15-70%.
Step 3 - transmission.
xbstream sends at 500-600 KB/s with 6-13% IOwait (as low as 2%, sometimes as high as 38%); multiplied over the 7 CPUs that gives 42-91% IOwait (depending on the database being processed at that moment).
On the repository we receive the encrypted and compressed xtrabackup stream from the Percona slave at 500-600 KB/s with 10-23% IOwait, sometimes reaching 900 KB/s, 1.5 MB/s or 2 MB/s.
On that MySQL server we have 37 databases, 77 GB in total.
The repository server is on the same SSD RAID volume on the SAN for speed, but we are still getting this disk overload.
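As a rough sanity check on the numbers above (my own back-of-envelope arithmetic, not a measurement; it ignores compression, which makes the real stream much smaller than the raw data), the observed 500-600 KB/s per stream against the 77 GB total works out to:

```shell
# Back-of-envelope: time to move 77 GB at the observed xbstream rates.
# Purely illustrative; compression means far fewer bytes actually cross
# the network, so real wall-clock times are shorter.
awk 'BEGIN {
  size_kb = 77 * 1024 * 1024                  # 77 GB expressed in KB
  for (rate = 500; rate <= 600; rate += 100)  # observed 500-600 KB/s
    printf "%d KB/s -> %.1f hours\n", rate, size_kb / rate / 3600
}'
# prints:
#   500 KB/s -> 44.9 hours
#   600 KB/s -> 37.4 hours
```

So the per-stream network rate alone clearly can't account for a 30-minute cycle; the pressure we see is on the read/compress/encrypt side of the slave.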
I know it’s possible to hit multiple IOPS bottlenecks (disk, network, CPU, VM/VMware, etc.); I’m trying to figure out which one it is. If you need more info, feel free to ask.
Is there some other Percona tool to analyze the xtrabackup process? I’ve tried to share as many values as possible to help with fixing or tuning this better.