Performance of the xtrabackup process

Hi again community.
I’m getting disk I/O overload alerts from Zabbix because our xtrabackup process is taking more and more time, and we want to know if our settings are OK or if there is some tuning we can do to avoid this issue.

Here is what we have:
1 Percona master server - 4 CPU, 6 GB RAM, disk on RAID 5 SSD over SAN
1 Percona slave - 8 CPU, 10 GB RAM, disk on RAID 5 SSD over SAN

Our slave RAM right now is:
              total        used        free      shared  buff/cache   available
Mem:           9866        5481        2604         392        1780        3831
Swap:          3967         109        3858
- Our application runs only against the Percona master server.
- The slave is, of course, for replication and creating backups.
The Percona slave configuration is:
-------
# Percona Custom server configuration
#
# Ansible managed

[client]
socket=/mysql/mysql.sock

[mysqld]
innodb_undo_directory=/mysql/innodb/ibundologs
innodb_log_files_in_group=5
innodb_log_group_home_dir=/mysql/innodb/ibredologs
collation-server=utf8mb4_unicode_520_ci
innodb_write_io_threads=8
innodb_fast_shutdown=0
character_set_server=utf8mb4
max_connections=1500
innodb_buffer_pool_size=3G
innodb_flush_method=O_DIRECT
lower_case_table_names=1
max_allowed_packet=1G
log_error_verbosity=3
slow_query_log_file=/mysql/log/slow_query_log.txt
innodb_data_home_dir=/mysql/innodb/ibdata
slow-query-log=0
explicit_defaults_for_timestamp=OFF
innodb_log_file_size=250M
log_timestamps=SYSTEM
innodb_autoextend_increment=50
sql_mode=ANSI,NO_AUTO_CREATE_USER,STRICT_TRANS_TABLES
general-log=0
wait_timeout=300
innodb_flush_log_at_trx_commit=1
innodb_data_file_path=data1:100M:autoextend
innodb_buffer_pool_instances=8
general-log-file=/mysql/log/general_query_log.txt
innodb_read_io_threads=8
innodb_file_per_table=ON

# Percona 5x specific config
innodb_temp_data_file_path=…/ibgtemp/gtemp:50M:autoextend:max:500M

[mysqldump]
max_allowed_packet=1G

[mysql]
max_allowed_packet=1G
default-character-set=utf8mb4
no-auto-rehash
socket=/mysql/mysql.sock
-----
Our backup script runs with these values: about 80% of the CPUs for the parallel, compression, and encryption processes, i.e. 7 CPUs.

Logs from the script using xtrabackup:

----

xtrabackup: recognized server arguments: --log_bin=/mysql/binlogs/binlog_node2 --server-id=2 --innodb_undo_directory=/mysql/innodb/ibundologs --innodb_log_files_in_group=5 --innodb_log_group_home_dir=/mysql/innodb/ibredologs --innodb_write_io_threads=8 --innodb_buffer_pool_size=3G --innodb_flush_method=O_DIRECT --innodb_data_home_dir=/mysql/innodb/ibdata --innodb_log_file_size=250M --innodb_autoextend_increment=50 --innodb_flush_log_at_trx_commit=1 --innodb_data_file_path=data1:100M:autoextend --innodb_read_io_threads=8 --innodb_file_per_table=1 --datadir=/mysql/data --parallel=4

xtrabackup: recognized client arguments: --socket=/mysql/mysql.sock --backup=1 --user=xbuser --password=* --stream=xbstream --lock-ddl=1 --slave-info=1 --compress --compress-threads=4 --encrypt=AES256 --encrypt-key=* --encrypt-threads=4

----

Step 1 - Calculate the LSN to start the incremental from. We back up every 30 minutes: one full backup at midnight, and every 30 minutes an incremental backup so we can recover more recent data.

This action creates disk reads of 40-60 MB/s, with 90-99% I/O wait, for 3 minutes.
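For context, a common way to get that starting LSN is from the xtrabackup_checkpoints file of the previous backup; a minimal sketch, assuming that file is kept on disk (the path is illustrative):

----
# Read the to_lsn of the previous backup; the next incremental can then be
# started from that point by passing --incremental-lsn=$LAST_LSN to xtrabackup.
LAST_LSN=$(awk '/^to_lsn/ {print $3}' /backups/last/xtrabackup_checkpoints)
echo "next incremental starts from LSN $LAST_LSN"
----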


Step 2 - Run xtrabackup with xbstream on the Percona slave, encrypting and compressing, and send the stream to the repository server over SSH.
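Put together, the step 2 pipeline looks roughly like the sketch below, reconstructed from the recognized arguments shown above; the repository host, destination path, password and encryption key are placeholders:

----
# Backup, compress and encrypt on the slave, stream over SSH to the repository
xtrabackup --backup --socket=/mysql/mysql.sock --user=xbuser --password='***' \
  --parallel=4 --compress --compress-threads=4 \
  --encrypt=AES256 --encrypt-key='***' --encrypt-threads=4 \
  --lock-ddl --slave-info --stream=xbstream \
  | ssh backup@repository "cat > /backups/node2_$(date +%F_%H%M).xbstream"
----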

Doing these steps, this is what we see with iotop:

7 processes reading from 2 to 20 MB/s, with disk reads sometimes dropping to 1.2-2 MB/s and sometimes reaching 80 MB/s; we think this is the reading for encryption and compression.

So, with 7 CPU threads reading ~1.5 MB/s each, we see roughly 103-190 MB/s of total disk reads, with I/O wait from 5 to 50%, sometimes reaching 15-70%.
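Besides iotop, the device-level totals can be cross-checked with iostat from sysstat, for example:

----
# Extended device statistics in MB, refreshed every 5 seconds;
# rMB/s and wMB/s give aggregate throughput, %util shows device saturation.
iostat -xm 5
----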

Step 3 - Transmission

xbstream sends at 500-600 KB/s with 6-13% I/O wait (a low of 2%, and sometimes a high of 38%); across the 7 CPUs that adds up to 42-91% I/O wait (depending on the database at that moment).

On the repository server we receive the xbstream output (the encrypted and compressed xtrabackup files) at 500-600 KB/s from the Percona slave, with 10-23% I/O wait; sometimes it reaches 900 KB/s, 1.5 MB/s, or 2 MB/s.

On that MySQL server we have 37 databases, 77 GB in total.

The repository server is on the same RAID SSD volume on the SAN for speed, but we are still getting this disk overload.

I know it is possible to have multiple IOPS bottlenecks (disk, network, CPU, VM, VMware, etc.). I'm trying to figure out which it could be. If you need more info, feel free to ask.

Is there some other Percona tool to analyze the xtrabackup process? I have tried to share as many values as I could to help with fixing or tuning this better.


Hi, that can be many things; obviously the storage is the bottleneck given the iowait you have. At less than 1 MB/s compressed, that means a raw throughput of maybe 2 MB/s, which is small. Off the top of my head:
- a degraded RAID-5 array
- use of InnoDB transparent page compression with punch hole

But in your case, you perform incremental backups, and those are normally not fast. Percona Server can track modified pages, which helps in some cases, but if too many pages are modified, it may be faster to just scan (i.e. not use the feature). For details about the page tracking, see:
https://www.percona.com/doc/percona-server/5.7/management/changed_page_tracking.html
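If the tracking is worth trying, it is enabled in my.cnf on the server you back up from; a minimal sketch per the linked doc (a server restart is required):

----
[mysqld]
innodb_track_changed_pages = ON
----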
Instead of doing an incremental backup every 30 min, have you considered streaming the binlogs? This is “THE” way of getting true PITR with MySQL. Essentially you use the mysqlbinlog tool to connect to a remote slave and stream the binlogs in real time. I wrote a script to manage that (there are many others…):
https://github.com/y-trudeau/Mysql-tools/blob/master/backup/binlog-streamer.sh

It should be called by cron regularly; it checks the mysqlbinlog process, restarts it if it died, and compresses binlog files that are complete.
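For reference, the core of the idea is a single mysqlbinlog command; a sketch where the host, user and starting binlog file are placeholders (the script adds the monitoring and compression around it):

----
# Stream binlogs from the slave in real time; --raw writes the files as-is,
# --stop-never keeps the connection open waiting for new events.
mysqlbinlog --read-from-remote-server --host=slave_host --port=3306 \
  --user=repl_user --password --raw --stop-never binlog_node2.000001
----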