Replication error 1236 "unknown error reading log event on the master"

Hello,

We are running two instances of Percona Server 5.7.20-18-log in master-master replication (dbm1 / dbm2). The application uses only dbm1 for reads and writes, so dbm2 serves purely as a cold standby.
Over the last few weeks, replication on dbm2 has stopped several times per week with the error message “unknown error reading log event on the master”. Here is the full slave status on dbm2 from the most recent incident:

mysql> show slave status \G
*************************** 1. row ***************************
Slave_IO_State: 
Master_Host: 10.6.233.101
Master_User: replicator
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.007091
Read_Master_Log_Pos: 71137
Relay_Log_File: mysqld-relay-bin.001603
Relay_Log_Pos: 71350
Relay_Master_Log_File: mysql-bin.007091
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB: 
Replicate_Ignore_DB: 
Replicate_Do_Table: 
Replicate_Ignore_Table: 
Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: 
Last_Errno: 0
Last_Error: 
Skip_Counter: 0
Exec_Master_Log_Pos: 71137
Relay_Log_Space: 71605
Until_Condition: None
Until_Log_File: 
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File: 
Master_SSL_CA_Path: 
Master_SSL_Cert: 
Master_SSL_Cipher: 
Master_SSL_Key: 
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'unknown error reading log event on the master; the first event 'mysql-bin.007003' at 43445, the last event read from './mysql-bin.007091' at 71137, the last byte read from './mysql-bin.007091' at 71137.'
Last_SQL_Errno: 0
Last_SQL_Error: 
Replicate_Ignore_Server_Ids: 
Master_Server_Id: 1
Master_UUID: 7cb8eccb-e1a0-11e7-9d96-545edb2e572a
Master_Info_File: /var/lib/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind: 
Last_IO_Error_Timestamp: 180308 06:41:25
Last_SQL_Error_Timestamp: 
Master_SSL_Crl: 
Master_SSL_Crlpath: 
Retrieved_Gtid_Set: 
Executed_Gtid_Set: 
Auto_Position: 0
Replicate_Rewrite_DB: 
Channel_Name: 
Master_TLS_Version: 
1 row in set (0.00 sec)
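
What we have not yet verified is whether dbm1 itself can still read its binary log at the position mentioned in the error. A rough sketch of how we plan to check that on the master (the binlog path is assumed from datadir = /var/lib/mysql; file and position are taken from the error above, and binlog_checksum should still be at its 5.7 default of CRC32):

# on dbm1: decode the binlog from the failing position and verify event checksums
mysqlbinlog --verify-binlog-checksum --start-position=71137 /var/lib/mysql/mysql-bin.007091 > /dev/null

# on dbm1: look for binlog-related messages around the time of the incident
grep -i binlog /var/log/mysql/error.log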

So far we have worked around this with a full restore of dbm2 using XtraBackup, but after a few days the problem on dbm2 appears again.
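
Before the next full restore we would like to try simply repointing the slave, since Read_Master_Log_Pos and Exec_Master_Log_Pos are identical (71137) when the error occurs, i.e. the SQL thread has applied everything that was fetched. A rough sketch of what we would run on dbm2 (coordinates taken from the slave status above; this is only an assumption on our side and can obviously only work if the binlog on dbm1 is actually intact at that position):

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO
    ->   MASTER_LOG_FILE = 'mysql-bin.007091',
    ->   MASTER_LOG_POS  = 71137;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G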
According to our monitoring, both servers have plenty of resources left (and no I/O wait). What we can see is that the error often occurs while the application is running import jobs (writing to dbm1).

Both servers are located in the same data center and communicate over the internal network.
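
Because the failures seem to coincide with the import jobs, one thing we want to rule out is a mismatch in the packet- and checksum-related settings between the two servers. A rough sketch of the comparison we have in mind (whether any of these variables are actually involved in this particular 1236 error is just an assumption on our part):

mysql> SHOW GLOBAL VARIABLES WHERE Variable_name IN
    -> ('max_allowed_packet', 'slave_max_allowed_packet',
    ->  'binlog_checksum', 'master_verify_checksum',
    ->  'slave_sql_verify_checksum', 'slave_net_timeout');

We would run this on both dbm1 and dbm2 and compare the output.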

More details:

OS: Ubuntu 14.04.5 LTS
Kernel: OpenVZ VPS (Kernel 2.6.32-042stab127.2)
CPU: 8 cores Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
RAM: 16 GB
Percona: 5.7.20-18-log Percona Server (GPL), Release '18', Revision '7ce54a6deef'

my.cnf

[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[isamchk]
key_buffer_size = 16M

[mysqld]
basedir = /usr
bind_address = *
binlog_cache_size = 1M
binlog_format = mixed
bulk_insert_buffer_size = 64M
datadir = /var/lib/mysql
expire_logs_days = 2
innodb_buffer_pool_dump_at_shutdown = ON
innodb_buffer_pool_load_at_startup = ON
innodb_buffer_pool_size = 12G
innodb_doublewrite = OFF
innodb_flush_log_at_trx_commit = 2
innodb_io_capacity = 1000
innodb_log_file_size = 256M
innodb_read_io_threads = 8
innodb_write_io_threads = 8
join_buffer_size = 244K
key_buffer_size = 100M
log-bin = mysql-bin
log-error = /var/log/mysql/error.log
log_slow_verbosity = full
long_query_time = 1
max_allowed_packet = 16M
max_binlog_files = 200
max_binlog_size = 100M
max_connections = 500
max_heap_table_size = 128M
max_relay_log_size = 256M
max_slowlog_files = 1
max_slowlog_size = 4G
myisam_sort_buffer_size = 64M
optimizer_switch = index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=off,block_nested_loop=off,batched_key_access=on,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on
pid-file = /var/run/mysqld/mysqld.pid
port = 3306
query_cache_limit = 1M
query_cache_size = 128M
query_cache_type = ON
read_buffer_size = 244K
relay_log = mysqld-relay-bin
server_id = 1
skip-external-locking
skip_name_resolve = ON
slow_query_log = ON
slow_query_log_file = /var/log/mysql/mysqld.slowlog
socket = /var/run/mysqld/mysqld.sock
sort_buffer_size = 2047K
sql_mode = STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
ssl = false
ssl-ca = /etc/mysql/cacert.pem
ssl-cert = /etc/mysql/server-cert.pem
ssl-key = /etc/mysql/server-key.pem
thread_cache_size = 8
thread_stack = 256K
tmp_table_size = 128M
tmpdir = /tmp
user = mysql
userstat = ON

[mysqld-5.0]
myisam-recover = BACKUP

[mysqld-5.1]
myisam-recover = BACKUP

[mysqld-5.5]
myisam-recover = BACKUP

[mysqld-5.6]
myisam-recover-options = BACKUP

[mysqld-5.7]
myisam-recover-options = BACKUP

[mysqld_safe]
log-error = /var/log/mysql/error.log
nice = 0
socket = /var/run/mysqld/mysqld.sock

[mysqldump]
max_allowed_packet = 16M
quick
quote-names
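
One mitigation we are considering (purely an assumption on our part, we have not tried it yet) is to let the master verify the event checksums itself before sending events to the slave, so that a damaged event shows up on dbm1 rather than only as error 1236 on dbm2:

mysql> SET GLOBAL master_verify_checksum = ON;

and, to make it persistent, in the [mysqld] section of my.cnf:

master_verify_checksum = ON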

Does anybody have an idea how to find and mitigate the root cause of this type of replication error?

Thanks and best regards

Adrian