IST of a 5.7 joiner fails against a 5.7 donor (message too long), works against 5.6

Hello,

Long story short: IST of a 5.7 node works against a 5.6 donor, but not against a 5.7 donor.

For several days I have been trying to upgrade our three-node Percona 5.6 cluster by adding two 5.7 nodes and then rotating the 5.6 nodes out of service.

1. The three 5.6 machines (node1, node2, node3) run CentOS 6.9 and Percona-XtraDB-Cluster-server-56-5.6.30 with galera2.
2. A fourth machine (node4) with CentOS 7.6 and Percona-XtraDB-Cluster-server-57-5.7.22 was added.
3. node4 has a few added config options compared to a 5.6 my.cnf, for app compatibility and to enable SST/IST between galera2 and galera3 (they are also part of the node5 config below).
4. node4 was empty, so an SST was triggered (successfully).
5. mysql_upgrade was executed successfully on node4.
6. node4 was restarted, IST worked. node4 is operational.
7. A fifth machine (node5) with CentOS 7.6 and Percona-XtraDB-Cluster-server-57-5.7.22 was added.
8. In addition to the config mods of node4, node5 was told to sync against node4 (wsrep_sst_donor="node4,").
9. node5 was empty, so an SST was triggered against node4 (successfully).
10. If mysql_upgrade is called, it exits with “no upgrade necessary” (makes sense, as the donor is 5.7).
11. It is worth noting that at this point node5 is fully functional: wsrep_last_applied and wsrep_last_committed get updated and are in sync with the other nodes.
12. node5 was restarted; IST begins, then breaks with “Message too long”.
13. node5 then starts an SST :(
14. Wiping node5 followed by SST/IST against a 5.6 node works.

The error log from step 12 is:

2019-01-28T07:42:21.253221Z 0 [Note] WSREP: Signalling provider to continue on SST completion.
2019-01-28T07:42:21.253258Z 0 [Note] WSREP: Initialized wsrep sidno 2
2019-01-28T07:42:21.253282Z 0 [Note] WSREP: SST received: ee258004-e21f-11e1-0800-ebda6a123e79:14683202377
2019-01-28T07:42:21.253417Z 2 [Note] WSREP: Receiving IST: 1374 writesets, seqnos 14683202377-14683203751
2019-01-28T07:42:21.253477Z 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.7.22-22-57-log' socket: '/tmp/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), Release rel22, Revision da86071, WSREP version 29.26, wsrep_29.26
2019-01-28T07:42:21.253534Z 0 [Note] WSREP: Receiving IST... 0.0% ( 0/1374 events) complete.
2019-01-28T07:42:21.285883Z 6 [ERROR] WSREP: receiving IST failed, node restart required: : 90 (Message too long)
at galera/src/write_set.cpp:segment():46
2019-01-28T07:42:21.285937Z 6 [ERROR] WSREP: failed trx: source: bf48d594-58f7-11e8-9f5b-a6cf205daf34 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 144931162 trx_id: 389345841046 seqnos (l: -1, g: 14683202378, s: 14683202377, d: 14683202377, ts: 1548661282426061330)
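
A small sketch for reading the two key log lines above; it is only a reading aid, not a diagnosis. It checks that the announced IST range matches the writeset count and computes how far into the range the failed transaction sits. The function names `ist_range` and `failed_global_seqno` are made up for this illustration, and the regular expressions assume the log format shown above.

```python
import re

# Log lines copied from the error log above.
IST_LINE = ("2019-01-28T07:42:21.253417Z 2 [Note] WSREP: Receiving IST: "
            "1374 writesets, seqnos 14683202377-14683203751")
FAILED_TRX = ("2019-01-28T07:42:21.285937Z 6 [ERROR] WSREP: failed trx: source: "
              "bf48d594-58f7-11e8-9f5b-a6cf205daf34 version: 2 local: 0 state: APPLYING "
              "flags: 1 conn_id: 144931162 trx_id: 389345841046 seqnos (l: -1, "
              "g: 14683202378, s: 14683202377, d: 14683202377, ts: 1548661282426061330)")


def ist_range(line):
    """Return (writeset count, first seqno, last seqno) from a 'Receiving IST' line."""
    m = re.search(r"Receiving IST: (\d+) writesets, seqnos (\d+)-(\d+)", line)
    if m is None:
        raise ValueError("not a 'Receiving IST' line")
    count, first, last = (int(x) for x in m.groups())
    return count, first, last


def failed_global_seqno(line):
    """Return the global seqno (the 'g:' field) from a 'failed trx' line."""
    m = re.search(r"\bg: (-?\d+)", line)
    if m is None:
        raise ValueError("no global seqno found")
    return int(m.group(1))


count, first, last = ist_range(IST_LINE)
g = failed_global_seqno(FAILED_TRX)

print("range matches count:", last - first == count)
print("offset of failed trx from SST position:", g - first)
```

With the lines above, the failed transaction's global seqno is exactly one past the SST position, so the IST appears to break on the very first writeset rather than somewhere in the middle of the range.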

The list above is the current (and reproducible) iteration of debugging.
IMHO it is neither a network nor a firewall problem: the nodes are in the same network segment, there is no firewall between them, latency is ca. 200ms, and SST works.

“Message too long” has been the error all along. According to TheInternet™, it should have been addressed by wsrep_max_ws_size=1073741824 and wsrep_max_ws_rows=131072.
Apparently not.
Does anyone have a pointer on where to look or how to debug “Message too long” further?
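
For readers not used to raw byte counts, here is a minimal sketch that converts the values mentioned above into readable units. `human_bytes` is a hypothetical helper written for this post; it says nothing about what limits the Galera provider itself enforces.

```python
def human_bytes(n):
    """Format a byte count using binary units (1 KiB = 1024 B)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


ws_size = 1073741824   # wsrep_max_ws_size from the node5 config
ws_rows = 131072       # wsrep_max_ws_rows from the node5 config

print(human_bytes(ws_size))   # wsrep_max_ws_size as a readable size
print(ws_rows == 2 ** 17)     # is the row limit a power of two?
```

So the configured writeset size limit is 1 GiB and the row limit is 2^17 rows.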

Configs:

node1 (5.6)

[client]
user = root
password = <PASSWORD>
port = 3306
socket = /tmp/mysql.sock

[mysqld_multi]

[mysqld_safe]
core-file-size=unlimited

[sst]
streamfmt=xbstream
progress=1
time=1
rlimit=75m

[mysqld]
wsrep_cluster_name=gfdbcluster
wsrep_node_name=node1
wsrep_slave_threads=4
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_node_address=10.10.26.73

wsrep_sst_method=xtrabackup-v2
wsrep_provider_options="gcache.size=4G"
wsrep_sst_auth=root:<PASSWORD>
wsrep_cluster_address=gcomm://10.10.26.36,10.10.26.37,10.10.26.38,10.10.26.39

user=mysql

character_set_server = utf8mb4

;read_only
skip-name-resolve
skip-slave-start
port = 3306
socket = /tmp/mysql.sock
#skip-locking
key_buffer_size = 64M
max_allowed_packet = 1M
max_connections = 2048
read_buffer_size = 1M
read_rnd_buffer_size = 4M
myisam_sort_buffer_size = 64M
thread_cache_size = 64
datadir = /schooner/data/xtradb
basedir = /usr/
default-storage-engine=innodb
log-error = /schooner/data/xtradb/xtradb-mysql.err
core-file

query_cache_size = 0
query_cache_type = 0

slow_query_log = 1
long_query_time = 30

tmpdir = /schooner/data/tmp/xtradb
server-id = 101071
log-bin = binlog
log_slave_updates = 1

binlog_format = ROW
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1

innodb_file_per_table

innodb_data_home_dir = /schooner/data/xtradb
innodb_data_file_path = ibdata1:100M:autoextend
innodb_log_group_home_dir = /schooner/data/xtradb

innodb_buffer_pool_size = 148G
innodb_additional_mem_pool_size = 20M

innodb_log_file_size = 2G
innodb_log_buffer_size = 16M
innodb_flush_log_at_trx_commit = 0
innodb_read_ahead-threshold = 0
innodb_flush_method = O_DIRECT
innodb_doublewrite = 0
expire_logs_days = 3

innodb_open_files = 4096
open_files_limit = 32768

tmp_table_size = 256M
max_heap_table_size = 256M

[mysqldump]
quick
max_allowed_packet = 16M

[mysql]
no-auto-rehash

[isamchk]
key_buffer_size = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[myisamchk]
key_buffer_size = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[mysqlhotcopy]
interactive-timeout

node5 (5.7)

[client]
user = root
password = <PASSWORD>
port = 3306
socket = /tmp/mysql.sock

[mysqld_multi]

[mysqld_safe]
core-file-size=unlimited

[sst]
streamfmt=xbstream
progress=1
time=1
rlimit=75m

[mysqld]
sql-mode="STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"
pxc_strict_mode = PERMISSIVE

wsrep_cluster_name=gfdbcluster
wsrep_node_name=node5
wsrep_slave_threads=4
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_node_address=10.10.26.36

wsrep_sst_method=xtrabackup-v2
wsrep_provider_options="gcache.size=4G; socket.checksum=1"
wsrep_sst_auth=root:<PASSWORD>
wsrep_cluster_address=gcomm://10.10.26.37,10.10.26.38,10.10.26.39,10.10.26.73
wsrep_sst_donor="node4,"
wsrep_max_ws_size=1073741824
wsrep_max_ws_rows=131072

user=mysql

character_set_server = utf8mb4

skip-name-resolve
skip-slave-start
port = 3306
socket = /tmp/mysql.sock
#skip-locking
key_buffer_size = 64M
max_allowed_packet = 1M
max_connections = 2048
read_buffer_size = 1M
read_rnd_buffer_size = 4M
myisam_sort_buffer_size = 64M
thread_cache_size = 64
datadir = /schooner/data/xtradb/
basedir = /usr/
default-storage-engine=innodb
log-error = /schooner/data/xtradb/xtradb-mysql.err
core-file


slow_query_log = 1
long_query_time = 30

tmpdir = /schooner/data/tmp/xtradb/
server-id = 101071
log-bin = binlog
log_slave_updates = 1

binlog_format = ROW
max_binlog_files = 10
max_binlog_size = 1G
innodb_autoinc_lock_mode=2

innodb_file_per_table

innodb_data_home_dir = /schooner/data/xtradb
innodb_data_file_path = ibdata1:100M:autoextend
innodb_log_group_home_dir = /schooner/data/xtradb

innodb_buffer_pool_size = 148G

innodb_log_file_size = 2G
innodb_log_buffer_size = 16M
innodb_flush_log_at_trx_commit = 0
innodb_read_ahead-threshold = 0
innodb_flush_method = O_DIRECT
innodb_doublewrite = 0
expire_logs_days = 3

innodb_open_files = 4096
open_files_limit = 32768

tmp_table_size = 256M
max_heap_table_size = 256M

[mysqldump]
quick
max_allowed_packet = 16M

[mysql]
no-auto-rehash

[isamchk]
key_buffer_size = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[myisamchk]
key_buffer_size = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[mysqlhotcopy]
interactive-timeout
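
Since the two configs are long, a quick way to see what actually differs between the 5.6 and 5.7 [mysqld] sections is to diff them programmatically. The following is a minimal sketch using Python's standard configparser. The helper names `mysqld_options` and `diff_options` are made up for this post, and the two strings are short excerpts of the configs above rather than the full files.

```python
import configparser

ABSENT = "<absent>"  # marker for an option that is missing on one side


def mysqld_options(text):
    """Return the [mysqld] section of a my.cnf-style string as a dict."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # bare flags such as "core-file"
        strict=False,          # tolerate repeated options
        interpolation=None,    # treat "%" literally
        delimiters=("=",),     # do not split on ":" (e.g. in wsrep_sst_auth)
    )
    parser.read_string(text)
    # mysqld accepts "-" and "_" interchangeably in option names.
    return {k.replace("-", "_"): v for k, v in parser["mysqld"].items()}


def diff_options(old, new):
    """Map each differing option name to its (old, new) pair of values."""
    result = {}
    for key in sorted(set(old) | set(new)):
        a, b = old.get(key, ABSENT), new.get(key, ABSENT)
        if a != b:
            result[key] = (a, b)
    return result


# Excerpts only; in practice read the real files with open(...).read().
NODE1 = """
[mysqld]
wsrep_node_name=node1
wsrep_provider_options="gcache.size=4G"
innodb_locks_unsafe_for_binlog=1
core-file
"""

NODE5 = """
[mysqld]
pxc_strict_mode = PERMISSIVE
wsrep_node_name=node5
wsrep_provider_options="gcache.size=4G; socket.checksum=1"
wsrep_max_ws_size=1073741824
core-file
"""

diff = diff_options(mysqld_options(NODE1), mysqld_options(NODE5))
for name, (old, new) in diff.items():
    print(f"{name}: {old} -> {new}")
```

Options that are identical on both sides (here core-file) are left out, so only the additions, removals and changed values remain.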