Cannot join cluster after node upgrade

Hello,

I have upgraded one node of an existing cluster, which was running Percona XtraDB Cluster 5.6 on Debian wheezy with the following package versions:


percona-xtrabackup 2.2.12-1.wheezy
percona-xtradb-cluster-5.6-dbg 5.6.25-25.12-1.wheezy
percona-xtradb-cluster-client-5.6 5.6.25-25.12-1.wheezy
percona-xtradb-cluster-common-5.6 5.6.25-25.12-1.wheezy
percona-xtradb-cluster-full-56 5.6.25-25.12-1.wheezy
percona-xtradb-cluster-galera-3 3.9.3494.wheezy
percona-xtradb-cluster-galera-3.x 3.9.3494.wheezy
percona-xtradb-cluster-galera-3.x-dbg 3.9.3494.wheezy
percona-xtradb-cluster-galera3-dbg 3.9.3494.wheezy
percona-xtradb-cluster-garbd-3 3.9.3494.wheezy
percona-xtradb-cluster-garbd-3.x 3.9.3494.wheezy
percona-xtradb-cluster-garbd-3.x-dbg 3.9.3494.wheezy
percona-xtradb-cluster-server-5.6 5.6.25-25.12-1.wheezy
percona-xtradb-cluster-server-debug-5.6 5.6.25-25.12-1.wheezy
percona-xtradb-cluster-test-5.6 5.6.25-25.12-1.wheezy

to Debian jessie with Percona XtraDB Cluster 5.6 and the following package versions:


percona-release 0.1-3.jessie
percona-xtrabackup 2.3.4-1.jessie
percona-xtradb-cluster-5.6-dbg 5.6.29-25.15-1.jessie
percona-xtradb-cluster-client-5.6 5.6.29-25.15-1.jessie
percona-xtradb-cluster-common-5.6 5.6.29-25.15-1.jessie
percona-xtradb-cluster-full-56 5.6.29-25.15-1.jessie
percona-xtradb-cluster-galera-3 3.15-1.jessie
percona-xtradb-cluster-galera-3.x 3.15-1.jessie
percona-xtradb-cluster-galera-3.x-dbg 3.15-1.jessie
percona-xtradb-cluster-galera3-dbg 3.15-1.jessie
percona-xtradb-cluster-garbd-3 3.15-1.jessie
percona-xtradb-cluster-garbd-3.x 3.15-1.jessie
percona-xtradb-cluster-garbd-3.x-dbg 3.15-1.jessie
percona-xtradb-cluster-server-5.6 5.6.29-25.15-1.jessie
percona-xtradb-cluster-server-debug-5.6 5.6.29-25.15-1.jessie
percona-xtradb-cluster-test-5.6 5.6.29-25.15-1.jessie

When I start the node to join the cluster, I get the following errors in the log file:


2016-05-24 12:58:13 8254 [Note] WSREP: Read nil XID from storage engines, skipping position init
2016-05-24 12:58:13 8254 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera3/libgalera_smm.so'
2016-05-24 12:58:13 8254 [Note] WSREP: wsrep_load(): Galera 3.9(r93aca2d) by Codership Oy <info@codership.com> loaded successfully.
2016-05-24 12:58:13 8254 [Note] WSREP: CRC-32C: using hardware acceleration.
2016-05-24 12:58:13 8254 [Warning] WSREP: Could not open saved state file for reading: /var/lib/client.sql/test-cluster//grastate.dat
2016-05-24 12:58:13 8254 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
2016-05-24 12:58:13 8254 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/client.sql/test-cluster/; base_host = 10.21.97.98; base_port = 14039; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/client.sql/test-cluster/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/client.sql/test-cluster//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum =
2016-05-24 12:58:13 8254 [Note] WSREP: Service thread queue flushed.
2016-05-24 12:58:13 8254 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2016-05-24 12:58:13 8254 [Note] WSREP: wsrep_sst_grab()
2016-05-24 12:58:13 8254 [Note] WSREP: Start replication
2016-05-24 12:58:13 8254 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2016-05-24 12:58:13 8254 [Note] WSREP: protonet asio version 0
2016-05-24 12:58:13 8254 [Note] WSREP: Using CRC-32C for message checksums.
2016-05-24 12:58:13 8254 [Note] WSREP: backend: asio
2016-05-24 12:58:13 8254 [Warning] WSREP: access file(/var/lib/client.sql/test-cluster//gvwstate.dat) failed(No such file or directory)
2016-05-24 12:58:13 8254 [Note] WSREP: restore pc from disk failed
2016-05-24 12:58:13 8254 [Note] WSREP: GMCast version 0
2016-05-24 12:58:13 8254 [Note] WSREP: (2ccf1242, 'tcp://0.0.0.0:14039') listening at tcp://0.0.0.0:14039
2016-05-24 12:58:13 8254 [Note] WSREP: (2ccf1242, 'tcp://0.0.0.0:14039') multicast: , ttl: 1
2016-05-24 12:58:13 8254 [Note] WSREP: EVS version 0
2016-05-24 12:58:13 8254 [Note] WSREP: gcomm: connecting to group 'test-cluster', peer '10.21.97.98:,10.254.60.210:,10.48.49.211:'
2016-05-24 12:58:13 8254 [Warning] WSREP: (2ccf1242, 'tcp://0.0.0.0:14039') address 'tcp://10.21.97.98:14039' points to own listening address, blacklisting
2016-05-24 12:58:16 8254 [Warning] WSREP: no nodes coming from prim view, prim not possible
2016-05-24 12:58:16 8254 [Note] WSREP: view(view_id(NON_PRIM,2ccf1242,1) memb {
2ccf1242,0
} joined {
} left {
} partitioned {
})
2016-05-24 12:58:17 8254 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.53097S), skipping check
2016-05-24 12:58:46 8254 [Note] WSREP: view((empty))
2016-05-24 12:58:46 8254 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():162
2016-05-24 12:58:46 8254 [ERROR] WSREP: gcs/src/gcs_core.cpp:long int gcs_core_open(gcs_core_t*, const char*, const char*, bool)():206: Failed to open backend connection: -110 (Connection timed out)
2016-05-24 12:58:46 8254 [ERROR] WSREP: gcs/src/gcs.cpp:long int gcs_open(gcs_conn_t*, const char*, const char*, bool)():1379: Failed to open channel 'test-cluster' at 'gcomm://10.21.97.98,10.254.60.210,10.48.49.211': -110 (Connection timed out)
2016-05-24 12:58:46 8254 [ERROR] WSREP: gcs connect failed: Connection timed out
2016-05-24 12:58:46 8254 [ERROR] WSREP: wsrep::connect(gcomm://10.21.97.98,10.254.60.210,10.48.49.211) failed: 7
2016-05-24 12:58:46 8254 [ERROR] Aborting

2016-05-24 12:58:46 8254 [Note] WSREP: Service disconnected.
2016-05-24 12:58:47 8254 [Note] WSREP: Some threads may fail to exit.
2016-05-24 12:58:47 8254 [Note] Binlog end
2016-05-24 12:58:47 8254 [Note] mysqld: Shutdown complete

Any idea how to fix this issue and get a fully working upgrade? Note that eventually I want to upgrade all the nodes to the same version of the OS and Percona XtraDB Cluster.

Thanks in advance

  1. Make sure all nodes run the same xtrabackup version. This matters most for the joiner node, which must have the same xtrabackup version as the donor node.
  2. Check the wsrep status on a non-upgraded node first and make sure the cluster is in the PRIM (Primary) state.
  3. Once #1 and #2 are done, try starting the upgraded node again.
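
If it helps, checks #1 and #2 could be scripted roughly like this. It is only a sketch: it assumes you paste in the `name version` package listings (in the format shown in the question) and the output of `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"` taken on a running, non-upgraded node. The function names and the sample status line are made up for illustration; the two package lines are copied from the question.

```python
# Sketch only: helper names are hypothetical, not part of any Percona tool.

def parse_packages(listing):
    """Turn 'name version' lines into a {name: version} dict."""
    packages = {}
    for line in listing.strip().splitlines():
        fields = line.split()
        if len(fields) == 2:
            packages[fields[0]] = fields[1]
    return packages


def upstream_version(debian_version):
    """Drop the Debian revision, e.g. '2.2.12-1.wheezy' -> '2.2.12'."""
    return debian_version.split("-", 1)[0]


def xtrabackup_matches(donor_listing, joiner_listing):
    """True if donor and joiner have the same upstream xtrabackup version."""
    donor = parse_packages(donor_listing)["percona-xtrabackup"]
    joiner = parse_packages(joiner_listing)["percona-xtrabackup"]
    return upstream_version(donor) == upstream_version(joiner)


def is_primary(status_output):
    """True if the pasted status output reports wsrep_cluster_status = Primary."""
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "wsrep_cluster_status":
            return fields[1] == "Primary"
    return False


# Package lines taken from the question (old node = donor, upgraded node = joiner).
DONOR_PACKAGES = "percona-xtrabackup 2.2.12-1.wheezy"
JOINER_PACKAGES = "percona-xtrabackup 2.3.4-1.jessie"

# Assumed sample of what a healthy non-upgraded node would print.
SAMPLE_STATUS = "wsrep_cluster_status\tPrimary"

if __name__ == "__main__":
    print("xtrabackup versions match:", xtrabackup_matches(DONOR_PACKAGES, JOINER_PACKAGES))
    print("cluster is primary:", is_primary(SAMPLE_STATUS))
```

With the package versions from the question, the first check reports a mismatch (2.2.12 on the old node, 2.3.4 on the upgraded one), which is what #1 asks you to fix before retrying.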