Cluster Failure after Online Schema Upgrade

[B]Hello,

My whole Percona XtraDB Cluster (5.6.30-76.3-56) failed after an RSU operation. I have 5 nodes in the cluster. I am listing all the steps below.

I was adding a new column to a table using a rolling schema upgrade. Everything was fine, and I have done the same procedure many times before.

The steps are below. I applied them to the first three nodes and all three are OK. But on the 4th node, when I applied step 8, Node4 crashed first, and after 10-15 seconds the whole cluster failed. I am sharing a successful node's log and the failed node's log at the bottom.[/B]

[I]1.) Disable the node on the load balancer and wait for active connections to end

2.) SET GLOBAL wsrep_OSU_method='RSU';

3.) SET GLOBAL wsrep_slave_threads=1;

4.) SET GLOBAL foreign_key_checks=OFF;

5.) SET foreign_key_checks=OFF;

6.) SET GLOBAL wsrep_cluster_address="gcomm://";

7.) ALTER TABLE tablename ADD COLUMN columnname VARCHAR(4) DEFAULT '';

8.) SET GLOBAL wsrep_cluster_address="gcomm://xxx.23.yyy.101:4567,xxx.23.yyy.103:4567,xxx.23.yyy.102:4567,xxx.17.yyy.101:4567,xxx.17.yyy.102:4567";

9.) SET foreign_key_checks=ON;

10.) SET GLOBAL foreign_key_checks=ON;

11.) SET GLOBAL wsrep_slave_threads=8;

12.) SET GLOBAL wsrep_OSU_method='TOI';

13.) Enable the node on the load balancer (a verification sketch follows the list)[/I]
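A minimal verification sketch for the node just rejoined after step 8, assuming the placeholder tablename/columnname from step 7 and a 5-node cluster:

SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- expect 'Synced' once the node has caught up
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- expect 5 with all nodes joined
SHOW GLOBAL VARIABLES LIKE 'wsrep_OSU_method';        -- expect 'TOI' again after step 12
SHOW COLUMNS FROM tablename LIKE 'columnname';        -- the new column should be present locally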

After the failure I bootstrapped the most advanced node and brought the cluster back up. The log says mysqld got signal 11, but I have no idea about this failure, or why the whole cluster went down instead of just the failed node. Is it a bug? Maybe I am missing something. Please help.
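For the recovered cluster, a minimal health-check sketch (standard wsrep status variables only, nothing specific to this setup is assumed):

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';   -- expect 'Primary'
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';     -- expect 5 once every node has rejoined
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';   -- compare across nodes to see which is most advanced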

I have attached the successful and failed nodes' logs. You can also find the failed node's log below.

2017-11-01 16:22:10 3120 [Note] WSREP: GMCast version 0
2017-11-01 16:22:10 3120 [Note] WSREP: (aa8776f3, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-11-01 16:22:10 3120 [Note] WSREP: (aa8776f3, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-11-01 16:22:10 3120 [Note] WSREP: EVS version 0
2017-11-01 16:22:10 3120 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer 'xxx.23.yyy.101:4567,xxx.23.yyy.103:4567,xxx.23.yyy.102:4567,xxx.17.yyy.101:4567,xxx.17.yyy.102:4567'
2017-11-01 16:22:10 3120 [Warning] WSREP: (aa8776f3, 'tcp://0.0.0.0:4567') address 'tcp://xxx.23.yyy.101:4567' points to own listening address, blacklisting
2017-11-01 16:22:10 3120 [Note] WSREP: (aa8776f3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2017-11-01 16:22:11 3120 [Note] WSREP: declaring 4d5ca708 at tcp://xxx.23.yyy.103:4567 stable
2017-11-01 16:22:11 3120 [Note] WSREP: declaring 6268fd11 at tcp://xxx.23.yyy.102:4567 stable
2017-11-01 16:22:11 3120 [Note] WSREP: declaring f87ad44d at tcp://xxx.17.yyy.101:4567 stable
2017-11-01 16:22:11 3120 [Note] WSREP: Node 4d5ca708 state prim
2017-11-01 16:22:11 3120 [Note] WSREP: view(view_id(PRIM,4d5ca708,1274) memb {
4d5ca708,0
6268fd11,0
aa8776f3,0
f87ad44d,0
} joined {
} left {
} partitioned {
})
2017-11-01 16:22:11 3120 [Note] WSREP: save pc into disk
2017-11-01 16:22:11 3120 [Note] WSREP: discarding pending addr without UUID: tcp://xxx.17.yyy.102:4567
2017-11-01 16:22:11 3120 [Note] WSREP: gcomm: connected
2017-11-01 16:22:11 3120 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-11-01 16:22:11 3120 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 3691718)
2017-11-01 16:22:11 3120 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2017-11-01 16:22:11 3120 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 4
13:22:11 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at [URL]https://bugs.launchpad.net/percona-xtradb-cluster[/URL]

key_buffer_size=25165824
read_buffer_size=131072
max_used_connections=21
max_threads=202
thread_count=9
connection_count=7
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 105041 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x2b5e200
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fab140a8d38 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x906e45]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x66ac44]
/lib64/libpthread.so.0(+0xf7e0)[0x7fab58eda7e0]
/usr/lib64/libgalera_smm.so(_Z20gcs_group_get_statusPK9gcs_groupRN2gu6StatusE+0x1a9)[0x7fab3c29f769]
/usr/lib64/libgalera_smm.so(_Z19gcs_core_get_statusP8gcs_coreRN2gu6StatusE+0x6f)[0x7fab3c2a31df]
/usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM9stats_getEv+0xafe)[0x7fab3c308a5e]
/usr/sbin/mysqld(_Z17wsrep_show_statusP3THDP17st_mysql_show_varPc+0x25)[0x59da85]
/usr/sbin/mysqld[0x73428c]
/usr/sbin/mysqld(_Z11fill_statusP3THDP10TABLE_LISTP4Item+0xd6)[0x737c46]
/usr/sbin/mysqld(_Z24get_schema_tables_resultP4JOIN23enum_schema_table_state+0x2f1)[0x723ba1]
/usr/sbin/mysqld(_ZN4JOIN14prepare_resultEPP4ListI4ItemE+0x9d)[0x717d6d]
/usr/sbin/mysqld(_ZN4JOIN4execEv+0xdf)[0x6d02ef]
/usr/sbin/mysqld(_Z12mysql_selectP3THDP10TABLE_LISTjR4ListI4ItemEPS4_P10SQL_I_ListI8st_orderESB_S7_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x250)[0x71a920]
/usr/sbin/mysqld(_Z13handle_selectP3THDP13select_resultm+0x187)[0x71b1a7]
/usr/sbin/mysqld[0x6ef5ed]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0xe30)[0x6f1920]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x628)[0x6f7e18]
/usr/sbin/mysqld[0x6f7fb2]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1896)[0x6fa1e6]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x22b)[0x6fba9b]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x17f)[0x6c216f]
/usr/sbin/mysqld(handle_one_connection+0x47)[0x6c2357]
/usr/sbin/mysqld(pfs_spawn_thread+0x12a)[0x994b6a]
/lib64/libpthread.so.0(+0x7aa1)[0x7fab58ed2aa1]
/lib64/libc.so.6(clone+0x6d)[0x7fab571c9bcd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fa90408c2a0): is an invalid pointer
Connection ID (thread ID): 8853541
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
[URL]http://www.percona.com/software/percona-xtradb-cluster/[/URL]. You may find information
in the manual which will help you identify the cause of the crash.
171101 16:22:11 mysqld_safe Number of processes running now: 0
171101 16:22:11 mysqld_safe WSREP: not restarting wsrep node automatically
171101 16:22:11 mysqld_safe mysqld from pid file /var/lib/mysql/mysqld.pid ended

Thank you for this report and also thank you for logging it on Launchpad - it has been seen and will be documented via that route.
For anyone who arrives on this forum page and needs to follow progress, you can find the entry at [url]https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1729592[/url]

As commented on the bug, you are probably hitting a known issue tracked in Percona JIRA.
This was a regression that affected 5.6.30 and was fixed in 5.6.32.

I would strongly recommend you retry this use case with the latest PXC 5.6.37 (you can also upgrade to 5.7.19, our 10x performance-optimized PXC variant).

If you still hit the bug, I can certainly investigate.
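If it helps, a quick check of what is actually running after an upgrade (standard variables, nothing version-specific assumed):

SELECT @@version, @@version_comment;                -- server version and build
SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';   -- Galera provider version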

Hello krunalbauskar,

I have upgraded XtraDB Cluster from 5.6.30 to 5.6.37, and there seems to be no problem with rolling schema upgrades anymore. But now I am facing another problem: I am using a single master but getting deadlock exceptions. There were no deadlocks at all when I used 5.6.30, but now I get about 10 deadlock exceptions daily, and every deadlock increments the wsrep_local_cert_failures counter. I expected that with a single master I would only see MySQL-level deadlocks, but it looks like I am hitting certification conflicts?
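A small monitoring sketch for this, assuming the application-side errors are MySQL error 1213 (ER_LOCK_DEADLOCK), which is how certification conflicts are reported to the client; the counters are standard wsrep status variables:

SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';  -- certification conflicts on this node
SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';      -- local transactions aborted by replicated ones
SHOW GLOBAL STATUS LIKE 'wsrep_replicated';           -- total write-sets sent, to put ~10/day in context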

A single master producing deadlock exceptions sounds weird.

Can you share a test case and your configuration options?
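In case it is useful, a minimal sketch of what to collect for such a report (generic wsrep variables and status only):

SHOW GLOBAL VARIABLES LIKE 'wsrep%';   -- cluster and node configuration
SHOW GLOBAL STATUS LIKE 'wsrep%';      -- runtime counters, including the cert failures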

Hello krunalbauskar,

I have 5 nodes: 3 in one country and 2 in another. I have 10 apps. 8 of them have a very high write rate and use only one cluster node; the other 2 are read-heavy, so they use the other cluster nodes. The deadlocks are at the cluster level, because I cannot see any deadlock in MySQL (SHOW ENGINE INNODB STATUS) after the deadlock exceptions. The deadlocks only come from the 8 write-heavy apps that write to a single node. My wsrep_provider options are below (a sketch for logging the conflicts follows the list).

base_dir = /var/lib/mysql/;
base_host = 172.xx.10.yyy;
base_port = 4567;
cert.log_conflicts = no;
debug = no;
evs.auto_evict = 0;
evs.causal_keepalive_period = PT3S;
evs.debug_log_mask = 0x1;
evs.delay_margin = PT1S;
evs.delayed_keep_period = PT30S;
evs.inactive_check_period = PT10S;
evs.inactive_timeout = PT1M;
evs.info_log_mask = 0;
evs.install_timeout = PT1M;
evs.join_retrans_period = PT1S;
evs.keepalive_period = PT3S;
evs.max_install_timeouts = 3;
evs.send_window = 1024;
evs.stats_report_period = PT1M;
evs.suspect_timeout = PT30S;
evs.use_aggregate = true;
evs.user_send_window = 512;
evs.version = 0;
evs.view_forget_timeout = P1D;
gcache.dir = /var/lib/mysql/;
gcache.keep_pages_count = 0;
gcache.keep_pages_size = 0;
gcache.mem_size = 0;
gcache.name = /var/lib/mysql//galera.cache;
gcache.page_size = 128M;
gcache.recover = no;
gcache.size = 1G;
gcomm.thread_prio = ;
gcs.fc_debug = 0;
gcs.fc_factor = 1.0;
gcs.fc_limit = 16;
gcs.fc_master_slave = no;
gcs.max_packet_size = 64500;
gcs.max_throttle = 0.25;
gcs.recv_q_hard_limit = 9223372036854775807;
gcs.recv_q_soft_limit = 0.25;
gcs.sync_donor = no;
gmcast.listen_addr = tcp://0.0.0.0:4567;
gmcast.mcast_addr = ;
gmcast.mcast_ttl = 1;
gmcast.peer_timeout = PT3S;
gmcast.segment = 0;
gmcast.time_wait = PT5S;
gmcast.version = 0;
ist.recv_addr = 172.xx.10.yyy;
pc.announce_timeout = PT3S;
pc.checksum = false;
pc.ignore_quorum = false;
pc.ignore_sb = false;
pc.linger = PT20S;
pc.npvo = false;
pc.recovery = true;
pc.version = 0;
pc.wait_prim = true;
pc.wait_prim_timeout = PT30S;
pc.weight = 1;
protonet.backend = asio;
protonet.version = 0;
repl.causal_read_timeout = PT30S;
repl.commit_order = 3;
repl.key_format = FLAT8;
repl.max_ws_size = 2147483647;
repl.proto_max = 7;
socket.checksum = 2;
socket.recv_buf_size = 212992;
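Since cert.log_conflicts is set to no in the list above, a hedged sketch for turning it on at runtime, so the conflicting keys are written to the error log (this only changes logging, not certification behaviour):

SET GLOBAL wsrep_provider_options = 'cert.log_conflicts=YES';
SELECT @@global.wsrep_provider_options;   -- verify the option took effect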

A local certification failure is not an issue as long as it is not causing a crash or data inconsistency. An LCF is actually protecting you from applying a conflicting transaction.
Can you reproduce the problem on a smaller scale and share the test case?
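If it helps with a test case, a minimal sketch of the first-committer-wins behaviour behind these errors, using a hypothetical throwaway table t1 and two sessions on two different nodes:

-- on any node (DDL replicates under TOI); t1 is a hypothetical demo table
CREATE TABLE t1 (id INT PRIMARY KEY, val INT) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1, 0);

-- session A on node 1
START TRANSACTION;
UPDATE t1 SET val = val + 1 WHERE id = 1;

-- session B on node 2, before A commits
START TRANSACTION;
UPDATE t1 SET val = val + 10 WHERE id = 1;

-- session A commits first and wins
COMMIT;

-- session B's commit now fails with error 1213 (reported as a deadlock);
-- depending on timing it shows up as wsrep_local_cert_failures or wsrep_local_bf_aborts on node 2
COMMIT;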

Hello krunalbauskar,

It was related to my 2 read-heavy apps: some of their write queries were still going to other nodes. I redirected those write queries to my write-master node via ProxySQL and the problem is fixed. Before the upgrade I had no deadlocks at all, even when these 2 apps wrote to different nodes, so maybe it is related to the upgrade. Currently I am using only one node for all write requests and everything seems OK.
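For reference, a hedged sketch of the kind of ProxySQL routing that keeps all writes on one node; the hostgroup numbers (10 = writer, 20 = readers), the server addresses, and the rule patterns are illustrative placeholders, not the actual configuration, and they are issued in the ProxySQL admin interface:

-- one writer hostgroup with a single node, readers in another hostgroup
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, 'writer-node-address', 3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (20, 'reader-node-address', 3306);

-- route SELECT ... FOR UPDATE and everything unmatched to the writer,
-- plain SELECTs to the readers (the user's default_hostgroup in mysql_users should be 10)
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT.*FOR UPDATE', 10, 1);
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (2, 1, '^SELECT', 20, 1);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;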

Thank you,