Hi, I have a 3-node PXC 5.6 cluster in Google Cloud, and one of its nodes is an asynchronous slave to an external MySQL 5.5 master.
Approximately once a day a random node drops out of the cluster. It rejoins automatically, but whenever the node that is the slave to the external master loses quorum, async replication stops and a manual fix is required.
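For reference, the manual fix amounts to restarting the async slave threads once the node is back in the primary component, roughly like this (a sketch only; the exact statements may vary, but the error log below itself suggests restarting the slave SQL thread):

-- on the us1 node, after it has rejoined the primary component
STOP SLAVE;
START SLAVE;
-- then confirm Slave_SQL_Running: Yes
SHOW SLAVE STATUS\G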
Log:
2015-10-07 00:58:42 3049 [Note] WSREP: (632e46c3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.240.0.4:4567
2015-10-07 00:58:43 3049 [Note] WSREP: (632e46c3, 'tcp://0.0.0.0:4567') reconnecting to a8001e25 (tcp://10.240.0.4:4567), attempt 0
2015-10-07 00:58:44 3049 [Note] WSREP: evs::proto(632e46c3, OPERATIONAL, view_id(REG,632e46c3,188)) suspecting node: a8001e25
2015-10-07 00:58:44 3049 [Note] WSREP: evs::proto(632e46c3, OPERATIONAL, view_id(REG,632e46c3,188)) suspected node without join message, declaring inactive
2015-10-07 00:58:45 3049 [Note] WSREP: view(view_id(NON_PRIM,632e46c3,188) memb {
632e46c3,0
} joined {
} left {
} partitioned {
a8001e25,0
})
2015-10-07 00:58:45 3049 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2015-10-07 00:58:45 3049 [Note] WSREP: Flow-control interval: [128, 128]
2015-10-07 00:58:45 3049 [Note] WSREP: Received NON-PRIMARY.
2015-10-07 00:58:45 3049 [Note] WSREP: view(view_id(NON_PRIM,632e46c3,189) memb {
632e46c3,0
} joined {
} left {
} partitioned {
a8001e25,0
})
2015-10-07 00:58:45 3049 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6783313)
2015-10-07 00:58:45 3049 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2015-10-07 00:58:45 3049 [Note] WSREP: Flow-control interval: [128, 128]
2015-10-07 00:58:45 3049 [Note] WSREP: Received NON-PRIMARY.
2015-10-07 00:58:45 3049 [Note] WSREP: New cluster view: global state: 64dccf41-5df8-11e5-96d7-d7aedca679d6:6783313, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2015-10-07 00:58:45 3049 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-10-07 00:58:45 3049 [Warning] Slave SQL: Error in Xid_log_event: Commit could not be completed, 'Deadlock found when trying to get lock; try restarting transaction', Error_code: 1213
2015-10-07 00:58:45 3049 [ERROR] Slave SQL: Error 'WSREP has not yet prepared node for application use' on query. Default database: 'api'. Query: 'UPDATE `servers` SET `load` = 65 WHERE `id` = 15', Error_code: 1047
2015-10-07 00:58:45 3049 [Warning] Slave: WSREP has not yet prepared node for application use Error_code: 1047
2015-10-07 00:58:45 3049 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000155' position 25650818
2015-10-07 00:58:45 3049 [Note] WSREP: New cluster view: global state: 64dccf41-5df8-11e5-96d7-d7aedca679d6:6783313, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2015-10-07 00:58:45 3049 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Relevant lines from the config:
innodb_buffer_pool_size=1G
innodb_buffer_pool_instances=1
innodb_autoinc_lock_mode=2
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=2
innodb_file_per_table
wsrep_provider_options="gcache.size=2G; gcs.fc_limit=128"
wsrep_sst_method=xtrabackup-v2
wsrep_slave_threads=4
wsrep_cluster_address=gcomm://us1,us2,us3
log-slave-updates=1  # only on the us1 node, which is the slave to the external master
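I have not changed any of the group-communication timeouts, so I assume the stock Galera 3.x defaults are in effect (listed here for context; these values come from my understanding of the defaults, not from my config):

evs.suspect_timeout = PT5S
evs.inactive_timeout = PT15S
evs.keepalive_period = PT1S
gmcast.peer_timeout = PT3S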
The VMs are of the n1-highcpu-4 type (4 vCPUs, 3.6 GB memory).
All VMs are in the same region/zone.
Database size ~1.1GB.
InnoDB tables.
No spikes in load.
Only simple SELECTs on indexed tables and simple single-row INSERTs.
The SELECTs go to the read-only replicated database, the INSERTs to a separate database.
Statistics from the last half hour (but they look almost the same at any time):
[--] Up for: 29m 48s (5K q [3.114 qps], 1K conn, TX: 3M, RX: 6M)
[--] Reads / Writes: 10% / 90%
No swap.
20GB SSD persistent disk.
root@us1:~# free -m
total used free shared buffers cached
Mem: 3559 2539 1019 0 132 1563
-/+ buffers/cache: 844 2714
Swap: 0 0 0
root@us1:~# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/4b493502-1d72-41b3-b451-9862214e43a0 20G 11G 8.4G 56% /
Do you have any insights into what could cause this PXC cluster on Google Cloud to be so unstable?
I have another identical PXC cluster on cheap ovh.com VMs (1 vCPU, 4 GB RAM, 20 GB SSD) and it works flawlessly.