Issues with XtraDB cluster

Hi, I set up an XtraDB cluster with 2 nodes over a 1GB NIC. The nodes are not in the same datacenter; they are approximately 90 miles apart. I also have an arbitrator set up on a 3rd machine. I am encountering 2 issues with the cluster:

  1. If I shut down one node and restart it while there is no activity on the other node, it does an IST sync and everything looks OK.
    If I shut down the node and do one small DML or DDL on the other node, the restarted node automatically does an SST sync. This is not right. Is there any setup/config parameter that I am missing here?
  2. For some reason the main node is crashing with errors like this (please note that host: ‘nylxdev1’ is the arbitrator):
    2014-02-07 17:29:37 24636 [Warning] Aborted connection 674 to db: ‘fixdb’ user: ‘fixusr’ host: ‘nylxdev1’ (Got an error reading communication packets)
    2014-02-07 17:29:37 24636 [Warning] Aborted connection 675 to db: ‘fixdb’ user: ‘fixusr’ host: ‘nylxdev1’ (Got an error reading communication packets)
    2014-02-07 17:31:32 24636 [Warning] InnoDB: Cannot open table fixdb/session_1_cache_tbl from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.
    2014-02-07 17:31:32 24636 [Warning] InnoDB: Cannot open table fixdb/session_2_cache_tbl from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.
    2014-02-07 17:31:32 24636 [Warning] InnoDB: Cannot open table fixdb/session_3_cache_tbl from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.
    2014-02-07 17:32:42 24636 [Warning] Aborted connection 719 to db: ‘fixdb’ user: ‘fixusr’ host: ‘nylxdev1’ (Got an error reading communication packets)
    (working on trying to find out why).

After the crash I try to restart the node, hoping that it will do an SST and everything will get back in sync. However, the restart fails with the error:
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html
2014-02-10 16:52:04 26472 [Note] Event Scheduler: Loaded 0 events
2014-02-10 16:52:04 26472 [Note] WSREP: Signalling provider to continue.
2014-02-10 16:52:04 26472 [Note] WSREP: inited wsrep sidno 2
2014-02-10 16:52:04 26472 [Note] WSREP: SST received: 3ba0b3bd-83b4-11e3-8b56-47a4208ad584:10746753
2014-02-10 16:52:04 26472 [Note] WSREP: Receiving IST: 381 writesets, seqnos 10746753-10747134
2014-02-10 16:52:04 26472 [Note] /bb/bin/mysql/pxc5.6/bin/mysqld: ready for connections.
Version: ‘5.6.15-25.2-log’ socket: ‘/bb/bin/mysql/sockets/mysql.sock.4319’ port: 4319 Percona XtraDB Cluster (GPL) 5.6.15-25.2, Revision 645, wsrep_25.2.r4027
2014-02-10 16:52:04 26472 [ERROR] Slave SQL: Error ‘Can’t create database ‘fixdb’; database exists’ on query. Default database: ‘fixdb’. Query: ‘CREATE DATABASE fixdb’, Error_code: 1007
2014-02-10 16:52:04 26472 [Warning] WSREP: RBR event 1 Query apply warning: 1, 10746754
2014-02-10 16:52:04 26472 [Warning] WSREP: Ignoring error for TO isolated action: source: 4c9b2f63-9041-11e3-bb74-268d5bf12d49 version: 3 local: 0 state: APPLYING flags: 321 conn_id: 1756 trx_id: -1 seqnos (l: -1, g: 10746754, s: 10746753, d: 10746753, ts: 190772022539233)
mysqld: /mnt/workspace/percona-xtradb-cluster-5.6-binary/label_exp/centos5-64/target/percona-build.B12964/src/Percona-XtraDB-Cluster-5.6.15/sql/wsrep_applier.cc:321: wsrep_cb_status_t wsrep_commit_cb(void*, uint32_t, const wsrep_trx_meta_t*, wsrep_bool_t*, bool): Assertion `meta->gtid.seqno == wsrep_thd_trx_seqno(thd)’ failed.
21:52:04 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,

Why is the cluster still trying to create the fixdb database when this was an SST and I am expecting a clean initialization? No CREATE DATABASE command is run after the engine comes up, as no one is connected.
Please advise.
Thank you,
Liviu

Is this a fresh 5.6 install, or did you upgrade from an older version to 5.6 and then hit this problem?
It looks like some table files got corrupted.

This is a fresh install. I also dropped everything and re-added the node. This node crashes all the time. If I create a table on one side, on this node I do not see the *.frm file; I see only the .ibd file created. Because of this, when the table is dropped (this is part of testing) I get errors that the table doesn’t exist, and the node crashes. Do you know why the *.frm file is not created?

Is the crashed node primary or secondary?
Try running mysqlcheck -u root -p --auto-repair --optimize --all-databases on that node.

I have re-created the cluster after cleaning the folder and re-creating a fresh mysql database with mysql_install_db. The cluster doesn’t crash anymore. However, I still have the initial problem: whenever I stop a node and make one small change on the other node, the stopped node does a full SST when I bring it back up. My galera.cache is set to 5 GB. There is no way a drop database command will need more. Do you know if there is a setting that triggers SST in this case?
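
For readers following the IST-versus-SST question, here is a rough sketch of the rule Galera is generally understood to apply: a joiner can be served incrementally only if it shares the cluster's state UUID, knows its own position, and the writesets it is missing are still in the donor's gcache. The function and parameter names are hypothetical, and donor_cached_downto is assumed to correspond to the donor's wsrep_local_cached_downto status value, which may not be exposed in every build. It is a simplified model, not the provider's actual code.

```python
def ist_possible(joiner_uuid, joiner_seqno, cluster_uuid, donor_cached_downto):
    """Simplified model of IST eligibility (hypothetical helper, not a Galera API).

    joiner_uuid / joiner_seqno: values the joiner saved (e.g. in grastate.dat).
    cluster_uuid: the state UUID of the running cluster.
    donor_cached_downto: lowest seqno still held in the donor's gcache (assumed).
    """
    if joiner_uuid != cluster_uuid:
        return False  # different history, a full copy is needed
    if joiner_seqno < 0:
        return False  # position unknown (for example after an unclean shutdown)
    # The first writeset the joiner needs is joiner_seqno + 1.
    return joiner_seqno + 1 >= donor_cached_downto


cluster = "539f2f82-9364-11e3-9543-0f9359a9b1e7"
print(ist_possible(cluster, 14975, cluster, 14000))
print(ist_possible(cluster, -1, cluster, 14000))
```

Under this model the size of the gcache only matters for the last condition; a mismatched UUID or an unknown position would lead to SST regardless of how large the cache is.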

Actually, here is what I have in my config file:
2014-02-12 17:40:08 7128 [Note] WSREP: New cluster view: global state: 539f2f82-9364-11e3-9543-0f9359a9b1e7:14983, view# 33: Primary, number of nodes: 3, my index: 1, protocol version 2
2014-02-12 17:40:08 7128 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-02-12 17:40:10 7128 [Note] WSREP: Running: 'wsrep_sst_rsync --role ‘joiner’ --address ‘10.122.134.116’ --auth ‘’ --datadir ‘/bb/mysqldata1/4319/myisam/’ --defaults-file ‘/bb/bin/mysql/environment/5.6/mysql_4319.cnf’ --parent ‘7128’ ‘’ ’
2014-02-12 17:40:12 7128 [Note] WSREP: Prepared SST request: rsync|10.122.134.116:4444/rsync_sst
2014-02-12 17:40:12 7128 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-02-12 17:40:12 7128 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-02-12 17:40:12 7128 [Note] WSREP: Assign initial position for certification: 14983, protocol version: 3
2014-02-12 17:40:12 7128 [Note] WSREP: Service thread queue flushed.
2014-02-12 17:40:12 7128 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.122.134.116:4568
2014-02-12 17:40:12 7128 [Note] WSREP: Node 1.0 (dbalnjdmqldb01) requested state transfer from ‘any’. Selected 2.0 (dbalnydmqldb01)(SYNCED) as donor.
2014-02-12 17:40:12 7128 [Note] WSREP: Shifting PRIMARY → JOINER (TO: 14983)
2014-02-12 17:40:12 7128 [Note] WSREP: Requesting state transfer: success, donor: 2
WSREP_SST: [INFO] Joiner cleanup. (20140212 17:40:13.927)
WSREP_SST: [INFO] Joiner cleanup done. (20140212 17:40:14.444)
2014-02-12 17:40:14 7128 [Note] WSREP: SST complete, seqno: 14975
2014-02-12 17:40:14 7128 [Warning] Using unique option prefix myisam-recover instead of myisam-recover-options is deprecated and will be removed in a future release. Please use the full name instead.
2014-02-12 17:40:14 7128 [Note] Plugin ‘FEDERATED’ is disabled.
2014-02-12 17:40:14 7fb9237ee7e0 InnoDB: Warning: Using innodb_additional_mem_pool_size is DEPRECATED. This option may be removed in future releases, together with the option innodb_use_sys_malloc and with the InnoDB’s internal memory allocator.
2014-02-12 17:40:14 7128 [Note] InnoDB: The InnoDB memory heap is disabled
2014-02-12 17:40:14 7128 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2014-02-12 17:40:14 7128 [Note] InnoDB: Compressed tables use zlib 1.2.3
2014-02-12 17:40:14 7128 [Note] InnoDB: Using Linux native AIO
2014-02-12 17:40:14 7128 [Note] InnoDB: Not using CPU crc32 instructions
2014-02-12 17:40:14 7128 [Note] InnoDB: Initializing buffer pool, size = 20.0G
2014-02-12 17:40:16 7128 [Note] InnoDB: Completed initialization of buffer pool
2014-02-12 17:40:16 7128 [Note] InnoDB: Opened 2 undo tablespaces
2014-02-12 17:40:16 7128 [Note] InnoDB: Highest supported file format is Barracuda.
2014-02-12 17:40:18 7128 [Note] InnoDB: 8 rollback segment(s) are active.
2014-02-12 17:40:18 7128 [Note] InnoDB: Waiting for purge to start

2014-02-12 17:40:18 7128 [Note] Server hostname (bind-address): ‘*’; port: 4319
2014-02-12 17:40:18 7128 [Note] IPv6 is available.
2014-02-12 17:40:18 7128 [Note] - ‘::’ resolves to ‘::’;
2014-02-12 17:40:18 7128 [Note] Server socket created on IP: ‘::’.
2014-02-12 17:40:18 7fb9237ee7e0 InnoDB: Error: table tmp.#sql5627_1c9086_1 does not exist in the InnoDB internal
InnoDB: data dictionary though MySQL is trying to drop it.
InnoDB: Have you copied the .frm file of the table to the
InnoDB: MySQL database directory from another database?
InnoDB: You can look for further help from
InnoDB: [url]http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html[/url]
2014-02-12 17:40:18 7128 [Note] Event Scheduler: Loaded 0 events
2014-02-12 17:40:18 7128 [Note] WSREP: Signalling provider to continue.
2014-02-12 17:40:18 7128 [Note] WSREP: inited wsrep sidno 2
2014-02-12 17:40:18 7128 [Note] WSREP: SST received: 539f2f82-9364-11e3-9543-0f9359a9b1e7:14975
2014-02-12 17:40:18 7128 [Note] WSREP: Receiving IST: 8 writesets, seqnos 14975-14983
2014-02-12 17:40:18 7128 [Note] /bb/bin/mysql/pxc5.6/bin/mysqld: ready for connections.
Version: ‘5.6.15-25.3-log’ socket: ‘/bb/bin/mysql/sockets/mysql.sock.4319’ port: 4319 Percona XtraDB Cluster (GPL) 5.6.15-25.3, Revision 706, wsrep_25.3.r4034
2014-02-12 17:40:18 7128 [Note] WSREP: 2.0 (dbalnydmqldb01): State transfer to 1.0 (dbalnjdmqldb01) complete.
2014-02-12 17:40:18 7128 [Note] WSREP: Member 2 (dbalnydmqldb01) synced with group.
2014-02-12 17:40:18 7128 [Note] WSREP: IST received: 539f2f82-9364-11e3-9543-0f9359a9b1e7:14983
2014-02-12 17:40:18 7128 [Note] WSREP: 1.0 (dbalnjdmqldb01): State transfer from 2.0 (dbalnydmqldb01) complete.
2014-02-12 17:40:18 7128 [Note] WSREP: Shifting JOINER → JOINED (TO: 14983)
2014-02-12 17:40:18 7128 [Note] WSREP: Member 1 (dbalnjdmqldb01) synced with group.
2014-02-12 17:40:18 7128 [Note] WSREP: Shifting JOINED → SYNCED (TO: 14983)
2014-02-12 17:40:18 7128 [Note] WSREP: Synchronized with group, ready for connections
2014-02-12 17:40:18 7128 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

Based on the above messages, I don’t know if it is doing an SST or an IST.

This message:
2014-02-12 17:40:08 7128 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-02-12 17:40:10 7128 [Note] WSREP: Running: 'wsrep_sst_rsync --role ‘joiner’ --address ‘10.122.134.116’ --auth ‘’ --datadir ‘/bb/mysqldata1/4319/myisam/’ --defaults-file ‘/bb/bin/mysql/environment/5.6/mysql_4319.cnf’ --parent ‘7128’ ‘’ ’
2014-02-12 17:40:12 7128 [Note] WSREP: Prepared SST request: rsync|10.122.134.116:4444/rsync_sst

shows that it is requesting an SST, but the next message:

2014-02-12 17:40:18 7128 [Note] WSREP: SST received: 539f2f82-9364-11e3-9543-0f9359a9b1e7:14975
2014-02-12 17:40:18 7128 [Note] WSREP: Receiving IST: 8 writesets, seqnos 14975-14983

shows that it is still doing an SST. If the node does an SST and copies all the data why would I need an IST after that?

If I set wsrep_sst_method=none, it fails and the node does not join the cluster.
The messages are a bit confusing, and I am not sure how to confirm whether it does a full state transfer or an incremental one.
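
One way to take some of the guesswork out of reading the log is to pull out just the two lines that report positions. The sketch below is only an illustration: the regular expressions are written against the WSREP lines quoted in this thread, and the function name is made up for the example. It reports the seqno on the "SST received" line and the writeset count and range on the "Receiving IST" line, so you can see how many writesets were actually replayed.

```python
import re

# Patterns written against the WSREP log lines quoted in this thread.
SST_RECEIVED = re.compile(r"WSREP: SST received: ([0-9a-f-]+):(-?\d+)")
IST_RECEIVING = re.compile(
    r"WSREP: Receiving IST: (\d+) writesets, seqnos (\d+)-(\d+)"
)


def summarize_state_transfer(log_text):
    """Collect the SST position and IST range reported in an error log excerpt."""
    summary = {"sst_seqno": None, "ist_writesets": None, "ist_range": None}
    for line in log_text.splitlines():
        m = SST_RECEIVED.search(line)
        if m:
            summary["sst_seqno"] = int(m.group(2))
        m = IST_RECEIVING.search(line)
        if m:
            summary["ist_writesets"] = int(m.group(1))
            summary["ist_range"] = (int(m.group(2)), int(m.group(3)))
    return summary


sample = """\
2014-02-12 17:40:18 7128 [Note] WSREP: SST received: 539f2f82-9364-11e3-9543-0f9359a9b1e7:14975
2014-02-12 17:40:18 7128 [Note] WSREP: Receiving IST: 8 writesets, seqnos 14975-14983
"""
result = summarize_state_transfer(sample)
print(result)
```

A small writeset count that starts at the position on the "SST received" line is consistent with an incremental catch-up, but the log lines alone do not show how many bytes the SST script copied, so treat this as a hint rather than proof.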

Correction: The above messages are from the error log, not from the config file as I stated at the beginning of the previous post.

@llintes,

Whether SST or IST happens mainly depends on the grastate.dat file. SST happens only when the node is joining or if it determines that it is completely out of sync.

When you are joining a node, you have to make sure that the donor node has no problems/errors; otherwise the joining node may not get correct information.

I suggest you go through the links below for a better understanding:

http://www.mysqlperformanceblog.com/…-cluster-node/
http://www.percona.com/doc/percona-x…_transfer.html
Cluster Failover: http://www.percona.com/doc/percona-x.../failover.html
Frequently Asked Questions: http://www.percona.com/doc/percona-x...r/5.6/faq.html

Note:
While checking logs you should mainly concentrate on [ERROR] lines and try to fix them. Also avoid using deprecated variables in the config file (check [Warning] lines in the log).
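
Since grastate.dat came up above, here is a small sketch of how its contents could be read before starting a stopped node. The file is a short list of "key: value" lines in the data directory. The sample text below is illustrative only: the uuid is the cluster state UUID from the log earlier in the thread, the version and seqno values are assumed, and the helper name is made up. A seqno of -1 is commonly taken to mean the position was not saved cleanly.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


# Illustrative contents only (assumed values, not copied from a real node).
example = """\
# GALERA saved state
version: 2.1
uuid:    539f2f82-9364-11e3-9543-0f9359a9b1e7
seqno:   14975
"""
state = parse_grastate(example)
position_unknown = int(state["seqno"]) == -1
print(state["uuid"], state["seqno"], position_unknown)
```

Comparing the uuid and seqno found here with the "New cluster view: global state" line in the log gives a rough idea of how far behind the node is before it rejoins.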

Thanks for the input. I will review the links. What do you understand from the error log messages posted above? Did it do an SST or an IST?
If it did a full state transfer (SST), all data files would have been copied over. However, I checked the timestamps of the databases that had not been touched and they did not change. The only timestamp that changed was for the database folder that was touched while the node was down. To me, this indicates that it actually did only an IST (applied only the needed writesets). But the messages are still confusing. I want to clarify this before putting the cluster in PROD and creating large databases.