Hi, I set up an XtraDB Cluster with 2 nodes over a 1 Gb NIC. The nodes are not in the same datacenter; they are approximately 90 miles apart. I also have an arbitrator set up on a 3rd machine. I am encountering 2 issues with the cluster:
If I shut down one node and restart it while there is no activity on the other node, it does an IST sync and everything looks OK.
If I shut down the node and run one small DML or DDL statement on the other node, the second node automatically does an SST sync when it is restarted. This is not right. Is there any setup/config parameter that I am missing here?
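For reference, these are the parameters I understand to control IST vs. SST behavior (a sketch only; the values and the SST method below are illustrative examples, not my live config):

```ini
# Illustrative my.cnf fragment -- example values, not my actual config.
[mysqld]
# Writesets kept on a donor for incremental (IST) transfers; a joiner
# missing more writesets than the donor's gcache holds gets a full SST:
wsrep_provider_options = "gcache.size=5G"
# Method used when a full state transfer (SST) is unavoidable:
wsrep_sst_method = xtrabackup-v2
```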
For some reason the main node is crashing with errors like this (please note that host: ‘nylxdev1’ is the arbitrator):
2014-02-07 17:29:37 24636 [Warning] Aborted connection 674 to db: ‘fixdb’ user: ‘fixusr’ host: ‘nylxdev1’ (Got an error reading communication packets)
2014-02-07 17:29:37 24636 [Warning] Aborted connection 675 to db: ‘fixdb’ user: ‘fixusr’ host: ‘nylxdev1’ (Got an error reading communication packets)
2014-02-07 17:31:32 24636 [Warning] InnoDB: Cannot open table fixdb/session_1_cache_tbl from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.
2014-02-07 17:31:32 24636 [Warning] InnoDB: Cannot open table fixdb/session_2_cache_tbl from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.
2014-02-07 17:31:32 24636 [Warning] InnoDB: Cannot open table fixdb/session_3_cache_tbl from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.
2014-02-07 17:32:42 24636 [Warning] Aborted connection 719 to db: ‘fixdb’ user: ‘fixusr’ host: ‘nylxdev1’ (Got an error reading communication packets)
(working on trying to find out why).
After the crash I tried to restart the node, hoping that it would do an SST and everything would get back in sync. However, the restart fails with this error:
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html
2014-02-10 16:52:04 26472 [Note] Event Scheduler: Loaded 0 events
2014-02-10 16:52:04 26472 [Note] WSREP: Signalling provider to continue.
2014-02-10 16:52:04 26472 [Note] WSREP: inited wsrep sidno 2
2014-02-10 16:52:04 26472 [Note] WSREP: SST received: 3ba0b3bd-83b4-11e3-8b56-47a4208ad584:10746753
2014-02-10 16:52:04 26472 [Note] WSREP: Receiving IST: 381 writesets, seqnos 10746753-10747134
2014-02-10 16:52:04 26472 [Note] /bb/bin/mysql/pxc5.6/bin/mysqld: ready for connections.
Version: ‘5.6.15-25.2-log’ socket: ‘/bb/bin/mysql/sockets/mysql.sock.4319’ port: 4319 Percona XtraDB Cluster (GPL) 5.6.15-25.2, Revision 645, wsrep_25.2.r4027
2014-02-10 16:52:04 26472 [ERROR] Slave SQL: Error ‘Can’t create database ‘fixdb’; database exists’ on query. Default database: ‘fixdb’. Query: ‘CREATE DATABASE fixdb’, Error_code: 1007
2014-02-10 16:52:04 26472 [Warning] WSREP: RBR event 1 Query apply warning: 1, 10746754
2014-02-10 16:52:04 26472 [Warning] WSREP: Ignoring error for TO isolated action: source: 4c9b2f63-9041-11e3-bb74-268d5bf12d49 version: 3 local: 0 state: APPLYING flags: 321 conn_id: 1756 trx_id: -1 seqnos (l: -1, g: 10746754, s: 10746753, d: 10746753, ts: 190772022539233)
mysqld: /mnt/workspace/percona-xtradb-cluster-5.6-binary/label_exp/centos5-64/target/percona-build.B12964/src/Percona-XtraDB-Cluster-5.6.15/sql/wsrep_applier.cc:321: wsrep_cb_status_t wsrep_commit_cb(void*, uint32_t, const wsrep_trx_meta_t*, wsrep_bool_t*, bool): Assertion `meta->gtid.seqno == wsrep_thd_trx_seqno(thd)’ failed.
21:52:04 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
Why is the cluster still trying to create the fixdb database when this was an SST and I am expecting a clean initialization? There is no CREATE DATABASE command run after the engine comes up, as no one is connected.
Please advise.
Thank you,
Liviu
This is a fresh install. I also dropped everything and re-added the node. This node crashes all the time. If I create a table on this node, I do not see the *.frm file; only the .ibd file is created. Because of this, when the table is dropped (this is part of testing) I get errors saying the table doesn’t exist, and the node crashes. Do you know why the *.frm file is not created?
I have re-created the cluster after cleaning the folder and re-creating a fresh MySQL database with mysql_install_db. The cluster doesn’t crash anymore. However, I still have the initial problem: whenever I stop a node and make one small change on the other node, the stopped node does a full SST when I bring it back up. My galera cache (gcache.size) is set to 5 GB; there is no way a DROP DATABASE command will need more. Do you know if there is a setting that triggers SST in this case?
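One thing worth checking on the joiner (a sketch; the file content below is made up, not from a real node): if a node did not shut down cleanly, its grastate.dat records seqno -1, and a joiner with seqno -1 cannot request an IST, so it falls back to a full SST regardless of gcache size.

```shell
# Sketch: a grastate.dat with seqno -1 forces a full SST on rejoin.
# The sample file below is illustrative, not taken from a real node.
state=/tmp/grastate-sample.dat
cat > "$state" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    3ba0b3bd-83b4-11e3-8b56-47a4208ad584
seqno:   -1
EOF

awk '/^seqno:/ {
  if ($2 == -1) print "seqno -1: node will request a full SST"
  else          print "seqno " $2 ": IST possible if the donor gcache still covers the gap"
}' "$state"
```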
The log shows that it is still doing an SST. If the node does an SST and copies all the data, why would I need an IST after that?
If I set wsrep_sst_method=none, it fails and the node does not join the cluster.
The message is a bit confusing, and I am not sure how to confirm whether it did a full state transfer or an incremental one.
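A small sketch of how one might grep the error log to tell the two apart (the path is made up, and the sample lines just mirror the log quoted above):

```shell
# Sketch: classifying a joiner's state transfer from its error log.
# "Receiving IST: N writesets" => an incremental transfer was applied;
# an "SST received" line with no IST marker suggests a full copy.
log=/tmp/mysqld-sample.err
cat > "$log" <<'EOF'
2014-02-10 16:52:04 26472 [Note] WSREP: SST received: 3ba0b3bd-83b4-11e3-8b56-47a4208ad584:10746753
2014-02-10 16:52:04 26472 [Note] WSREP: Receiving IST: 381 writesets, seqnos 10746753-10747134
EOF

if grep -q 'Receiving IST' "$log"; then
  echo "joiner caught up via IST"
else
  echo "no IST marker: likely a full SST"
fi
```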
Note:
While checking the logs you should mainly concentrate on the [ERROR] lines and try to fix those; also avoid using deprecated variables in the config file (check the [Warning] lines in the log).
Tx for the input. I will review the links. From the error message posted above, what do you understand: did it do an SST or an IST?
If it did a full state transfer (SST), all data files would have been copied over. However, I checked the timestamps of the databases that had not been touched, and they did not change. The only timestamp that changed was for the folder of the database that was modified while the node was down. This indicates to me that it actually did an IST (applied only the needed writesets), but the message is still confusing. I want to clarify this before putting the cluster in PROD and creating large databases.
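For the record, this is roughly how I compared the timestamps (a sketch with made-up paths and dates, not my actual datadir):

```shell
# Sketch: list database directories modified after a cutoff date, to see
# which ones a state transfer actually touched. Paths/dates are made up.
datadir=/tmp/sample-datadir
mkdir -p "$datadir/fixdb" "$datadir/untouched_db"
touch -d '2014-02-01 00:00' "$datadir/untouched_db"   # old mtime, never touched
touch "$datadir/fixdb"                                # modified during the outage

# Directories changed after the node went down (cutoff illustrative):
find "$datadir" -mindepth 1 -maxdepth 1 -type d -newermt '2014-02-05' -printf '%f\n'
```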