Cannot restart after data update

We are currently testing a 3-node Percona cluster in our dev environment. Our production and test DBs are still basic MySQL master/replica configurations. Previously, I had written a script to update dev or test with data from one of the other environments, which worked acceptably. I am now attempting to update this process to populate the cluster with data from one of the other two systems, but I am having some difficulty. My current iteration stopped mysql on all three nodes before populating one with the new data. However, I am now unable to restart mysql on the node with the new data. The error log isn’t telling me anything useful; the last line is simply

Log of wsrep recovery (--wsrep-recover):

with no further information, and I’m not having a lot of luck searching for what this indicates.

I have not yet tried to start mysql on either of the other two nodes. I have tried several different bootstrap command examples on this node, with the same result.


Hello. Why are you stopping the cluster? This should not be necessary. Leave all 3 nodes online and do a data load like normal to a single node. This will replicate automatically to the other nodes.

As for bringing the cluster back online, cat /var/lib/mysql/grastate.dat on each node and bootstrap the node with the highest sequence number. If you’re on 5.7, you’ll also need to set safe_to_bootstrap: 1 on that node.
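As a sketch of that check, the snippet below works against a throwaway copy in /tmp; the UUID, seqno, and file contents are made up for illustration, and on a real node you would inspect /var/lib/mysql/grastate.dat itself:

```shell
# Illustrative only: fabricate a grastate.dat shaped like the one under
# /var/lib/mysql, then flip safe_to_bootstrap the way 5.7 requires.
cat > /tmp/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6e6ef307-0000-0000-0000-000000000000
seqno:   1234
safe_to_bootstrap: 0
EOF

# Compare seqno across all three nodes; the node with the highest wins.
grep seqno /tmp/grastate.dat

# On the chosen node only, mark it safe to bootstrap.
sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /tmp/grastate.dat

# Then bootstrap that node, e.g.:
# /etc/init.d/mysql bootstrap-pxc
```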

Because when I left the other nodes up last time, several schemas got overwritten with nothing – they were initially populated, then later had no tables. My working theory was that this happened because they weren’t populated on the other nodes when I started the process. Since I haven’t been able to find any kind of recipe or how-to for this scenario, trial and error seemed like the best option, and this was my next attempt.

I tried setting safe_to_bootstrap = 1 in the .cnf file, then running

/etc/init.d/mysql bootstrap-pxc

but the result is the same – no start, nothing informative in the log, only the same message as before.

There should never have been a discrepancy between the nodes in the first place; that is an odd situation you got into. Did you set wsrep_on=OFF? Or did you set wsrep_osu_method=TOI? These are both atypical configurations, but either could explain what originally happened to your setup.

What version of PXC? Without knowing a lot more, it will be tough to diagnose via forum posts. Check out scenario #6 in this post:

version 5.7.17

Neither of those variables is set.
It is likely that I have fouled up something in the attempt to bring the new data over to the cluster. I am new to Percona and was not able to find a useful model to follow, so I simply adapted my existing process, which used MySQL Enterprise Backup to populate a master/replica setup.

I stopped mysql on all three nodes. I used innobackupex to take a backup on my source server, rsynced to one of the nodes, applied the log, and move-backed the data (not enough disk space for a copy-back). Now I’m trying to bring the node back up with this new data. The idea was to then bring back the other two and let them get the data from the first one. My first attempt was similar, but without stopping the other two nodes, and resulted in many schemas losing content, as previously described.
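Roughly, the sequence I described was the following; directories and the node hostname are placeholders, not the actual paths from my setup:

```shell
# Sketch of the restore path described above; run with mysql stopped
# on the target node. Paths and hostnames are placeholders.
innobackupex /backups/                         # take backup on the source server
rsync -av /backups/<timestamp>/ node1:/restore/  # ship it to one cluster node
innobackupex --apply-log /restore/             # prepare (apply the log)
innobackupex --move-back /restore/             # move (not copy) into the datadir
chown -R mysql:mysql /var/lib/mysql            # restore ownership after move-back
```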

Now, clearly, I’ve missed / misunderstood something somewhere along the way, but I don’t know where I went off the path or how to get back.

That is a good process. And you’re telling me that you cannot bring that node online? Remove the log file and attempt to bootstrap it again. Attach your error log here so I can review it, along with your my.cnf.

Just to check – when you say “remove the log file”, you mean the error.log, or is there another relevant log?
A fresh error log, after an attempted restart, contains only

Log of wsrep recovery (--wsrep-recover):

Cnf file attached. (.txt added as .cnf upload not allowed; it’s not on the actual file.)

mphdb21.cnf.txt (3.46 KB)

Ok, I see a few things wrong. 1) “wsrep_replicate_myisam=ON” will not work with 5.7; you need to be 100% InnoDB. 2) “innodb_buffer_pool_instances = 64” is far too many instances for only 80G of buffer pool; knock this setting down to 10. And yes, I want you to delete this log, /var/log/mysql/error.log, and check that it is re-created during bootstrap.

But here’s the big issue: “safe_to_bootstrap = 1” does not belong in my.cnf; that parameter belongs in the grastate.dat file. Remove it from my.cnf, edit grastate.dat, and change the value appropriately.
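Putting both fixes together, the relevant pieces would end up looking something like this; the 80G buffer pool size is from the discussion above, everything else is illustrative:

```ini
# /etc/mysql/my.cnf (fragment) -- after the fixes above
[mysqld]
innodb_buffer_pool_size      = 80G
innodb_buffer_pool_instances = 10    # was 64; far too many for 80G
# wsrep_replicate_myisam = ON        # removed: will not work with 5.7
# safe_to_bootstrap does NOT go in this file

# /var/lib/mysql/grastate.dat -- on the bootstrap node only
# safe_to_bootstrap: 1
```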

Hmm. The MyISAM replication is a problem; we’re mostly InnoDB, but a handful of tables haven’t been converted yet. It can be fixed, though. I commented it out for now, and that seems to have made the difference, along with moving the “safe_to_bootstrap” to the correct file. (I did the latter without the former, with no change, but doing both did the trick.)

The buffer_pool_instances is probably an artifact of basing the file on one from a larger server and not fully updating; my boss actually did that part of the setup rather than waiting for me to get to it. I’ve also adjusted that.

Glad to hear you got it all working!

Thank you so much for your help so far. I’ve got my repopulated node up, but the next one is timing out trying to sync at startup. I’m assuming there’s a variable somewhere controlling this, but I’ve been unable to locate it. (Ubuntu 16.04, if it’s relevant.)

Would it make more sense / be simpler to rsync the backup to all three nodes, apply-log and move-back, instead of letting SST bring the other two in sync?

Depending on the dataset size, initial node sync can take several hours. A good estimate is 100GB/hr on a GigE setup. Make sure the SST user exists on the first node and that it has correct permissions and the correct password as configured in your my.cnf. This process will create several logs within the $datadir so you can watch what is happening.
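Verifying the SST user might look like the following; ‘sstuser’ and ‘sstpass’ are placeholders that must match whatever wsrep_sst_auth is set to in my.cnf, and the privilege list shown is the set documented for the xtrabackup-based SST method:

```shell
# Placeholder credentials -- these must match wsrep_sst_auth in my.cnf.
mysql -e "CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 'sstpass';"
mysql -e "GRANT RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';"
mysql -e "FLUSH PRIVILEGES;"
```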

Yes, the user exists with the correct credentials. I see innobackup.backup.log populating on the donor system; I’m not sure what, if anything, I should be monitoring on the joining system. My concern is that the startup process is terminating rather than waiting for the sync, so I’m not sure what will happen when the backup on the donor system completes.

Even if /etc/init.d/mysql times out, the sync should still finish. Just “tail -f” the log on the donor. There should be a similar log on the joiner inside a “.sync” folder inside $datadir.

Thank you again! I’m not seeing a .sync dir in $datadir, but I do see .sst, though there’s no log file I see there. I’ll keep an eye on the donor’s log. (I like ‘less +F’ better than ‘tail -f’, though).