I have a 5-node cluster with a very high write workload, 24/7.
Sometimes a node needs to do an IST to synchronize with the other nodes, and we often get this error during IST:
State transfer request failed unrecoverably because the donor seqno had gone forward during IST, but SST request was not prepared from our side due to selected state transfer method (which do not supports SST during node operation)
Then the node is restarted and an SST begins. The problem is that every node holds about 7 TB.
What does it mean that the donor seqno had gone forward during IST? If the donor receives a write, it is normal for the seqno to increment, isn’t it?
I spoke with the team here, and they said their first instinct might be for you to look around the area of gcache given the size of your nodes… here are some blog posts around the topic, but there are quite a few more on the website here:
If you take a look at those posts and then have further questions around how this works, then do feel free to come back and ask.
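As a rough starting point on the gcache side: a donor can only serve an IST if the write-sets the joiner is missing are still held in its gcache, so the cache needs to cover everything written while a node is away. Below is a minimal Python sketch of that arithmetic, not an official sizing tool. It assumes you sample the Galera status counters wsrep_replicated_bytes and wsrep_received_bytes twice (for example with SHOW GLOBAL STATUS) during a representative busy period. The function name, the headroom factor, and the sample numbers are all hypothetical and only for illustration.

```python
def estimate_gcache_bytes(replicated_start, received_start,
                          replicated_end, received_end,
                          interval_seconds, outage_seconds,
                          headroom=1.5):
    """Estimate a gcache size (in bytes) that would cover a node outage.

    The *_start / *_end arguments are two samples of the status counters
    wsrep_replicated_bytes and wsrep_received_bytes, taken
    interval_seconds apart. outage_seconds is the longest absence you
    want IST to be able to cover. headroom is an assumed safety factor,
    not a value recommended by any documentation.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if outage_seconds < 0:
        raise ValueError("outage_seconds must not be negative")

    # Bytes of write-sets that passed through this node during the interval.
    written = (replicated_end - replicated_start) + (received_end - received_start)
    bytes_per_second = written / interval_seconds

    # Write-sets that would accumulate while a node is away, plus headroom.
    return int(bytes_per_second * outage_seconds * headroom)


if __name__ == "__main__":
    # Hypothetical counters sampled 60 seconds apart; cover a 2-hour outage.
    size = estimate_gcache_bytes(
        replicated_start=0, received_start=0,
        replicated_end=1_200_000_000, received_end=4_800_000_000,
        interval_seconds=60, outage_seconds=2 * 3600,
    )
    print(f"suggested gcache.size: about {size / 1024**3:.1f} GiB")
```

The result would go into gcache.size inside wsrep_provider_options. Treat it as a first estimate only: the write rate on a 24/7 workload varies, so it is worth sampling at peak times.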
There may also be opportunities to rearchitect if the size of the nodes is troublesome e.g. archiving data or functionally sharding… if you are able to look at the issue from that point of view, it might be of benefit.
Outside those suggestions, though, a proper review would likely need a detailed look into your error logs and so on, and would be very specific to your environment. That’s not so easy to address here in the open source forum, which is necessarily geared to generic help. If it’s a key system for you and you could use some input on a professional basis, you are welcome to contact me directly (name dot surname at percona.com) and I can introduce you to someone who can look at this option with you.