IST fallback to SST due to large seqno gap

Hi all,

During MySQL service startup, after a manual data restore (copying data onto a node), the IST process failed. The error log contained the following message:

[Note] [MY-000000] [WSREP] may fallback to sst. ist_seqno [ x ] < safe_ist_seqno [x + y]

In general, what determines the maximum valid gap between ist_seqno and safe_ist_seqno for IST to be possible?

We have observed inconsistent behavior during our tests:

  1. In one instance, IST was successful even though the sequence gap was relatively large.
  2. In a subsequent test with a smaller gap, the node failed to perform IST and fell back to SST.

What could cause a smaller gap to be rejected while a larger one is accepted?

Could you please describe in more detail how synchronization from the GCache works?

My understanding is that if the seqno found in a joiner’s grastate.dat is lower than the donor’s wsrep_local_cached_downto, an IST (Incremental State Transfer) cannot occur, and the node must perform a full SST (State Snapshot Transfer). Is this correct?

Additionally, I frequently see this specific line in the error logs: [WSREP] may fallback to sst. ist_seqno [ x ] < safe_ist_seqno [x + Y]

What exactly do ist_seqno and safe_ist_seqno represent in this context, and why is the “safe” value often significantly higher than the joiner’s request?

What gcache.size was configured?

You can take a look at this detailed blog about gcache

Also, check out the value ist_only, which allows only IST. If IST is not possible, mysqld exits instead of falling back to SST.

Thank you both @matthewb and @Yunus for your inputs.

Indeed, I came across an incredibly helpful article by Krunal Bauskar that explains the inner workings of IST in great detail.

As the article points out, the contents of the GCache can vary from node to node. Therefore, to determine if an IST is possible, it is critical to compare the sequence number (seqno) from the joining node’s grastate.dat file against the wsrep_local_cached_downto value on the donor node.

The rule is simple: If the donor’s wsrep_local_cached_downto is higher than the joiner’s seqno, an IST cannot occur.
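
For anyone who wants to script this check, here is a minimal Python sketch of the rule as stated above. It is not Galera's actual code. The function names and the sample grastate.dat contents are made up for illustration, and treating a negative seqno as "position unknown, no IST" is an assumption of this sketch. The donor value would come from something like SHOW GLOBAL STATUS LIKE 'wsrep_local_cached_downto' on the donor.

```python
import re

def read_grastate_seqno(text: str) -> int:
    """Return the seqno recorded in the contents of a grastate.dat file."""
    match = re.search(r"^seqno:\s*(-?\d+)\s*$", text, re.MULTILINE)
    if match is None:
        raise ValueError("no seqno line found")
    return int(match.group(1))

def ist_possible(joiner_seqno: int, donor_cached_downto: int) -> bool:
    """Apply the simple rule from this thread (hypothetical helper).

    IST is ruled out when the donor's lowest cached seqno is higher
    than the joiner's seqno.
    """
    if joiner_seqno < 0:
        # Assumption: a negative seqno means the position is unknown.
        return False
    return donor_cached_downto <= joiner_seqno

# Made-up file contents, for illustration only.
SAMPLE = """# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000001
seqno:   1500
safe_to_bootstrap: 0
"""

joiner = read_grastate_seqno(SAMPLE)
print(ist_possible(joiner, donor_cached_downto=1200))
print(ist_possible(joiner, donor_cached_downto=1600))
```

Because each node's GCache can hold a different range, the same joiner seqno can pass this check against one donor and fail against another.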

Additionally, the article mentions a “safety gap” mechanism. This internal buffer ensures that the donor doesn’t offer an IST if the requested data is dangerously close to being purged from the cache during the transfer. The formula for this safety margin is:

safety gap = (Current State of Cluster – Lowest available seqno from any existing node) * 0.008
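
To make the formula concrete, here is a small worked sketch in Python with made-up numbers. One assumption that is mine and not from the article as quoted: the safe_ist_seqno in the log line is taken to be the lowest available seqno plus the safety gap. The function names are hypothetical, and the 0.008 factor is written as integer arithmetic to avoid floating-point rounding.

```python
def safety_gap(cluster_seqno: int, lowest_available_seqno: int) -> int:
    """Safety gap per the formula above: (cluster - lowest) * 0.008."""
    # 0.008 expressed as 8/1000 so the result stays an integer.
    return (cluster_seqno - lowest_available_seqno) * 8 // 1000

def safe_ist_seqno(cluster_seqno: int, lowest_available_seqno: int) -> int:
    """Assumption of this sketch: safe seqno = lowest available + safety gap."""
    return lowest_available_seqno + safety_gap(cluster_seqno, lowest_available_seqno)

def may_fallback_to_sst(ist_seqno: int, cluster_seqno: int,
                        lowest_available_seqno: int) -> bool:
    """Mirror the comparison in the log line: ist_seqno < safe_ist_seqno."""
    return ist_seqno < safe_ist_seqno(cluster_seqno, lowest_available_seqno)

# Made-up numbers: cluster at 1,000,000, lowest cached seqno 500,000.
print(safety_gap(1_000_000, 500_000))
print(safe_ist_seqno(1_000_000, 500_000))
print(may_fallback_to_sst(503_000, 1_000_000, 500_000))
print(may_fallback_to_sst(505_000, 1_000_000, 500_000))
```

Under these assumptions the outcome depends on where the joiner's seqno sits relative to the donor's cached range, not on how far the joiner is behind the cluster. That would be consistent with a smaller gap being rejected while a larger one is accepted, if the GCache covered a different range in each test.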

Below, I am attaching the original article by Krunal Bauskar, which elaborates on these topics.