pt-online-schema-change: triggers breaking replication

Scenario: Simple Master->Slave setup. I want to add an index to a table on the slave. Replication is caught up and fine. I launch pt-osc to do the work on the slave, specifying “–no-swap-tables”

After the script is done, running some simple checks to compare the two tables, I find that the COUNT(*) is off.

Going through the masters binlog, I can see 14 DELETE statements which is exactly how many the COUNT() is off by.

When one of the DELETE’s came down the repl pipe, the trigger fired off but because those rows hadn’t been copied to the new table yet, nothing happened.

The same goes for UPDATE. Upon doing some checksums between the two tables, we find that values in the new table were not updated because when the trigger for UPDATE on orig table fired off, the row on new didn’t exist yet.

What’s the workaround for this other than doing STOP SLAVE before running the script?

Hi, Can you please provide below details? are you talking about MySQL routines (Stored Procedures/Triggers) for update/delete?

  • MySQL versions of Master/Slave
  • pt-online-schema-change command line which you are running
  • Version of pt-online-schema-change

Let’s assume I am running the most recent versions of everything. It won’t make a difference anyways. Here is an example:

Master has 1000 rows. Slave has same 1000 rows. I want to add an index on to the slave. I am going to use pt-osc. I start pt-osc and leave the slave repl threads running. pt-osc creates the _new table, creates the triggers and begins copying rows from the original table. Lets say that it has now copied the first 500 PK rows over. Someone updates their profile online. Their PK value is 700. UPDATE users SET favColor = ‘blue’ comes from the master through replication. This row gets updated in the original table. The AFTER UPDATE trigger is fired off in attempt to update the new table.

The trigger fails because the new table does not yet have PK 700. What happens now?

@utdrmac, curious if you have any updates on this. in trying to think through the logic before running this against production, i’m running into similar issues.

in particular…

  • row order for rows inserted / updated during pt-osc (resulting table out of order from other replicas)
  • might be possible for trigger to execute before chunk insert (update overwritten by original, copy not deleted by copy trigger). this requires locking during the chunk operation, which might be done at the row level instead of the table (for which the target rows don’t yet exist… so no lock).

for more details…