pt-online-schema-change not as online as I hoped...

I used this tool for the first time today, to do an ALTER on a large table. This is on a cluster with 3 nodes.

During some of this time (unfortunately while I was away) the DB cluster became slow to the point of being unresponsive; I can see this from logs from haproxy, which is sitting in front of it, which shows the occasional 2s timeout. I also had pingdom errors and user reports.

In the log of the node issuing the change, I see a few instances of this:

2015-11-27 17:11:28 27983 [Warning] WSREP: Failed to report last committed 333909394, -4 (Interrupted system call)

It’s still running, and will hopefully complete, but below is the output so far.

Any ideas? I wondered whether I would be better off bringing the other two nodes down for the duration of a schema change, even if that requires an SST when they come back up.

No slaves found. See --recursion-method if host db1.ilovefreegle.org has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
analyze_table, 10, 1
copy_rows, 10, 0.25
create_triggers, 10, 1
drop_triggers, 10, 1
swap_tables, 10, 1
update_foreign_keys, 10, 1
Child tables:
iznik.messages_groups (approx. 40686 rows)
Will use the rebuild_constraints method to update foreign keys.
Altering iznik.messages
Creating new table…
Created new table iznik._messages_new OK.
Altering new table…
Altered iznik._messages_new OK.
2015-11-27T16:03:28 Creating triggers…
2015-11-27T16:03:28 Created triggers OK.
2015-11-27T16:03:28 Copying approximately 46400 rows…
Copying iznik.messages: 7% 07:35 remain
Copying iznik.messages: 12% 08:30 remain
Copying iznik.messages: 14% 10:16 remain
Copying iznik.messages: 15% 12:30 remain
Copying iznik.messages: 17% 13:57 remain
2015-11-27T16:06:31 Rows are copying very slowly. --chunk-size has been automatically reduced to 1. Check that the server is not being overloaded, or increase --chunk-time. The last chunk selected 1 rows and took 5.298 seconds to execute.
Copying iznik.messages: 20% 13:50 remain
Copying iznik.messages: 21% 16:43 remain
Copying iznik.messages: 21% 18:28 remain
Copying iznik.messages: 27% 15:22 remain
Copying iznik.messages: 27% 16:47 remain
Copying iznik.messages: 29% 16:10 remain
Copying iznik.messages: 32% 15:06 remain
Copying iznik.messages: 34% 17:59 remain
Copying iznik.messages: 34% 18:39 remain
Copying iznik.messages: 39% 16:25 remain
Copying iznik.messages: 43% 14:17 remain
Copying iznik.messages: 43% 16:54 remain
Copying iznik.messages: 43% 17:50 remain
Copying iznik.messages: 43% 19:06 remain
Copying iznik.messages: 43% 20:42 remain
Copying iznik.messages: 47% 18:37 remain
Copying iznik.messages: 47% 20:49 remain
Copying iznik.messages: 47% 21:19 remain
Copying iznik.messages: 48% 21:25 remain
Copying iznik.messages: 49% 21:58 remain
Copying iznik.messages: 49% 22:10 remain
Copying iznik.messages: 51% 20:38 remain
Copying iznik.messages: 52% 21:27 remain
Copying iznik.messages: 52% 22:23 remain
Copying iznik.messages: 53% 21:58 remain
Copying iznik.messages: 54% 22:35 remain
Copying iznik.messages: 54% 22:50 remain
Copying iznik.messages: 56% 21:39 remain
Copying iznik.messages: 57% 21:40 remain
Copying iznik.messages: 57% 21:47 remain
Copying iznik.messages: 57% 21:53 remain
Copying iznik.messages: 57% 22:33 remain
Copying iznik.messages: 58% 22:21 remain
Copying iznik.messages: 60% 21:23 remain
Copying iznik.messages: 60% 22:16 remain
Copying iznik.messages: 60% 22:36 remain
Copying iznik.messages: 60% 23:16 remain
Copying iznik.messages: 62% 21:42 remain
Copying iznik.messages: 65% 19:00 remain
Copying iznik.messages: 67% 18:07 remain
Copying iznik.messages: 67% 17:56 remain
Copying iznik.messages: 67% 19:19 remain
Copying iznik.messages: 68% 19:15 remain
Copying iznik.messages: 68% 19:11 remain
Copying iznik.messages: 69% 18:46 remain
Copying iznik.messages: 69% 18:44 remain
Copying iznik.messages: 69% 18:57 remain
Copying iznik.messages: 69% 19:06 remain
Copying iznik.messages: 70% 19:09 remain
Copying iznik.messages: 71% 18:16 remain
Copying iznik.messages: 72% 17:52 remain
Copying iznik.messages: 72% 18:17 remain
Copying iznik.messages: 72% 18:25 remain
Copying iznik.messages: 72% 18:21 remain
Copying iznik.messages: 73% 17:59 remain
Copying iznik.messages: 73% 17:59 remain
Copying iznik.messages: 73% 18:10 remain
Copying iznik.messages: 74% 17:49 remain
Copying iznik.messages: 74% 17:49 remain
Copying iznik.messages: 75% 17:31 remain
Copying iznik.messages: 75% 17:20 remain
Copying iznik.messages: 76% 17:02 remain
Copying iznik.messages: 76% 16:40 remain
Copying iznik.messages: 76% 16:47 remain
Copying iznik.messages: 76% 17:07 remain
Copying iznik.messages: 76% 17:21 remain
Copying iznik.messages: 77% 16:59 remain
Copying iznik.messages: 77% 16:58 remain
Copying iznik.messages: 77% 17:04 remain
Copying iznik.messages: 77% 17:06 remain
Copying iznik.messages: 78% 16:52 remain
Copying iznik.messages: 78% 16:47 remain
Copying iznik.messages: 78% 17:23 remain
Copying iznik.messages: 78% 17:10 remain
Copying iznik.messages: 79% 16:54 remain
Copying iznik.messages: 79% 16:39 remain
Copying iznik.messages: 79% 16:40 remain
Copying iznik.messages: 79% 16:39 remain
Copying iznik.messages: 80% 16:40 remain
Copying iznik.messages: 80% 16:39 remain
Copying iznik.messages: 80% 16:42 remain
Copying iznik.messages: 80% 16:42 remain
Copying iznik.messages: 80% 16:54 remain
Copying iznik.messages: 80% 17:04 remain
Copying iznik.messages: 80% 17:09 remain
Copying iznik.messages: 80% 16:59 remain
Copying iznik.messages: 81% 16:43 remain
Copying iznik.messages: 81% 16:30 remain
Copying iznik.messages: 81% 16:21 remain
Copying iznik.messages: 81% 16:16 remain
Copying iznik.messages: 82% 16:07 remain
Copying iznik.messages: 82% 16:03 remain
Copying iznik.messages: 82% 15:42 remain
Copying iznik.messages: 83% 15:11 remain
Copying iznik.messages: 83% 15:00 remain
Copying iznik.messages: 83% 14:58 remain
Copying iznik.messages: 83% 14:47 remain
Copying iznik.messages: 84% 14:37 remain
Copying iznik.messages: 84% 14:38 remain
Copying iznik.messages: 84% 14:35 remain
Copying iznik.messages: 84% 14:43 remain
Copying iznik.messages: 84% 14:45 remain
Copying iznik.messages: 84% 14:45 remain
Copying iznik.messages: 84% 14:41 remain
Copying iznik.messages: 84% 14:39 remain
Copying iznik.messages: 85% 14:33 remain
Copying iznik.messages: 85% 14:23 remain
Copying iznik.messages: 85% 14:08 remain
Copying iznik.messages: 85% 13:54 remain

Following up in case it helps other people.

The application-level hangs happened again later in the copy, so I could get a bit more info:
[LIST]
[]I could see that the machine was thrashing. That’s probably because I had an excessive cache size configured; this is a large table and so it’s plausible that if the cache is too large, it’ll cause swapping.
[
]There was a build-up of processes waiting for the wsrep commit. This is probably the cause of the application misbehaving; even though the critical app is running on another database, it seems that the way in which the cluster replication works means that if one database is locked out, replication can queue up behind it in a way that affects other databases.
[/LIST] The change eventually failed with:

2015-11-27T17:57:26 Error copying rows from iznik.messages to iznik._messages_new: Threads_running=121 exceeds its critical threshold 50

So I: [LIST=1]
[]reduced the cache to a more reasonable size to avoid thrashing
[
]shut down the other two nodes to avoid the replication killing other databases
[/LIST] This time I still got the error about copying slowly, and the rate of progress was similar - but the end-user behaviour was better, and it completed successfully. It did require an SST on the other nodes.

So the lessons for me are:
[LIST]
[]check your config
[
]don’t run it in a clustered environment
[/LIST]