Exhausted retries while inserting; (pt-archiver)

Hi all. I have a pt-archiver job that is meant to run all day, but it is exiting after about 30m with this message. I can’t see any reason why the inserts would fail, and the output doesn’t give me any clues.

Output looks like

May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: TIME ELAPSED COUNT
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:00:02 0 0
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:03:46 224 10000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:07:09 427 20000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:10:35 632 30000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:14:15 853 40000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:17:42 1059 50000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:21:11 1269 60000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:24:39 1477 70000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:28:04 1682 80000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:31:32 1889 90000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:34:56 2094 100000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:38:21 2299 110000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:41:48 2506 120000
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: 2021-05-19T21:44:38 2676 128267
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: Started at 2021-05-19T21:00:02, ended at 2021-05-19T21:48:02
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: Source: D=redacted
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: Dest: D=redacted
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: SELECT 128300
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: INSERT 128267
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: DELETE 128250
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: Action Count Time Pct
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: sleep 2565 2565.3479 89.05
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: inserting 128267 45.6143 1.58
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: bulk_deleting 2565 27.3536 0.95
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: select 2567 23.6883 0.82
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: commit 256536 9.8213 0.34
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: rollback 2 0.0001 0.00
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: other 0 208.9058 7.25
May 19 21:48:02 ip-172-22-81-79 do_purge.sh[32735] O: Exiting because retries exceeded.

Any clues how to investigate much appreciated.

FWIW, nothing else is writing to the dest table, which is just a simple table into which the primary key from the old table is being inserted. However, I do have two other jobs running that read from this table to remove orphaned records in a couple child tables.

2 Likes

Hi sydneyos,

In the source code, you can find the following about “retries”:

=item --retries

type: int; default: 1

Number of retries per timeout or deadlock.

Specifies the number of times pt-archiver should retry when there is an
InnoDB lock wait timeout or deadlock. When retries are exhausted,
pt-archiver will exit with an error.

Consider carefully what you want to happen when you are archiving between a
mixture of transactional and non-transactional storage engines. The INSERT to
L<“–dest”> and DELETE from L<“–source”> are on separate connections, so they
do not actually participate in the same transaction even if they’re on the same
server. However, pt-archiver implements simple distributed transactions in
code, so commits and rollbacks should happen as desired across the two
connections.

Does the error happen every time or it was a time off?
I suggest increasing --retries flag from the default and check if it continues to happen.

As per the above, most possible causes are deadlocks or timeout.
I would suggest checking:

  • if there is indeed no transactions involved on the table
  • if the table does not have any referenced FK that might be interfering
  • if configuration timeout values are not super low (as per the logs provided, it timeouts in 1 second)
  • if checking the error.log there is no sign of error, overloading, data corruption
  • check system resources in case there is saturation

Regards

2 Likes

Thanks.

The dest table is super simple (AutoId, ID, DateCreated with default). No foreign keys.

I see some Aborted connection 1860140 to db: messages in the mysql error.log

One question - where is the timeout configured? Are you referring to the sleep length? I’m not clear where you are seeing that or how to change it.

Thank you!

1 Like

Hi again,

I see some Aborted connection 1860140 to db: messages in the mysql error.log

For above message, please check MySQL "Got an error reading communication packet"

One question - where is the timeout configured? Are you referring to the sleep length? I’m not clear where you are seeing that or how to change it.

Timeout can be for example from innodb_lock_wait_timeout or lock_wait_timeout . You can check with show global variables like ‘%timeout%’; , and you can change them either globally or on a per session basis.

Was this a one time off or a recurring issue? You can also re try execution with an increased number of --retries and/or while the server is not under peak load.

Regards

1 Like

Also, using smaller chunks (--limit) would help, you seem to be using 100000. I suggest you start at 1000 and increase only if it is really too slow. Smaller chunks reduces the risk of contention and the need for retry.

2 Likes

Thank you both. @CTutte it is recurring - I’ll check those variables.

@yves.trudeau, yes, good point. That was not intentional )-:

1 Like