Batch deletes from big table causing wsrep system lock

On our production server we have a big table which runs into 450 Gb. The table is partitioned based on time into monthly partitions.

The delete is performed using a stored procedure. When we execute the SP we are getting following warning in server logs on all the nodes and at the same time all other transactions are waiting for this SP to complete.

2017-10-03 09:43:10 12355 [Warning] WSREP: Failed to report last committed 33482996, -4 (Interrupted system call)
2017-10-03 09:43:12 12355 [Warning] WSREP: Failed to report last committed 33482997, -4 (Interrupted system call)
2017-10-03 09:43:20 12355 [Warning] WSREP: Failed to report last committed 33482999, -4 (Interrupted system call)
2017-10-03 09:43:33 12355 [Warning] WSREP: Failed to report last committed 33483002, -4 (Interrupted system call)
2017-10-03 09:43:50 12355 [Warning] WSREP: Failed to report last committed 33483007, -4 (Interrupted system call)
2017-10-03 09:44:25 12355 [Warning] WSREP: Failed to report last committed 33483018, -4 (Interrupted system call)

The delete is performed in a while loop and each batch is limited to 10K rows.

Below is the SP, used for the delete operation.

CREATE spc_Util_DeleteTsPassages(p_tsId INT, p_startDate DATE, p_endDate DATE)
BEGIN
tspassages_Block:BEGIN
DECLARE v_tsId, v_deleteStartDate, v_deleteEndDate INT;
DECLARE v_Process VARCHAR(256);
DECLARE v_count BIGINT;

SET v_count = 0;
SET v_tsId = p_tsId;
SET v_Process = ‘tspassages - Data Deletion’;
SET v_deleteStartDate = UNIX_TIMESTAMP(p_startDate);
SET v_deleteEndDate = UNIX_TIMESTAMP(p_endDate);

– select v_tsId, v_deleteStartDate, v_deleteEndDate;

INSERT INTO dataDeleteAuditLogs(PROCESS,LogDate,Descriptions)
VALUES(v_Process,NOW(),CONCAT('Deletion started for tsId : ', v_tsId, '; FromDate : ',p_startDate, ’ ToDate : ', p_endDate));

WHILE EXISTS(SELECT passageTime FROM tspassages WHERE passageTime >= v_deleteStartDate AND passageTime < v_deleteEndDate AND tsId = v_tsId LIMIT 1)
DO
– Deleting from tspassages
DELETE FROM tspassages
WHERE passageTime >= v_deleteStartDate AND passageTime < v_deleteEndDate AND tsId = v_tsId
LIMIT 10000;
SET v_count = v_count + ROW_COUNT();
END WHILE;
INSERT INTO dataDeleteAuditLogs(PROCESS,LogDate,Descriptions)
VALUES(‘tspassages - Data Deletion’,NOW(),CONCAT('Deletion process completed for tsId : ', v_tsId, '; Total records deleted is : ',v_count));
END;
END$$

DELIMITER ;

Table definition

CREATE TABLE tspassages (
rowId bigint(20) unsigned NOT NULL AUTO_INCREMENT,
tsId mediumint(8) unsigned NOT NULL,
camId tinyint(3) unsigned NOT NULL,
passageTime int(11) unsigned NOT NULL,
passageMs smallint(3) unsigned NOT NULL,
passageId smallint(5) unsigned NOT NULL,
lP varchar(15) NOT NULL,
isCalib tinyint(1) unsigned NOT NULL,
Confidence smallint(5) unsigned NOT NULL,
RDWType tinyint(2) unsigned NOT NULL,
PRIMARY KEY (rowId,passageTime),
UNIQUE KEY uidx_tspassages (tsId,camId,passageTime,passageMs),
KEY Idx_tspassage_passagetime (passageTime),
KEY idx_tspassage_tsId (tsId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY RANGE (PassageTime)
(PARTITION 2017JAN VALUES LESS THAN (1485903600) ENGINE = InnoDB,
PARTITION 2017FEB VALUES LESS THAN (1488322800) ENGINE = InnoDB,
PARTITION 2017MAR VALUES LESS THAN (1490997600) ENGINE = InnoDB,
PARTITION 2017APR VALUES LESS THAN (1493589600) ENGINE = InnoDB,
PARTITION 2017MAY VALUES LESS THAN (1496268000) ENGINE = InnoDB,
PARTITION 2017JUN VALUES LESS THAN (1498860000) ENGINE = InnoDB,
PARTITION 2017JUL VALUES LESS THAN (1501538400) ENGINE = InnoDB,
PARTITION 2017AUG VALUES LESS THAN (1504216800) ENGINE = InnoDB,
PARTITION 2017SEP VALUES LESS THAN (1506808800) ENGINE = InnoDB,
PARTITION 2017OCT VALUES LESS THAN (1509490800) ENGINE = InnoDB,
PARTITION 2017NOV VALUES LESS THAN (1512082800) ENGINE = InnoDB,
PARTITION 2017DEC VALUES LESS THAN (1514761200) ENGINE = InnoDB)

Our Prod setup is of 3 nodes running on Server version: 5.6.35-80.0-56-log.

Why am I getting a WSREP system lock for a batch delete operation and how can I mitigate this issue ?

With Regards,
Raghupradeep

Hi Raghu,

There is a good blog article about this warning. The warning itself is just a notification that the thread reporting the last committed transaction was interrupted. The server node will retry automatically.

Regarding the wait, there are a few things you might want to try to optimize your SP. First, I believe the range lookup on passageTime is not using a good index. EXPLAIN will help point you in the right direction, ANALYZE TABLE may help. You might also try reducing your batch size. For further assistance in optimizing your application for PXC, please contact support.

–Dave

Hi Dave,

Thank you for the reply.

I checked the SP’s and it is using the best index for the delete operation based on the explain plan. Moreover, I did reduce the batch size to 5000 and introduced sleep in the SP so that the delete process is not continuous. The single call of this SP will delete records in the range of 2 million in batches. Until I introduced sleep in the SP, the delete process was causing continuous system lock on the other nodes holding all transaction in it until the SP gets executed. With sleep I was able to introduce a small gap so that the other nodes can catch up or can process the DML’s it receive.

What amuse me is why a small batch delete is causing system lock on other nodes. Is this an intended behaviour of Galera replication ? This system lock does not happen for batch inserts. Is this a bug ? Just curious.

Regarding support, we are still evaluating PXC and the servers are pre-prod. So, we have not decided if PXC is the best solution for us.

It will be of great help if you can share if this behaviour (system lock) is an intended one or not.

With Regards,
Raghu