I’ll start with some context. We have a table with around 687 million rows, and every night we import it into Hadoop with Apache Sqoop using a parallelism of 32 tasks. The table looks like this:
CREATE TABLE `attribute` (
  `resource_id` int(11) NOT NULL,
  `user_id` int(11) NOT NULL,
  `name_id` int(11) unsigned NOT NULL,
  `value_id` int(11) unsigned NOT NULL,
  `updated` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`resource_id`,`user_id`,`name_id`),
  KEY `updated` (`updated`),
  KEY `name` (`name_id`),
  KEY `value` (`value_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
As Sqoop does not support splitting the range by a multi-column primary key, it uses resource_id alone to create the splits for the 32 tasks. Here is an example of a resulting split condition:
SELECT `resource_id`, `user_id`, `name_id`, `value_id`, `updated` FROM `attribute` AS `a` WHERE ( `resource_id` >= 75 ) AND ( `resource_id` < 93 )
The resulting splits are extremely uneven: some contain only ~1,000 rows, while others hold nearly 100 million.
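To illustrate why the splits are so uneven: Sqoop divides the key range of the split-by column into equal-width intervals, not intervals of equal row counts. The following is my own simplified sketch of that idea (not Sqoop's actual `IntegerSplitter` code; the function name is illustrative):

```python
def integer_splits(min_val, max_val, num_splits):
    """Divide [min_val, max_val] on the split-by column into num_splits
    half-open ranges of equal WIDTH in key space. Row counts per range
    depend entirely on how the data is distributed over the keys."""
    step = (max_val - min_val) / num_splits
    # One boundary per split start, plus the overall maximum at the end.
    boundaries = [round(min_val + i * step) for i in range(num_splits)]
    boundaries.append(max_val)
    # Each pair becomes: WHERE resource_id >= lo AND resource_id < hi
    return list(zip(boundaries, boundaries[1:]))
```

If most rows cluster in a few narrow resource_id ranges, one task ends up with almost all the data while others get nearly none, which matches the skew we observe.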
We are now facing the problem that, for such a range, we sometimes get only a single row back, and the query finishes without any error. We tried different MySQL Connector/J versions (5.1.42, 5.1.46 and 8.0.11) and hit the problem with all of them. We also ran the same query from the MySQL CLI and, in rare cases, saw the same behavior there, so we ruled out a bug in the connector or in Sqoop. We also tried both REPEATABLE READ and READ COMMITTED as isolation levels and had the problem with both. As stated in the title, this happens only in rare cases, but we have no idea yet why the query would return exactly one row rather than an error or some random number of rows.
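For reference, checks of the kind we could run next might look like the following (identifiers taken from the table above; whether these would actually expose the issue is of course not guaranteed):

```sql
-- Count via the primary key, the access path the range query normally uses
SELECT COUNT(*) FROM `attribute` FORCE INDEX (PRIMARY)
WHERE `resource_id` >= 75 AND `resource_id` < 93;

-- Cross-check via a secondary index (it also contains the PK columns),
-- to see whether different access paths disagree on the row count
SELECT COUNT(*) FROM `attribute` FORCE INDEX (`name`)
WHERE `resource_id` >= 75 AND `resource_id` < 93;

-- Physical consistency check of the table and its indexes
CHECK TABLE `attribute`;
```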
The MySQL server (version 5.6.40-84.0-1.jessie) is only used for backups and data exports; it should only receive writes via binlog replication from another server.
I would really appreciate any hints on how we could investigate this further.