I have an InnoDB type instance with many million rows:
±----------------±---------------------±-----±----±---- ----------------±---------------+
| Field | Type | Null | Key | Default | Extra |
±----------------±---------------------±-----±----±---- ----------------±---------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| title | varchar(255) | YES | | NULL | |
| description | text | YES | | NULL | |
| full_story | text | YES | | NULL | |
| uri | varchar(255) | YES | MUL | NULL | |
| pubdate | datetime | YES | | 0000-00-00 00:00:00 | |
| pubdate_reverse | int(9) | YES | | NULL | |
| feed_id | int(5) | YES | MUL | NULL | |
±----------------±---------------------±-----±----±---- ----------------±---------------+
I’m doing an ETL on this data to externalize some of it - but the extract process is taking forever and I’ve determined its IO bound.
What can I do to speed things up? I really want to dump the entire table contents to files. But I wanted to do it in individual files that made the transform and load easy - so I was querying and generating a file - in a loop.
Here is an example query:
Select id, title, uri, pubdate from entries where feed_id = 151 order by pubdate_reverse limit 1000;
1000 rows in set (3 min 34.11 sec)
Yes that’s 3+ minutes for 1000 rows. I’m iterating int he value of feed_id increasing it for each query.
Explain the query:
+—-+————-+———+——+——————-+— —————-+———+——-+——-+————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+———+——+——————-+— —————-+———+——-+——-+————-+
| 1 | SIMPLE | entries | ref | index_feed_recent | index_feed_recent | 5 | const | 18032 | Using where |
+—-+————-+———+——+——————-+— —————-+———+——-+——-+————-+
Seems fine.
Looking at iostat while the query is running:
avg-cpu: %user %nice %system %iowait %steal %idle
0.51 0.00 0.20 54.55 0.00 44.73
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 255.00 221.80 70.60 49.00 2580.80 2159.00 39.63 3.88 32.34 7.66 91.60
sdc 235.20 216.80 65.40 47.80 2360.00 2104.60 39.44 3.82 33.53 7.23 81.80
dm-0 0.00 0.00 628.00 535.20 5024.00 4263.60 7.98 39.04 33.41 0.85 99.20
I see 91-99% disk utilization which makes me think the process is IO bound.
Question is: what now?
Jim