Improve a query to select random rows, but without repeating records

Dear Srs,

I have a table with near 1.000.000 rows, and I want to select random rows, but without repeating records.

If I don’t prevent repeating records MySQL works fine, this is a PHP code fragment:

$result = mysql_query(“SELECT MIN(id) AS min , MAX(id) AS max FROM tabla”);$rango = mysql_fetch_object($result);$rand_id = mt_rand($rango->min, $rango->max);$result = mysql_query(“SELECT * FROM tabla WHERE id >= $rand_id LIMIT 0,1”);

This is soo fast, MySQL rocks :wink:

For preventing repeating rows, I’m making this:

  1. There’s another column to mark if a record was selected previosly or not

  2. Select rows not market at selected

  3. Get the record and mark it as selected

  4. Go to 1)

The step 2) has a WHERE statement that slows de query to more than 10 seconds…

This is the table definition:

mysql> SHOW CREATE TABLE ip_assigned \G*************************** 1. row *************************** Table: ip_assignedCreate Table: CREATE TABLE ip_assigned ( id int(11) NOT NULL auto_increment, ip varchar(256) NOT NULL, assigned_time timestamp NOT NULL default ‘0000-00-00 00:00:00’, crawler int(11) NOT NULL, PRIMARY KEY USING BTREE (id), KEY index_ip (ip(16)), KEY index_crawler (crawler,ip)) ENGINE=MyISAM AUTO_INCREMENT=871539 DEFAULT CHARSET=utf81 row in set (0.00 sec)mysql>

This is the query to get not market records:

// Get COUNT(*) FROM cache, cache->total_ips$sql = “SELECT val AS total_ips FROM cache WHERE param = ‘total_ips’”;$result = mysql_query($sql);$cache = mysql_fetch_object($result);// Calculate a random offset for the LIMIT$offset = mt_rand(1, $cache->total_ips);// Get $total_requested_ips random IPs$sql = “SELECT ip FROM ip_assigned WHERE crawler = 0 LIMIT $offset,$total_requested_ips”;$result = mysql_query($sql);

This last query is very slow, because MySQL needs to scan a lot of rows, this is the output of EXPLAIN command:

mysql> DESCRIBE SELECT ip FROM ip_assigned WHERE crawler = 0 LIMIT 705607,10240 ;±—±------------±------------±-----±--------------±--------------±--------±------±-------±------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |±—±------------±------------±-----±--------------±--------------±--------±------±-------±------------+| 1 | SIMPLE | ip_assigned | ref | index_crawler | index_crawler | 4 | const | 813308 | Using index | ±—±------------±------------±-----±--------------±--------------±--------±------±-------±------------+1 row in set (0.00 sec)

What can I do to improve this query? perhaps another method/idea to prevent repeating random registers? Thanks!

Regards,

Dear Srs,

This is the output for mysqldumpslow:

~# mysqldumpslow /var/log/mysql/mysql-slow.logReading mysql slow query log from /var/log/mysql/mysql-slow.logCount: 3 Time=2.00s (6s) Lock=0.00s (0s) Rows=10240.0 (30720), root[root]@localhost SELECT ip FROM ip_assigned WHERE crawler = N LIMIT N,N

Regards,

I was comming here looking for how to do the same (effiecnty get random rows), your idea looks very promising, as for avoiding duplicates, how about just keeping track of the ‘done’ in php?

$result = mysql_query(“SELECT MIN(id) AS min , MAX(id) AS max FROM tabla”);$rango = mysql_fetch_object($result);$done = array(0);for($i=0;$i<10;$i++) { $rand_id = mt_rand($rango->min, $rango->max); $result = mysql_query(“SELECT * FROM tabla WHERE id >= $rand_id AND id NOT IN (”.implode(‘,’,$done).“) LIMIT 0,1”); $resobj = mysql_fetch_object($result); //do something with the object… $done = $resobj->id;}

(note the 0 in the definition of the array - that is so there is something - that will never match - you could do it by making IN clause only appear if something (amongst other ways) - but I like this simple way - should not add to the query time?)

– note this is untested - about to go and try it!

(edited as already spotted a mistake!)

Hey,

I too was faced with the same problem with over 17,000 rows,
i needed to return a rand() without duplicates but decided to use PHP for rand() when i got the array i needed.

So I ended up using limit clauses to grab the result set
then I used NOT IN condition to avoid duplicates

$feed_array=$dbc->query(“select m.market_feed_id from market_feed m WHERE m.market_feed_id not in(select p.processed_id from que_processed p) LIMIT 1000”);if($dbc->db_num_rows($feed_array) >0){ $count=0; $max_que=1000; while($Mvalues=$dbc->db_fetch_array($feed_array)) { $count++; // Insert into ads processed $insert_array=array(‘processed_id’=>$Mvalues[‘market_feed_id’], ‘processed_type’=>‘market’); $dbc->db_insert(‘que_processed’,$insert_array); if($count == $max_que) { break; } }}else{ echo ‘NO MORE IDS TO GRAB!!!

’;}

I have 2 tables

  1. Where the data exists
  2. A holding table (que_processed)

As you can see I have selected only the id’s
not in(que_processed)

So now i have 1000 fresh IDs to work with in an
array, to do all types of nasty crap too!

You should use a db_num_rows() > 0 to check if the
above query has processed all the id’s have been
processed are not

Preformance wise?? Not sure yet, but the code SNAPS
very fast, mabee its because im limiting 1000 at a time?

Have Fun
Rick
http://www.americashostingnetwork.com