Improve a query to select random rows, but without repeating records

santisaez · April 20, 2008, 2:50am

Dear Srs,

I have a table with near 1.000.000 rows, and I want to select random rows, but without repeating records.

If I don’t prevent repeating records MySQL works fine, this is a PHP code fragment:

$result = mysql_query(“SELECT MIN(id) AS min , MAX(id) AS max FROM tabla”);$rango = mysql_fetch_object($result);$rand_id = mt_rand($rango->min, $rango->max);$result = mysql_query(“SELECT * FROM tabla WHERE id >= $rand_id LIMIT 0,1”);

This is soo fast, MySQL rocks

For preventing repeating rows, I’m making this:

There’s another column to mark if a record was selected previosly or not
Select rows not market at selected
Get the record and mark it as selected
Go to 1)

The step 2) has a WHERE statement that slows de query to more than 10 seconds…

This is the table definition:

mysql> SHOW CREATE TABLE ip_assigned \G*************************** 1. row *************************** Table: ip_assignedCreate Table: CREATE TABLE ip_assigned ( id int(11) NOT NULL auto_increment, ip varchar(256) NOT NULL, assigned_time timestamp NOT NULL default ‘0000-00-00 00:00:00’, crawler int(11) NOT NULL, PRIMARY KEY USING BTREE (id), KEY index_ip (ip(16)), KEY index_crawler (crawler,ip)) ENGINE=MyISAM AUTO_INCREMENT=871539 DEFAULT CHARSET=utf81 row in set (0.00 sec)mysql>

This is the query to get not market records:

// Get COUNT(*) FROM cache, cache->total_ips$sql = “SELECT val AS total_ips FROM cache WHERE param = ‘total_ips’”;$result = mysql_query($sql);$cache = mysql_fetch_object($result);// Calculate a random offset for the LIMIT$offset = mt_rand(1, $cache->total_ips);// Get $total_requested_ips random IPs$sql = “SELECT ip FROM ip_assigned WHERE crawler = 0 LIMIT $offset,$total_requested_ips”;$result = mysql_query($sql);

This last query is very slow, because MySQL needs to scan a lot of rows, this is the output of EXPLAIN command:

What can I do to improve this query? perhaps another method/idea to prevent repeating random registers? Thanks!

Regards,

santisaez · April 20, 2008, 2:58am

Dear Srs,

This is the output for mysqldumpslow:

~# mysqldumpslow /var/log/mysql/mysql-slow.logReading mysql slow query log from /var/log/mysql/mysql-slow.logCount: 3 Time=2.00s (6s) Lock=0.00s (0s) Rows=10240.0 (30720), root[root]@localhost SELECT ip FROM ip_assigned WHERE crawler = N LIMIT N,N

Regards,

barryhunter · April 24, 2008, 9:49am

I was comming here looking for how to do the same (effiecnty get random rows), your idea looks very promising, as for avoiding duplicates, how about just keeping track of the ‘done’ in php?

$result = mysql_query(“SELECT MIN(id) AS min , MAX(id) AS max FROM tabla”);$rango = mysql_fetch_object($result);$done = array(0);for($i=0;$i<10;$i++) { $rand_id = mt_rand($rango->min, $rango->max); $result = mysql_query(“SELECT * FROM tabla WHERE id >= $rand_id AND id NOT IN (”.implode(‘,’,$done).“) LIMIT 0,1”); $resobj = mysql_fetch_object($result); //do something with the object… $done = $resobj->id;}

(note the 0 in the definition of the array - that is so there is something - that will never match - you could do it by making IN clause only appear if something (amongst other ways) - but I like this simple way - should not add to the query time?)

– note this is untested - about to go and try it!

(edited as already spotted a mistake!)

rmatakajr · June 5, 2008, 4:24pm

Hey,

I too was faced with the same problem with over 17,000 rows,
i needed to return a rand() without duplicates but decided to use PHP for rand() when i got the array i needed.

So I ended up using limit clauses to grab the result set
then I used NOT IN condition to avoid duplicates

$feed_array=$dbc->query(“select m.market_feed_id from market_feed m WHERE m.market_feed_id not in(select p.processed_id from que_processed p) LIMIT 1000”);if($dbc->db_num_rows($feed_array) >0){ $count=0; $max_que=1000; while($Mvalues=$dbc->db_fetch_array($feed_array)) { $count++; // Insert into ads processed $insert_array=array(‘processed_id’=>$Mvalues[‘market_feed_id’], ‘processed_type’=>‘market’); $dbc->db_insert(‘que_processed’,$insert_array); if($count == $max_que) { break; } }}else{ echo ‘NO MORE IDS TO GRAB!!!

’;}

I have 2 tables

Where the data exists
A holding table (que_processed)

As you can see I have selected only the id’s
not in(que_processed)

So now i have 1000 fresh IDs to work with in an
array, to do all types of nasty crap too!

You should use a db_num_rows() > 0 to check if the
above query has processed all the id’s have been
processed are not

Preformance wise?? Not sure yet, but the code SNAPS
very fast, mabee its because im limiting 1000 at a time?

Have Fun
Rick
http://www.americashostingnetwork.com

Topic		Replies	Views
Fastest way to Select 1 random result Other MySQL® Questions	1	412	April 28, 2009
Help order by rand before group by Other MySQL® Questions	6	1391	November 10, 2009
Speed up large table SQL select query. Other MySQL® Questions	3	755	September 28, 2008
PHP/MySQL slow_queries Other MySQL® Questions	14	1183	February 28, 2008
yet another query performance issue (maybe related to subselect?) Other MySQL® Questions	4	464	December 3, 2007

Improve a query to select random rows, but without repeating records

Related topics