Random user search


I have some problems designing a DB schema and query for a random user search based on simple search conditions.

Users can execute a search by selecting conditions (sex, age, region and price), and the search function will display a random user meeting those conditions.

I want to design a scalable architecture for a very large number of users, so I am considering sharding and replication for ease of maintenance and cache efficiency.
If I want to shard this kind of table, which sharding approach is best?

My first idea is sharding the search tables by the combination of search-condition values. For example, I want to keep all the data of users who are male, living in New York, aged 15~20 and in price range p1 in a table named s_data_m_ny_15~20_p1.
My idea is splitting the search tables according to the actual users' data. A global dictionary table will record which search table in which shard matches each combination of search conditions and route every query to the relevant search table.
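A minimal sketch of such a global dictionary table, assuming one routing row per condition combination (all names and column types here are hypothetical):

```sql
-- Maps a search-condition combination to the shard and table holding it.
CREATE TABLE search_dictionary (
  sex        CHAR(1)     NOT NULL,            -- 'm' / 'f'
  region     VARCHAR(16) NOT NULL,            -- e.g. 'ny'
  age_range  VARCHAR(8)  NOT NULL,            -- e.g. '15~20'
  price_band VARCHAR(8)  NOT NULL,            -- e.g. 'p1'
  shard_host VARCHAR(64) NOT NULL,            -- which server holds the table
  table_name VARCHAR(64) NOT NULL,            -- e.g. 's_data_m_ny_15~20_p1'
  PRIMARY KEY (sex, region, age_range, price_band)
);
```

The application would look up (sex, region, age_range, price_band) here first, then send the actual search to shard_host.table_name.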
If I use the MyISAM engine, I can manage several MERGE tables that cover the sub-tables with overlapping search conditions (for example, the table for all men would be named s_data_m).

I prefer to use InnoDB. With InnoDB, can I use MySQL partitioning for the same purpose the MERGE engine serves in the scenario above?

The other reason I keep data in per-condition search tables is random selection. I want to execute an ORDER BY RAND() select against a search table, but this may be slow as the data size grows.
So instead, a random integer is generated in PHP and the query just selects the row whose primary key is closest to that random number.
But in this random-select scenario, how do I deal with fragmentation? I will define an auto-increment INT column as the primary key of each search table. If a user modifies his information, the search table matching the new information must take over that user's row, so a delete from the old table and an insert into the new table is required, leaving a hole in the old table. After a while, there may be many holes in the primary-key integer range.
If there are many holes, the random select may not find a row even though rows exist. How should I handle this problem for random selects?

If I split the search tables across several shards, I must maintain redundant overlapping tables, since neither InnoDB partitioning nor the MERGE engine can span tables that live in different shards (physically separate servers).

Would someone be so kind as to share some insight on this kind of problem?

I think the implementation of my scenario will be too complicated.

For simplicity, old data or data of inactive users need not be maintained in the search tables.

My explanation of my problem may be unclear.

Thanks in advance for any good insightful opinion.

If you don’t consider partitioning, picking a random user by primary key should be easy:

SELECT * FROM users WHERE id >= my_random_guess AND my_other_criteria ORDER BY id LIMIT 1

You don’t need ORDER BY RAND(), which will perform terribly, and you don’t need to worry about id fragmentation, since the greater-than comparison takes care of that. You just need a rough idea of the maximum id (which you can get from SHOW TABLE STATUS). If you’ve got shards and your data is distributed evenly within those shards, you can just randomly pick a shard to execute the query on. The same goes for MERGE tables or even partitions, though you’ll have to do something clever to confine your query to a single partition.
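Putting that together, a sketch of the whole pick (table and column names are illustrative; max_id is the approximate value from SHOW TABLE STATUS):

```sql
-- Guess a random id, then take the first matching row at or above it.
-- ORDER BY id LIMIT 1 makes the query immune to holes left by deletes.
SET @guess = FLOOR(1 + RAND() * @max_id);

SELECT *
  FROM users
 WHERE id >= @guess
   AND sex = 'm' AND region = 'ny'   -- your other criteria
 ORDER BY id
 LIMIT 1;
```

If the guess lands above the last matching row, the query returns nothing; a simple fallback is to retry with the comparison flipped (id < @guess ORDER BY id DESC LIMIT 1). Note this scheme is only approximately uniform: rows sitting just after large id gaps are picked more often.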

To get substantial benefit out of sharding, merge tables or partitioning, you need to make sure your common case has you accessing only one shard, underlying table or partition at a time. If you always have to access every shard, you’re really losing all the benefit while still incurring the maintenance cost and the additional code complexity. For example, if you have a log or reporting table, it’s usually the case that more recent data is accessed more frequently. Older records can be moved to another shard/table/partition, and then your common case ends up with smaller indexes and data, and runs faster.
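As an illustration of confining a query to one partition in the log-table case: with RANGE partitioning on a date column, putting the partition key in the WHERE clause lets the optimizer prune to a single partition (all names here are hypothetical):

```sql
-- Recent rows live in p_cur, old rows in p_old.
CREATE TABLE report_log (
  id     INT  NOT NULL,
  logged DATE NOT NULL
) PARTITION BY RANGE (TO_DAYS(logged)) (
  PARTITION p_old VALUES LESS THAN (TO_DAYS('2008-01-01')),
  PARTITION p_cur VALUES LESS THAN MAXVALUE
);

-- The date predicate lets the optimizer touch only p_cur;
-- EXPLAIN PARTITIONS will confirm which partitions are read.
SELECT COUNT(*) FROM report_log WHERE logged >= '2008-06-01';
```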

When picking the method you use to segment your data, consider your ability to control that dimension. Attributes like gender are a bad choice. First off, you’ve only got two values, so you can only have two segments. Also, you may not have an even distribution between male and female, so one of your databases is constantly overloaded while the other has very little load.

Even segmenting on something like the modulus of the user_id can be tricky. The advantages are that it’s simple to determine where a user is, and it auto-balances the data when you add a new shard. Unfortunately, all users are not created equal: they may be more or less active, or have more or less data, which can lead to certain segments requiring more resources than others.

It really depends on your common data path and how you see your data set growing over time.

Thanks for your kind reply~

Originally I considered sharding and partitioning, but now I have decided to use the MERGE engine, for a few reasons.

  1. Partitioning is supported only in MySQL 5.1, which is not yet a production release.
  2. The number of rows will be kept within some limit (maybe only the data of users active during the last 3 days), so I no longer consider sharding for this kind of search table.
    The main purpose of the MERGE engine is easy deletion of old data with no overhead: I can DROP TABLE an old sub-table after changing the MERGE union list.
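The rotation described above could look something like this (table names are hypothetical; all sub-tables must share an identical MyISAM schema):

```sql
-- Rotate the oldest day out of the MERGE table, then drop it.
-- DROP TABLE is far cheaper than DELETE FROM on millions of rows.
ALTER TABLE user_search
  UNION = (s_data_day2, s_data_day3, s_data_day4)
  INSERT_METHOD = LAST;

DROP TABLE s_data_day1;
```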

There are some other issues with my problem.

I want to reduce replication overhead by reducing update overhead on the search table.
I plan to update the search table periodically with a backend cron script to cut down the number of inserts/updates.
After checking the manual and testing REPLACE into a MERGE table, I found that REPLACE does not work for MERGE tables.
Is there a good alternative to REPLACE for a MERGE table?
Otherwise I must select from the MERGE table to check whether there is already a row for a user who was recently active.
If a row exists for that user, I must delete the old row from the physical sub-table and insert a new row with the new timestamp.
If I have to delete and insert manually, there is no benefit from a multi-row REPLACE query like the one below. I want to reduce the overhead of replication and log flushing.

REPLACE INTO user_search SELECT * FROM queue WHERE last_active_timestamp > beginning AND last_active_timestamp <= end

If REPLACE worked for MERGE tables, the above query could be used to reduce the number of REPLACE statements.
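Since REPLACE is rejected on MERGE tables, one alternative under the same setup is a DELETE through the MERGE table followed by a batch INSERT into it (new rows land in the sub-table named by INSERT_METHOD). This is a sketch; table names, columns and the @begin/@end placeholders are illustrative:

```sql
-- Remove any stale rows for users who were active in the window ...
DELETE FROM user_search
 WHERE user_id IN (SELECT user_id FROM queue
                    WHERE last_active_timestamp > @begin
                      AND last_active_timestamp <= @end);

-- ... then insert the fresh rows in one statement, which keeps the
-- replication and log-flush cost close to a single multi-row REPLACE.
INSERT INTO user_search
SELECT * FROM queue
 WHERE last_active_timestamp > @begin
   AND last_active_timestamp <= @end;
```

This is two statements instead of one, but it still avoids the row-at-a-time select/delete/insert loop described above.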