One query on a big table vs multiple queries on multiple smaller ones

Hi!

I’m glad I found this website. The fact that Peter decided to start a forum makes me twice as glad ) The question I have is obviously related to MySQL performance, as the site’s name implies.

I need to estimate which one of the following approaches provides better performance.

I’m implementing a simple mapping table, which operates in terms of numerical object IDs (OID) and up to “N” numerical aliases (Alias1, …, AliasN) of different alias types (AliasType1, …, AliasTypeN). One OID can have zero or one Alias of the given AliasType associated with it. So, a pair of OID and AliasType uniquely identifies an Alias (but it may not exist at the time of the query). Also, a pair of an Alias and AliasType uniquely identifies an OID.

Since it is not very likely that any given OID has Aliases of all AliasTypes defined, I ruled out the most obvious implementation: a MySQL table with the columns (OID, Alias1, Alias2, …, AliasN), where every column would be a unique key. In that case the table could get rather sparse, and with tens of millions of records the efficiency would be rather pathetic.

Another approach is to create a single three-column table (OID, AliasType, Alias) with two indices: (OID, AliasType) and (Alias, AliasType). In this case I shrink the width of the table but multiply the row count.
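For illustration, that table could be defined like this (the column types are just my guess; given the uniqueness rules above, I believe both indices can be unique):

CREATE TABLE AliasMap (
  Oid BIGINT UNSIGNED NOT NULL,
  AliasType TINYINT UNSIGNED NOT NULL,
  Alias BIGINT UNSIGNED NOT NULL,
  UNIQUE KEY (Oid, AliasType),
  UNIQUE KEY (Alias, AliasType)
);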

And the last approach I thought of is a derivative of the above: I can split this long three-column table into N two-column tables, where every table would contain the mappings of an OID to an Alias of one particular AliasType.
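For example, the table for AliasType 9 would be something like this, with one such table per AliasType (again, the column types are a guess):

CREATE TABLE Alias9Map (
  Oid BIGINT UNSIGNED NOT NULL,
  Alias BIGINT UNSIGNED NOT NULL,
  UNIQUE KEY (Oid),
  UNIQUE KEY (Alias)
);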

Please correct me if I’m wrong in my expectations.

  1. I think that a lookup for a particular Alias or OID, given the AliasType and either the OID or the Alias, should be fast in the last design. This conclusion is based on the fact that the MySQL server would have to perform a lookup on a relatively small data set, with the luxury of having a unique index built for both types of queries.

  2. If I need to get all the aliases of all types that correspond to a given OID, the earlier design (with the three-column table) should provide significantly better performance, because I would have to fire N queries, one to each two-column table, versus a single query to the three-column table that returns a multi-row result set.

Am I correct in the statements above?
Maybe there is a better solution for MySQL than what I’m thinking of?

Thanks a lot!
/Sergey

Hi Sergey,

Just some quick thoughts/questions that might help )
(I’d be very interested to hear Peter’s thoughts on this too; it’s always good to get a better understanding of these things)

First, some questions to consider that will help determine which approach suits you best:

  • Do you expect the number of OIDs/aliases per aliastype to be distributed evenly across aliastypes?
  • Will data for some aliastypes be fairly static or will the data for all aliastypes be changing a lot?
  • How many different aliastypes do you expect to have? And is this number likely to change much?
  • What is your ratio of per-aliastype queries vs aliastype-independent queries?

I think if you have very different update/access patterns per aliastype then it might make sense to split entries into separate tables per aliastype. Even then, I’d probably only consider it if you need very very high query throughput, if queries to find aliases from OIDs are very infrequent, and if queries to find OIDs for a specific alias and aliastype are very frequent. (because as you say you’d need to do lots of separate queries or a large union/join if there’s a table per aliastype and you need to do an aliastype-independent query)

The hassle of adding new tables for new aliastypes and modifying the queries just doesn’t seem worth it, even if you automate it somehow. One advantage of splitting them up though, if you have very different access patterns per aliastype, is that you might be able to get more query cache hits on the tables whose data changes less.

Personally I’d start by benchmarking your first alternative approach (a single three-column table (OID, AliasType, Alias) with the two indices (OID, AliasType) and (Alias, AliasType)) to see whether it meets your performance needs or not. If it does (and I think it should, unless you’re dealing with very high throughput and very large data volumes) then it seems better to me because it offers:

  • A pretty standard approach to solving the problem.
  • Flexibility for different select queries.
  • Flexibility/simplicity when running updates, complex deletes etc.
  • Less impact when adding aliastypes.

OTOH, if you’re nearly always doing queries per-aliastype and you need very high throughput then maybe your second alternative is better.

Can you benchmark them and post your findings? I’d be interested to know…

HTH,
Toasty

Hi Toasty,

Thanks for the elaborate reply with some serious questions raised. Here are some answers and additional details.

  • Let’s say there are only 12 AliasTypes. The typical query would provide an Alias of one of 10 AliasTypes and request an Alias of one of the remaining two AliasTypes as the output. For example, given some Alias9 value, I have to provide the value of Alias1 via an intermediate OID lookup (see the SQL sketch below). This task can be accomplished via the following steps:
  1. Using the table that maps aliases of type 9 to OIDs, I retrieve the OID that corresponds to the Alias9 value.
  2. I use this OID to retrieve the desired Alias1 via the corresponding table.

So, roughly, two tables will experience five times the retrieval load of the remaining ten.

  • The number of AliasTypes is known and limited to approximately 40 types. The overhead of adding more tables as new AliasTypes are discovered is negligible, as it is a very rare case and the software can be shut down for maintenance of that sort.

  • Some tables are expected to have fairly static data, but I cannot really make any quantitative predictions yet.

  • The number of queries per second can be pretty high, on the order of 3000 retrieval queries per second. But it is hard to baseline this number, as it depends on the hardware configuration and the degree of its utilization by other applications.

  • There are no AliasType-independent queries, as any retrieval is qualified by an AliasType; otherwise the lookup would be ambiguous.

So, these are the additional important details that you pointed out. Hope it helps to define the problem better.
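To illustrate the two-step lookup from the first bullet in terms of the per-AliasType tables (the table names and values here are just placeholders):

-- Step 1: find the OID that corresponds to the incoming Alias9 value
SELECT Oid FROM Alias9Map WHERE Alias = 123456;

-- Step 2: use that OID to fetch the desired Alias1 value
SELECT Alias FROM Alias1Map WHERE Oid = 42;

Both statements are unique-index lookups.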

Or, let me take a step back and describe the problem in its original form. Maybe it could bring up some additional ideas on how to implement that in a better way than what I have described earlier.

So, I need to store information about up to 50 million objects with a predefined set of attributes. Each object should have a unique ID (I referred to it as the OID above) and a predetermined set of alternative IDs (the Aliases in the text above); the rest of the attributes are just the properties of objects that can be looked up but can never be used as retrieval keys. I receive information in portions, with one or more Aliases and zero or more Properties. After I receive every portion of information, I have to determine if it belongs to any existing object and either fill in the empty fields or create a new record with a unique OID. Later, I can be queried for any of the Aliases or Properties given one of the Aliases.

Here is an example. Let’s say I’m building a list of workstations in the network. Every workstation is uniquely identified by an IP address, the MAC address of the NIC, and a host name. All these attributes are Aliases. Every workstation can also have its brand, model, and OS name as Properties.

The straightforward solution would be to create a single table like that:

CREATE TABLE Workstations (
  Oid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  IpAddr INT UNSIGNED,
  MacAddr BIGINT UNSIGNED,
  HostName VARCHAR(32),
  Brand VARCHAR(32),
  Model VARCHAR(32),
  OsName VARCHAR(32),
  PRIMARY KEY (Oid),
  KEY (IpAddr),
  KEY (MacAddr),
  KEY (HostName)
);

If I receive a message that has only one Alias (IpAddr, for instance), I check if there is a record with this Alias. If such a record exists, I don’t do anything. Otherwise, I create a record that has only two non-NULL fields: Oid and IpAddr.
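For example, for an IP-only message the check and the insert would be something like this (the address is made up; INET_ATON converts the dotted form to the INT UNSIGNED value stored in IpAddr):

-- Does a record with this IP alias already exist?
SELECT Oid FROM Workstations WHERE IpAddr = INET_ATON('192.168.0.15');

-- If not, create a sparse record with only Oid and IpAddr filled in
INSERT INTO Workstations (IpAddr) VALUES (INET_ATON('192.168.0.15'));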

If I receive a message that has more than one Alias in it, I can safely assume that all these Aliases correspond to the same object. Thus, I can either create a new record if none of the Aliases exist in the table, or do nothing if all of them return the same record, or merge two or more records if lookups for all these Aliases one by one returned different records.

A similar approach can be used with the Properties.

Such an implementation seems to be inefficient when I have a sparse table where only a couple of fields are non-NULL. I don’t care as much about the amount of disk space, but I do care about the response time. That’s why I started to think of the alternative solutions that I described in the original message. Am I going in the right direction with it? Or can MySQL handle such sparse tables efficiently?

What if my table had about 15 aliases of 4 to 8 bytes each and about 400 bytes worth of properties combined? How would MySQL handle such a table if it had about 50 million sparse records of that sort, where about 1/3 of the aliases and properties would be known? Would it be any better than one of those more complex schemas shown in the original message?
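For a rough back-of-the-envelope estimate: 15 aliases at up to 8 bytes each is about 120 bytes, plus about 400 bytes of properties, gives roughly 520 bytes per fully populated row; at 50 million rows that is on the order of 26GB of data alone, ignoring NULLs, row overhead, and the size of the 15 or so indexes.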

Thanks for reading! )
It would be super nice if anybody would express an opinion on that.
Thanks again!

/Sergey

Hi Sergey,

Just a quick question to help me while I think about this. I’d just like to make sure that I’ve understood how you do your OID merging/collapsing.

Would it be possible, for example, that through partial input information you could have the same real-world object represented by, say, 10 separate OIDs, each with 4 different aliases defined? Would you then have to try to merge some or all of these OIDs if you got an insert/update request that had a new combination of these aliases provided together?

eg, say you start with:
OID1 only has values for alias1->alias4 = (a,b,c,d)
OID2 only has values for alias5->alias8 = (e,f,g,h)

OID10 only has values for alias37->alias40 = (w,x,y,z)

You then get an insert request that has (alias38=x, alias3=c), so you’d:
Look up alias38 to see if ‘x’ exists. (‘x’ is assumed to be globally unique for alias38, I presume?)
Look up alias3 to see if ‘c’ exists. (‘c’ is assumed to be globally unique for alias3, I presume? etc)

You notice that they both exist, so you create a new OID:
OID11 has values alias1->alias4 = (a,b,c,d) and alias37->alias40 = (w,x,y,z)
AND delete OID 1 and OID 10

I take it that as part of creating OID11 from merged values of OID1 and OID10 you’d never get conflicts of property (non-alias) values?
I also take it that every alias value is unique so you could never get alias value clashes on OID merge?

Cheers,
Toasty

Hi Toasty,

You are right on target. Here are a few more details that pertain to that “merge/collapse” scenario:

  1. It is not yet determined whether a merge creates a new record with a new OID, or whether it reuses one of the existing records, just fills in some of its fields, and drops the other record.

  2. The above scenario becomes more funky if I get a contribution that has three or more aliases (e.g., Alias3, Alias20, and Alias37). In that case I have to merge three records, and I either have to determine which one of the three OIDs to use or, as you suggested, create a completely new row with a new OID. I like the latter approach better for its generality, but I need to think about whether it is OK for the purpose of the application.

  3. The merge scenario may also run into a hard-to-handle condition when the two rows being merged have overlapping aliases defined and the values of the overlapping aliases conflict. This may very well be a synthetic scenario, and I need to think more to say whether it is realistic at all.

/Sergey

One more question Sergey,

What are the ‘external’ operations that this database will be required to support (irrespective of how it’s designed)?

Will it mostly be used for inserting/merging/updating data about OIDs or will there also be a heavy select load in parallel?

From what you’ve said so far I’m not 100% sure whether the selects which will do lookups by alias are generated from the need to merge OIDs (i.e. to see if any matching data exists for a particular alias) or from genuine ‘external’ requests.

Tar,
Toasty

Thanks for your ideas, Toasty! Here are more details on the things you were wondering about.

While the database is still mostly empty, the majority of operations are going to be merges, insertions, and deletions (updates are covered by merges). As it collects more details about the objects, the balance will shift towards mostly select operations. In the limit, given that we know all the properties of all the objects, we would still observe some amount of merges and other data-altering operations going on, since the nature of the aliases is such that some of them are updated periodically and my storage has to learn these new mappings (for instance, in the example above, the IP address could have been reassigned by a DHCP server). For simplicity, we can assume that the old aliases expire once the new ones arrive.

Cheers,
/Sergey

OK I’ve had a bit more of a think about this and I think I’m changing my mind a bit )

I think the main thing to aim for is minimising the size of the data and indexes here as 50M objects with 40 unique properties (aliases) plus a bunch more non-searchable properties results in a fairly big database. Combine that with a 3K/sec query rate and you definitely need to be careful. (and have a lot of RAM )

If you go with the original “one big table with a column per alias and a column per property” approach, then you’re going to end up with the problem of managing a >50M row, 41 col table with 41 unique indexes on it (one per alias, and one for the OID)

This will have really massive RAM requirements (you’ll want the indexes in RAM if you’re doing 3K queries/sec unless you’re willing to scale out/up your DB platform or partition your data) and is generally not a great approach IMO. (It’s possible that your ‘hot’ data is a very small subset of the data stored in your DB so you might get away with this, but in your case it doesn’t really sound like it)

Your alternative approach, where you have a “single three-column table: (OID, AliasType, Alias)”, falls down a bit I think, because I presume each alias could be of a different type? (e.g. what would you define the ‘Alias’ column to be in this approach? A varchar(255) just in case? ;] )
Otherwise it’s ok I guess, other than the fact that for 50M objects you’d have up to 50M * 40 rows (one for each alias of each object), so your table could have up to 2,000,000,000 rows (more likely 750M?). Not fun to manage, I imagine.

If you take your other approach - splitting alias values into separate tables by aliastype - then you’ll still have say 750M rows, just split across 40 tables. This makes the data more manageable from an admin POV, but the number of select queries you’re going to have to do increases roughly by a factor of N, where N is the average number of aliases given in any RequestData/UpdateData request you receive. In addition, upon OID merge, you’re going to need to run inserts and deletes per aliastype table, which means that you should probably use explicit transactions, and also that you’re going to have to run more, but smaller, DML statements (not usually a problem in itself).
All of these statements will basically be unique key lookups though, so will be super fast.
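eg, the OID11 merge from my earlier example might look something like this in the table-per-aliastype design (table names made up, and you’d need a transactional engine like InnoDB for the explicit transaction):

START TRANSACTION;
-- Re-point the alias rows of the two old objects at the new OID
UPDATE Alias3Map SET Oid = 11 WHERE Oid IN (1, 10);
UPDATE Alias38Map SET Oid = 11 WHERE Oid IN (1, 10);
-- ...and likewise for every other aliastype table holding rows for either
-- OID, plus merging the two rows of the non-alias properties table into
-- one row for OID 11.
-- (If both old OIDs had a row in the same aliastype table, the UPDATE would
-- hit the unique key on Oid - that's your alias-conflict scenario.)
COMMIT;

All of those are unique-index updates too, so they should stay cheap.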

So, although I have a niggly feeling that there’s some vastly better way to do this, it seems that the approach with one table for the OID and non-alias properties, plus another 40 tables keyed on alias value to look up the OID, is the way to go. You’re still going to have to do some serious micromanagement (be careful with your column definitions / tune your server and mysql config / etc) and have some fairly meaty hardware to get the performance you want, I think.

Another advantage of the table per aliastype approach is that if you really need to - and if you’re not bothered about transactional changes to the data (a big if!) - you could put the different tables on different servers and have your app direct queries appropriately.

Also, especially if you’re expecting a very high number of requests to be made to change the data, you’re going to have to be careful if you’re running multiple separate statements that objects haven’t been merged by one process in the middle of your current attempt to merge them, etc. I think this could happen with any DB design though, if requests that can merge data can be made in parallel.

As a disclaimer on all of the above - I’ve never worked on something exactly like your problem before, although I have done some similarish things. My approach each time is to set up a prototype, benchmark, observe, tweak, re-benchmark, etc. Eventually I’m either happy, or if not I try something different or, if I can’t think of a better approach, I look into scaling out the db platform or upgrading the db server, depending on how things look.

HTH anyway!
Toasty

PS If anyone else would like to join in with some thoughts on this I’d be interested to hear them

Toasty,

Please let me thank you once again for your constructive and elaborate feedback!

I have to admit that the numbers I mentioned have been elevated to account for future extension of the data set. One of the goals we are trying to achieve is the scalability of the solution. In this light, the solution with a number of primitive mapping tables seems to have much better “horizontal partitioning” possibilities as compared to having a single alias table, for example.

The other side of the story that you emphasize in your reply is the hardware limitations. The application I’m working on has to run on relatively weak hardware (2GB of RAM, 1 Xeon CPU, a SATA hard drive, Linux). Given that, the major design consideration is being able to distribute data across a little “farm” of these weak computers.

Ideally, I wish the MyISAM API were documented and available for direct use, bypassing the SQL engine part…

Cheers,
/Sergey