Huge table concern

Hi,

I have a single table that stores the file information for all users. Over time I expect to see heavy user traffic on the website, and I am concerned about the size of the single table that would store all the file information. A user can have an unlimited number of files, so the table would eventually grow very large and affect performance.

I considered manual partitioning: users with ids 1 to 50 would be allocated one table for their file information, users with ids 51 to 100 another table with the same structure, and so on.
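
A rough sketch of what I mean (simplified names here, not my actual schema):

CREATE TABLE file_info_1_50 (
    fileId   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    userId   INT UNSIGNED NOT NULL,     -- only users with id 1 to 50 go in this table
    fileName VARCHAR(255) NOT NULL,
    PRIMARY KEY (fileId)
) ENGINE=InnoDB;

-- same structure for the next range of users; the application picks the right table
CREATE TABLE file_info_51_100 LIKE file_info_1_50;
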
However, I am at a stage where the whole application is ready, and any such change would have a major impact on all the modules. Hence I do not want to follow this approach.
I have a server with the following configuration: “MYSQL Master - 2 x QUAD Core XEON E5410 - 300GB SAS - 2GB RAM”

Can somebody please guide me on what measures I can start taking at this stage, so that we are prepared well before the number of users becomes huge?

To properly scale a system, you need to understand its expected throughput. For example, you should know the expected average and worst-case values for new users joining the system over time, users adding files to the system over time, etc. Use this information to profile the current system without any changes to software, hardware, or configuration, determine how much time the current setup buys you, and plan your next steps from there.
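
Purely as a sketch (the table and column names are assumptions about your schema), you could snapshot a few numbers periodically and chart the growth:

-- row count and on-disk size of the file table; run this regularly and keep the output
SHOW TABLE STATUS LIKE 'user_files';

-- new files per day, assuming you store a creation timestamp
SELECT DATE(createdAt) AS creation_day, COUNT(*) AS new_files
FROM user_files
GROUP BY DATE(createdAt);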

Without some idea of your data sizes or usage patterns, it's hard for me to suggest anything meaningful. You have to accept a few things: you need to put in work to measure and profile your system, otherwise you're making decisions blindfolded; you may need to change your code, so start reorganizing it so that these changes are as localized as possible; and your 2 GB of RAM will not last you very long, as you want as much data as possible to fit in memory.

I hope that helps.

Hi,

Thanks for the reply. I understand your point.

Regarding steps to measure the system: with one user in the system as of now, I see approximately 200 files. So if I assume 200 users, that is 40,000 file records, and each month there should be an increase of about 100 users, so the file table will keep growing.
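
To base this on measurements rather than guesses, I plan to run something like the following on the current data (simplified table and column names for the example):

SELECT COUNT(*)                          AS total_files,
       COUNT(DISTINCT userId)            AS total_users,
       COUNT(*) / COUNT(DISTINCT userId) AS avg_files_per_user
FROM user_files;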

Can you please give me some guidance on the partitioning approach I suggested in my first post: is it a good solution? Or can you suggest a better approach we should implement? From reading most of the performance posts, I understand that partitioning is a good way to load only the essential data that can fit in memory. I am using MySQL 5.0. Can you give me some insight on this?

Partitioning can help in a variety of ways. It’s at its best when there’s only one partition actively used. By this, I mean that recent or hot data, which is accessed often, is kept in one or very few partitions, and older data is essentially archived in the other partitions.

If your access patterns don’t match this scenario, partitioning can still be helpful if different partitions reside on different disks. This allows you to spread your IO around for greater throughput.

If you have one disk and you need to access all your partitions very often, then using partitioning will actually hurt your performance, and you’re better off with one large table, better hardware, or more machines.
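
For reference, here is roughly what that date-based scheme would look like with MySQL’s built-in range partitioning. Note that native partitioning only arrived in MySQL 5.1, so on 5.0 you would have to emulate it with separate tables (or a MyISAM MERGE table); all names below are placeholders:

CREATE TABLE user_files (
    fileId    INT UNSIGNED NOT NULL,
    userId    INT UNSIGNED NOT NULL,
    fileName  VARCHAR(255) NOT NULL,
    createdAt DATE NOT NULL,
    PRIMARY KEY (fileId, createdAt)  -- the partitioning column must be part of every unique key
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(createdAt)) (
    PARTITION p2008 VALUES LESS THAN (2009),
    PARTITION p2009 VALUES LESS THAN (2010),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

Old partitions can then sit mostly untouched on disk while the current one stays hot in memory.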

So let’s take a look at a year out. You should have 1,200 users and 240,000 file records. If your records are on average 1k in size, then your table is going to be at least 240 MB, which is no problem for a machine with 2 GB of RAM, even if that entire table is accessed very often. At a 10k row size, you’re starting to get into trouble with a 2.4 GB table, but if enough of the data is accessed infrequently, and you don’t have any other large tables, you’re still OK. As the row size increases, you get into more and more trouble, but it really depends on how often these rows are accessed.

Don’t start thinking about partitioning etc. yet.

Start by looking at whether you are going to have a problem or not.

Do as vgatto did: make estimates of the small things you can estimate, and then do calculations on those estimates to get an overall picture of what your database will look like in the future.
In your case, looking at the figures vgatto reached in his estimation, it sounds like it will still be a pretty modest database.
Even if we go with the larger row size of 10k, your database is still only 2.4GB, and buying a lot more RAM than 2GB is still a very cheap solution compared to the extra work hours a more complex solution and rework of the application would cost.

The really basic rule is that, if possible, you should have enough RAM on your server that the database can be kept entirely in memory.
Disk IO is so much slower than accessing RAM that you can almost never go wrong as long as you have the database in memory, and RAM is cheap these days.
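
A quick way to check how big your data and indexes actually are, and how much memory MySQL is currently allowed to use for caching (the schema name is a placeholder):

SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024, 1) AS total_mb
FROM information_schema.TABLES
WHERE table_schema = 'your_db'
ORDER BY total_mb DESC;

-- compare the totals against the caches you have configured;
-- which one matters depends on your storage engine
SHOW VARIABLES LIKE 'key_buffer_size';          -- MyISAM index cache
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';  -- InnoDB data + index cache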

After that you should take a quick look at the queries your application is issuing to the database, and check that these queries use indexes in a good way, both for finding rows (WHERE) and for returning them in the right order (ORDER BY).
The same query can differ by a factor of 1000 or more in execution time and server load depending on whether it uses indexes well or not at all.
That’s why I think this is the main step when looking at database performance; none of the other things you look at will come anywhere near it in impact.
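
EXPLAIN shows you right away whether a query is using an index (the table and column names here are just an example, adapt them to your schema):

EXPLAIN SELECT fileName
FROM user_files
WHERE userId = 42
ORDER BY createdAt DESC;

-- if "type" says ALL and "key" is NULL, the query scans the whole table;
-- an index covering both the WHERE and the ORDER BY would look like this:
ALTER TABLE user_files ADD INDEX idx_user_created (userId, createdAt);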

Hi,

Thanks a lot for your time, valuable suggestions and knowledge sharing.
Regarding the table that stores the files and is eventually going to grow huge: the record size will be approximately 600 bytes, and no columns will be added to this table at any time in the future, which assures that the row size will remain less than or equal to 600 bytes.

Hence, considering the analysis provided by both of you, I will not go for the complex solution and the rework of the entire application, and will instead increase the RAM as the number of users increases.

Even if I consider 1200 users storing 500 files each, that would be 600,000 records. With 600 bytes per record, the table size would be about 343MB, which can fit in memory.

So next, I need to consider indexes. I have the fileId column in the table, which is set as the primary key, so as I understand it is indexed by default. fileId is the only column frequently used when searching the table. So in your opinion, do you think I need to set up any explicit indexes?

It sounds like you can probably get by with only the primary key and no secondary indexes. Just be aware that any query not using an index will be forced to read the entire table, which even in the 300MB+ case will only take a few seconds if the table is already in memory. If you start seeing more and more of these unindexed queries, you can always add an index later, since it seems like you’ve got some extra memory and can probably afford it.
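
If you would rather catch those unindexed queries as they appear instead of guessing, the slow query log can flag them for you. Something along these lines in my.cnf would do it (the option names are standard MySQL ones; the path and threshold are just examples):

[mysqld]
log-slow-queries              = /var/log/mysql/slow.log
long_query_time               = 2
log-queries-not-using-indexes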

Thank you for the suggestion. The entire thread has been a very valuable discussion for me.