How to configure and work with a terabyte-size table

I'm using MySQL 8.0. I have a fact table with about 3 billion rows at 200 bytes per row, for a total of about 600 GB. MySQL 8.0 has been difficult to work with. I increased buffer sizes, disabled binary logging, partitioned the table, and switched to ROW_FORMAT=REDUNDANT (because compression was slowing down reads and writes).
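
For reference, the relevant settings look roughly like this (values are illustrative, not our exact configuration):

# my.cnf (illustrative values)
[mysqld]
innodb_buffer_pool_size      = 96G  # large buffer pool for the working set
innodb_buffer_pool_instances = 16   # reduce contention on a large pool
skip-log-bin                        # binary logging disabled (on by default in 8.0)
innodb_file_per_table        = ON   # one tablespace file per partition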

This is a fact table containing stock prices. It has five dimensions, which together form a unique primary key on the table. The dimensions of the primary key are:

  1. Data source (Bloomberg, Reuters, etc.)
  2. Time Interval (60 seconds, 15 minutes, 1 hour, etc)
  3. Market (NYSE, AMEX, etc)
  4. Security (IBM, MSFT, etc)
  5. UTC Timestamp (integer, seconds since 1970-01-01)

The table is partitioned along Data Source, Time Interval, Market, and Security (the first four of the five primary key dimensions). Data is typically read and written for a single data source / time interval / market / security at a time. Partitions are sized to be roughly 400-600 MB each, so there are more partitions for 60-second price-interval data than for 1-hour data, and all of the 24-hour prices fit in a single partition. There are about 1,000 partitions.

The table also has an indexed SERIAL sequence number column.
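Putting the pieces above together, a minimal sketch of the DDL (column names are taken from the DELETE statement below; the types, the PRICE payload column, and the use of KEY partitioning are assumptions - our real table sizes partitions explicitly):

CREATE TABLE STOCKPRICES (
    DATASRC          VARCHAR(16)     NOT NULL,
    INTERVAL_SECS    INT UNSIGNED    NOT NULL,
    EXCHG            VARCHAR(8)      NOT NULL,
    TICKER           VARCHAR(12)     NOT NULL,
    PRICE_TIMESTAMP  INT UNSIGNED    NOT NULL,  -- seconds since 1970-01-01, UTC
    PRICE            DOUBLE          NOT NULL,  -- hypothetical payload column
    SEQNO            BIGINT UNSIGNED NOT NULL,  -- sequence number; modeled as a plain index
                                                -- because a UNIQUE/SERIAL column would violate
                                                -- MySQL's rule that every unique key must
                                                -- include all partitioning columns
    PRIMARY KEY (DATASRC, INTERVAL_SECS, EXCHG, TICKER, PRICE_TIMESTAMP),
    KEY (SEQNO)
) ENGINE=InnoDB ROW_FORMAT=REDUNDANT
  PARTITION BY KEY (DATASRC, INTERVAL_SECS, EXCHG, TICKER) PARTITIONS 1000;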

The table is used for analytics. It resides on a RAID 0 array of SSDs with 8 GB/s of read and write bandwidth, on an 18-core high-end desktop. This is a very fast system.

We also have a cluster of 16 Ubuntu servers, each with 16 Xeon cores. We're open to moving MySQL to this cluster, but we're not convinced that hardware is the limiting factor.

Full table scans and random accesses are very efficient. Although MySQL doesn’t support parallel queries, I can get good parallel read performance by running multiple clients.
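For example, explicit partition selection lets each client scan a disjoint slice of the table (the partition names here assume MySQL's default p0...pN naming):

-- Client 1 scans one slice of partitions...
SELECT COUNT(*) FROM STOCKPRICES PARTITION (p0, p1, p2, p3);
-- ...while client 2 scans another slice, and so on.
SELECT COUNT(*) FROM STOCKPRICES PARTITION (p4, p5, p6, p7);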

Although the number of partitions is high, we found that fewer, larger partitions were even more problematic.

The only problem is that deletes are ridiculously slow, even when deleting a single row, e.g.:

DELETE FROM STOCKPRICES
WHERE DATASRC = 'BLOOMBERG'
  AND INTERVAL_SECS = 60
  AND EXCHG = 'NYSE'
  AND TICKER = 'IBM'
  AND PRICE_TIMESTAMP = 1234567890;

It seems that MySQL doesn't use indexes for deletes. Deleting a single row by value takes 7 hours and appears to include a full table scan. However, a SELECT that performs a full table scan takes only 15 minutes, so something else must be going on as well.
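
One way to check what the delete is actually doing is EXPLAIN, which MySQL also supports for DELETE statements:

EXPLAIN DELETE FROM STOCKPRICES
WHERE DATASRC = 'BLOOMBERG'
  AND INTERVAL_SECS = 60
  AND EXCHG = 'NYSE'
  AND TICKER = 'IBM'
  AND PRICE_TIMESTAMP = 1234567890;
-- A healthy plan prunes to a single partition and shows type=range (or const)
-- on the PRIMARY key; type=ALL across every partition means a full scan.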

The table is partitioned InnoDB with ROW_FORMAT=REDUNDANT (we disabled compression because our disks / CPU / IO channels are very fast; compression was much slower because it throttled on CPU). We use innodb_file_per_table (i.e., one file per partition).

We have a price-fetcher process that loads prices and updates this table about once per hour. About 95% of the price-fetcher statements are INSERTs and about 5% are UPDATEs, but we occasionally have DELETEs.
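
Since the primary key is unique, the fetcher's INSERT/UPDATE mix can be expressed as a single upsert per row; a sketch, using the hypothetical PRICE column from the DDL above (literal values are made up):

INSERT INTO STOCKPRICES
    (DATASRC, INTERVAL_SECS, EXCHG, TICKER, PRICE_TIMESTAMP, PRICE, SEQNO)
VALUES ('BLOOMBERG', 60, 'NYSE', 'IBM', 1234567890, 142.17, 1)
ON DUPLICATE KEY UPDATE PRICE = VALUES(PRICE);  -- update in place if the PK already exists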

We have a few questions for the community:

  1. Has anyone worked with tables larger than 1 TB in MySQL 8.0? What is the best way to work with them?

  2. Any advice on partition size? Would we be better off with fewer, larger partitions?

  3. MySQL 8.0 removed support for partitioned MyISAM tables; would we be better off downgrading to 5.6 or 5.7 with partitioned MyISAM?

  4. Do the Percona MySQL extensions work better with terabyte-plus tables? If so, which extensions?

  5. Are there better open source solutions for such a large table (such as MongoDB / Spark SQL / etc)?

Thanks in advance.

Loscalzo,

Indeed, the workload you describe might be problematic for MySQL.
For what you describe - historical stock-price data with analytics run over it - I have seen multiple successful deployments of ClickHouse. You may want to take a look: [URL]Fast Open-Source OLAP DBMS - ClickHouse
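
To give a sense of the fit, roughly the same table in ClickHouse (carrying over the column names from your MySQL schema; this is a sketch, not a tuned design):

CREATE TABLE STOCKPRICES
(
    DATASRC          LowCardinality(String),
    INTERVAL_SECS    UInt32,
    EXCHG            LowCardinality(String),
    TICKER           LowCardinality(String),
    PRICE_TIMESTAMP  DateTime,
    PRICE            Float64
)
ENGINE = MergeTree
ORDER BY (DATASRC, INTERVAL_SECS, EXCHG, TICKER, PRICE_TIMESTAMP);
-- MergeTree stores rows sorted and compressed by this key, so scans over one
-- source/interval/market/security range become sequential reads.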