Help, I am burning trough my NVME SSDs, 60GB MySQL databases (web crawler)

I made a web crawler that is split on 3 computers, each of them makes many db queries, I think around 200 queries per second (each server) non-stop (I might be wrong, I had to stop it for now).

I am using Centos 7.9 and Mysql 5.7 Community Server.
The database is around 40-60 GB on each server, all tables are InnoDB
Total records are 100 million urls split on 3 servers, 100 million “links”, 100 million url_meta.
13 million domains, etc.
Each table is around 15 GB on each server (links, urls, url meta, etc).
CPUs are Ryzen 5 3600 with 12 threads
64 GB ram on each server
nvme SSDs:
1x GIGABYTE GP-ASM2NE6500GTTD 500GB (not “enterprise” type)
2x KINGSTON SEDC1000BM8480G “enterprise” nvme.

My biggest concern now is that the Gigabyte nvme shows as 30% wear, after just 3 months. The kingston enterprise ones says they are at 2%

raid command smartctl says I read around 10 TB and I wrote around 70 TB on each nvme.

I had my innodb_buffer_pool_size set to around 1 GB if I remember correctly, I increased it now and forgot previous value :confused:

The crawler constantly reads urls to crawl from the table of 30-40 million records on each server, sorts them by last crawl date ASC, reads remote url content, updates url_meta in database (title, description, etc). Updates links found in that url, link titles, etc.

This quickly makes the tables very fragmented and unless I run “optimize table”, they return queries very slowly.

I tried creating a copy of 2-3 most important tables and only update that once a week, and use it for reads, so it remains defragmented.
This is when I noticed the worn SSD, the 2 servers with enterprise Kingston nvme completed (copy +optimize tables) in 3 hours, the Gigabyte one in 9 hours.

A search query in the crawler is returned in 10 seconds using the “live” fragmented tables, vs around 0.2 seconds after tables are optimized/defragmented.

What should I do in order to optimize this and avoid destroying the nvmes ?
I am thinking eider:

  1. try a hardware setup with HDDs and only use the nvme SSDs for read-only cache. Do I have a chance to run all these queries from HDDs ? :frowning:
  2. optimize all the caching options in order to write to disk as rarely as possible. Can I get tips on this please ?
  3. just use a SSD with more TBW ?

For my second option … I am not familiar with the tuning options at all, which ones should I look into besides innodb_buffer_pool_size ? 32GB out of the 64 GB is a good start for this situation ?
And I seen there are some options that control how often the cached data is “flushed” / written to SSD ? Can I get some info on this please ?
Ideally I would like it to use the ram as much as possible and write very rarely to SSD.
Losing data is not a huge deal but I would lose time while crawling it again.

If I switch to HDDs because of all the write commands, would the 64 GB memory help ? Or will the queries become unusable slow ?
I seen a raid card with flash cache and HDDs is faster than HDDs alone, but raid card flash cache worns just like the SSDs, no ?!

I am kind of lost :woozy_face:

1 Like

I had my innodb_buffer_pool_size set to around 1 GB

This is your main problem. The buffer pool holds both read-cache and is the staging area for writes. So, if this buffer is too small, you will lots of little writes when the pool needs to make room. The buffer pool should be 80-90% of your total system RAM, but not larger than your active dataset. With 64GB of RAM, I would set this to 56GB, leaving 8GB for OS/filesystem buffers.

Next, make sure innodb_log_file_size matches 1/2 of your writes MB/hr (yes, per hr).

A search query in the crawler is returned in 10 seconds

This sounds like you are not correctly indexed. Use EXPLAIN on all of your SELECTs to determine if you see ‘ALL’ anywhere in the results. If so, this is full-table scan and you need some indexes.

Can’t I edit the initial question ? After starting my cron jobs again, I seen it was actually 3k-4k queries per second non-stop, not 200 as I initially posted. This seems important.

I have 9 indexes (link_date, link_text, link_sharding, link_maintenance_date, etc). Tracking many things, maybe here is my problem.
The above indexes are for links table, this is most important query:

SELECT * FROM links WHERE match(link_text) against('blue socks' IN BOOLEAN MODE) LIMIT 2000

I increased the innodb_buffer_pool_size to 32 GB (out of 64 GB available), I have some memcached data too, I will keep an eye to see if I can increase more.

Explain for above query returned this:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE links NULL fulltext link_text link_text 0 const 1 100.00 Using where; Ft_hints: no_ranking, limit = 2000
1 Like

The more indexes you have the more WRITE activity will happen per INSERT/UPDATE/DELETE. Only have indexes for specific queries. If an index is not used by any queries, remove it. You can also change innodb_flush_log_at_trx_commit=2 to change flush to disk every 1s instead of on every commit.

Thanks, I will also try the innodb_flush_log_at_trx_commit=2,
Can I make it “flush to disk” even less often ? e.g every few minutes/hours ? If I lose data is not a big deal.

I added many indexes like for example link_domain so I quickly select all links to a given domain.
But this select runs in background, is not critical to be fast, but I thought it makes it go trough less data. Do you think I should NOT add indexes for similar situations ?

1 Like

You can set innodb_flush_log_at_timeout=X where X is number of seconds between flushes. Values between 1 and 2700 are allowed. I’ve never heard of anyone making this larger than 3 so you are in uncharted waters.

Indexes are important for reading. Without them, you’ll have to scan/read from disk more data than necessary. But indexes add writing overhead. It’s a balance you have to determine for yourself.

1 Like

I seen many tuning parameters contain log_ in them (like innodb_flush_log_at_timeout), do these strictly refer to saving/buffering the log files or it refers to actual data too ?

1 Like

actual data. “log” in this context refers to innodb’s redo logs.

2 Likes

I hope that’s my last question … :slight_smile: Can you tell me what software is used to create the actual chart images of sysbench benchmarks like attached ?

1 Like

I don’t know. I’ve never made graphs like that.

1 Like