I made a web crawler that is split on 3 computers, each of them makes many db queries, I think around 200 queries per second (each server) non-stop (I might be wrong, I had to stop it for now).
I am using Centos 7.9 and Mysql 5.7 Community Server.
The database is around 40-60 GB on each server, all tables are InnoDB
Total records are 100 million urls split on 3 servers, 100 million “links”, 100 million url_meta.
13 million domains, etc.
Each table is around 15 GB on each server (links, urls, url meta, etc).
CPUs are Ryzen 5 3600 with 12 threads
64 GB ram on each server
1x GIGABYTE GP-ASM2NE6500GTTD 500GB (not “enterprise” type)
2x KINGSTON SEDC1000BM8480G “enterprise” nvme.
My biggest concern now is that the Gigabyte nvme shows as 30% wear, after just 3 months. The kingston enterprise ones says they are at 2%
raid command smartctl says I read around 10 TB and I wrote around 70 TB on each nvme.
I had my innodb_buffer_pool_size set to around 1 GB if I remember correctly, I increased it now and forgot previous value
The crawler constantly reads urls to crawl from the table of 30-40 million records on each server, sorts them by last crawl date ASC, reads remote url content, updates url_meta in database (title, description, etc). Updates links found in that url, link titles, etc.
This quickly makes the tables very fragmented and unless I run “optimize table”, they return queries very slowly.
I tried creating a copy of 2-3 most important tables and only update that once a week, and use it for reads, so it remains defragmented.
This is when I noticed the worn SSD, the 2 servers with enterprise Kingston nvme completed (copy +optimize tables) in 3 hours, the Gigabyte one in 9 hours.
A search query in the crawler is returned in 10 seconds using the “live” fragmented tables, vs around 0.2 seconds after tables are optimized/defragmented.
What should I do in order to optimize this and avoid destroying the nvmes ?
I am thinking eider:
- try a hardware setup with HDDs and only use the nvme SSDs for read-only cache. Do I have a chance to run all these queries from HDDs ?
- optimize all the caching options in order to write to disk as rarely as possible. Can I get tips on this please ?
- just use a SSD with more TBW ?
For my second option … I am not familiar with the tuning options at all, which ones should I look into besides innodb_buffer_pool_size ? 32GB out of the 64 GB is a good start for this situation ?
And I seen there are some options that control how often the cached data is “flushed” / written to SSD ? Can I get some info on this please ?
Ideally I would like it to use the ram as much as possible and write very rarely to SSD.
Losing data is not a huge deal but I would lose time while crawling it again.
If I switch to HDDs because of all the write commands, would the 64 GB memory help ? Or will the queries become unusable slow ?
I seen a raid card with flash cache and HDDs is faster than HDDs alone, but raid card flash cache worns just like the SSDs, no ?!
I am kind of lost