MariaDB crashing

It looks like the server is crashing and restarting, but I’m having a problem figuring out why. It looks like I have something in my application that is repeatable to make it crash though, a particular report (this report is probably running some queries that are very long, but I doubt longer than 1GB). In the PHP error log, we see the MySQL server has gone away messages. In the MariaDB error log, I don’t see anything except the startup messages that appear a few seconds after the message in the PHP log. I don’t have any other information at this point to figure out what the cause is. We’re running MariaDB 10.3.28. This is my config file:

[mysqld]
auto_increment_increment=1
auto_increment_offset=1

performance-schema=0
slow-query-log=1
long-query-time=3
max_connections=2048

max_allowed_packet=1G

character-set-server=utf8
collation-server=utf8_general_ci
character-set-server=utf8
collation-server=utf8_general_ci
init-connect="SET NAMES utf8"
query_cache_type=1
query_cache_limit=32M
query_cache_size=512M

key_buffer_size=512M
sort_buffer_size=48M
read_buffer_size=48M
join_buffer_size=48M
read_rnd_buffer_size=48M
table_open_cache=8192
key-buffer-size=512M
tmp_table_size=512MB
myisam_sort_buffer_size=128M
bulk_insert_buffer_size=64M
myisam_max_sort_file_size=2G
myisam_repair_threads=2

tmpdir="/mysqltemp"

connect_timeout=10
wait_timeout=28800
interactive_timeout=1200
innodb_adaptive_hash_index=0

# Buffer pool size
innodb_buffer_pool_size=12G
innodb_flush_method=O_DIRECT
open_files_limit=40000
default-storage-engine=InnoDB
innodb_file_per_table=1

net_read_timeout = 120
net_write_timeout = 900

innodb_lock_wait_timeout=60
sql-mode="ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"

We increased the system RAM from 16GB to 32GB, I changed the InnoDB buffer pool from 12G to 24G, and the report finished. It looks like it was just running out of memory, so is there a way to have it fail more gracefully than crashing and restarting?

2 Likes

Hello @tcsteve,
Check dmesg and /var/log/messages for “OOM” which is your indication that the linux kernel killed mysql because it was using too much memory. This is a safety precaution that has been with linux for quite some time.

Many of your mysql config parameters are oversized. Remove all of these to restore their defaults:

key_buffer_size=512M
sort_buffer_size=48M
read_buffer_size=48M
join_buffer_size=48M
read_rnd_buffer_size=48M

Remove these as well if you are not using MyISAM tables (which you shouldn’t be):

myisam_sort_buffer_size=128M
bulk_insert_buffer_size=64M
myisam_max_sort_file_size=2G
myisam_repair_threads=2

If you had a 12GB buffer pool and only 16GB of RAM, yes, I can absolutely understand OOM happening. With 32GB of RAM, a 24GB buffer pool is a good setting.

Use this query to get a rough estimate of how much memory MySQL could possibly use:

SELECT CONCAT((@@key_buffer_size + @@query_cache_size + (@@innodb_buffer_pool_size * 1.05 + 20*1024*1024) + @@innodb_additional_mem_pool_size + @@innodb_log_buffer_size 
+ @@max_connections * (@@read_buffer_size + @@read_rnd_buffer_size + @@sort_buffer_size + @@join_buffer_size + @@binlog_cache_size + @@tmp_table_size
+ @@thread_stack)) / 1024/1024/1024, ' GB') AS "POTENTIAL MEMORY USAGE";
1 Like

I appreciate it, I’ll look into that. We’re trying to tune a VM, so the proper amount of RAM is one of the things we’re working on. I’ve only seen this behavior on the VM though, so I thought it might have been something else, although the other non-VM servers we have definitely each have a plethora of memory.

It looks like that’s the issue:

Mar 16 12:51:01 sv2 kernel: Out of memory: Kill process 110026 (mysqld) score 874 or sacrifice child
Mar 16 12:51:01 sv2 kernel: Killed process 110026 (mysqld), UID 27, total-vm:15870016kB, anon-rss:14179960kB, file-rss:0kB, shmem-rss:0kB

I’ll look into tuning the config like you suggest.

I’m just glad it didn’t make me sacrifice a child, that would have put a damper on my day.

1 Like

I had to remove the innodb_additional_mem_pool_size from that query, and it’s giving me 1056 GB as potential memory. That seems a touch high. I’ll try a stress test with 24GB instead and see how that goes.

1 Like

We’re doing some stress testing on this VM, which means running a lot of reports at the same time. This VM only has 4 CPUs, and during the test we’ll hit 3x or 4x CPU load. In each test, there are a few reports that end up running a lot longer than the others, even while the load is below capacity. If I use SHOW PROCESSLIST, I see one process that spends about a minute in the “storing results in query cache” state for several queries that it’s running, pretty consistently. I’ll paste the output of InnoDB status below, it looks like the buffer pools are doing fine but I’m not great at reading this, are there any indications of issues? One thing that might be relevant is that a single PHP process which makes a single database connection might end up running 10 different DB-intensive reports over 2 hours, and that all happens on a single connection.

=====================================
2021-03-18 14:35:31 0x7fd34e5bc700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 5 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 11521 srv_active, 0 srv_shutdown, 1905 srv_idle
srv_master_thread log flush and writes: 13426
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 8518
OS WAIT ARRAY INFO: signal count 6302
RW-shared spins 7619, rounds 175470, OS waits 3956
RW-excl spins 1703, rounds 22549, OS waits 485
RW-sx spins 3, rounds 90, OS waits 3
Spin rounds per wait: 23.03 RW-shared, 13.24 RW-excl, 30.00 RW-sx
------------
TRANSACTIONS
------------
Trx id counter 3126702
Purge done for trx's n:o < 3126554 undo n:o < 0 state: running but idle
History list length 53
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 422047161361024, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 422047161356840, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 422047161352656, ACTIVE 30 sec
mysql tables in use 9, locked 0
0 lock struct(s), heap size 1128, 0 row lock(s)
MySQL thread id 73513, OS thread handle 140545528542976, query id 3170535 localhost cutest4_db Storing result in query cache
SELECT `cs`.*, `r`.`rating` AS `user_content_rating`, `asd`.`assign_date` AS `date_assigned`, `cp_cc`.`value` AS `content_cost` FROM (SELECT `cs`.*, `c`.`active` AS `content_active`, `c`.`title` AS `content_title`, `c`.`code` AS `content_code`, `c`.`type` AS `content_type`, `c`.`pass_score` AS `passing_score`, `c`.`pass_fail` AS `pass_fail`, `c`.`track_type` AS `track_type`, `c`.`time_to_complete` AS `time_to_complete`, `u`.`active` AS `user_active`, `u`.`fname` AS `first_name`, `u`.`lname` AS `last_name`, `u`.`username` AS `username`, `u`.`email` AS `email`, `u`.`field1` AS `field1`, `u`.`fie
Trx read view will not see trx with id >= 3126552, sees < 3126552
---TRANSACTION 422047161348472, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 422047161453072, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 422047161490728, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 422047161494912, ACTIVE 0 sec
mysql tables in use 3, locked 0
0 lock struct(s), heap size 1128, 0 row lock(s)
MySQL thread id 55557, OS thread handle 140545522706176, query id 3173380 localhost cutest4_db Sending data
SELECT `id` FROM `users` WHERE `active` = 1 AND `id` IN (            SELECT `uid` FROM `usergroup_users` WHERE `ugid` IN ('83204')
           UNION
           SELECT `uid` FROM `autoassigned_users` WHERE `ugid` IN ('83204') )
Trx read view will not see trx with id >= 3126702, sees < 3126702
---TRANSACTION 422047161423784, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 422047161419600, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 422047161344288, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 422047161335920, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 422047161340104, ACTIVE 0 sec starting index read
mysql tables in use 3, locked 0
0 lock struct(s), heap size 1128, 0 row lock(s)
MySQL thread id 47552, OS thread handle 140545531922176, query id 3173381 localhost cutest4_db Sending data
SELECT `id` FROM `users` WHERE `active` = 1 AND `id` IN (            SELECT `uid` FROM `usergroup_users` WHERE `ugid` IN ('83204')
           UNION
           SELECT `uid` FROM `autoassigned_users` WHERE `ugid` IN ('83204') )
Trx read view will not see trx with id >= 3126702, sees < 3126702
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 0; buffer pool: 0
931800 OS file reads, 175741 OS file writes, 84287 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 7.00 writes/s, 4.40 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 2386, seg size 2388, 6 merges
merged operations:
insert 6, delete mark 0, delete 0
discarded operations:
insert 0, delete mark 0, delete 0
0.00 hash searches/s, 595406.52 non-hash searches/s
---
LOG
---
Log sequence number 81031611028
Log flushed up to   81031611028
Pages flushed up to 81031610041
Last checkpoint at  81031607285
0 pending log flushes, 0 pending chkp writes
50160 log i/o's done, 2.20 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 26344423424
Dictionary memory allocated 619344
Buffer pool size   1572864
Free buffers       641027
Database pages     931837
Old database pages 344132
Modified db pages  9
Percent of dirty pages(LRU & free pages): 0.001
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 1, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 931543, created 294, written 110267
0.00 reads/s, 0.00 creates/s, 3.80 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 931837, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
----------------------
INDIVIDUAL BUFFER POOL INFO
----------------------
---BUFFER POOL 0
Buffer pool size   196608
Free buffers       79487
Database pages     117121
Old database pages 43254
Modified db pages  2
Percent of dirty pages(LRU & free pages): 0.001
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 117112, created 9, written 18214
0.00 reads/s, 0.00 creates/s, 2.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 117121, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 1
Buffer pool size   196608
Free buffers       83619
Database pages     112989
Old database pages 41726
Modified db pages  0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 1, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 112984, created 5, written 4284
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 112989, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 2
Buffer pool size   196608
Free buffers       79925
Database pages     116683
Old database pages 43092
Modified db pages  4
Percent of dirty pages(LRU & free pages): 0.002
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 116683, created 0, written 13014
0.00 reads/s, 0.00 creates/s, 1.20 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 116683, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 3
Buffer pool size   196608
Free buffers       79466
Database pages     117142
Old database pages 43261
Modified db pages  3
Percent of dirty pages(LRU & free pages): 0.002
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 117080, created 62, written 25879
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 117142, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 4
Buffer pool size   196608
Free buffers       78676
Database pages     117932
Old database pages 43553
Modified db pages  0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 117887, created 45, written 16458
0.00 reads/s, 0.00 creates/s, 0.20 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 117932, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 5
Buffer pool size   196608
Free buffers       79731
Database pages     116877
Old database pages 43163
Modified db pages  0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 116866, created 11, written 7393
0.00 reads/s, 0.00 creates/s, 0.20 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 116877, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 6
Buffer pool size   196608
Free buffers       79784
Database pages     116824
Old database pages 43144
Modified db pages  0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 116752, created 72, written 9105
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 116824, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 7
Buffer pool size   196608
Free buffers       80339
Database pages     116269
Old database pages 42939
Modified db pages  0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 116179, created 90, written 15920
0.00 reads/s, 0.00 creates/s, 0.20 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 116269, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
3 read views open inside InnoDB
Process ID=66438, Main thread ID=140545687009024, state: sleeping
Number of rows inserted 3691, updated 44314, deleted 16, read 35710780360
0.00 inserts/s, 1.80 updates/s, 0.00 deletes/s, 399721.26 reads/s
Number of system rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
1 Like

The results for specific reports are a little different between tests also. In the first test, a particular report took 19m 46s to finish, and in the current test the same report, which is also the one hung in the query cache state, has been going for 1h 14m and still going. Some of the reports which took a long time in the first test finished quicker in the second one, but in both I ended up with 3 or 4 reports running long after everything else had finished.

1 Like

It looks like query cache size is set to 512MB.

1 Like

I’ve killed the query cache and started another round of testing.

1 Like

The memory usage SQL I gave you calculates potential maximum based on max_connections. So, if you lower max_connections, the returned value will drop. Some mysql memory buffers are allocated globally, and some are allocated per-connection.

How are your queries? Have you ran EXPLAIN on them? Are they well indexed?

1 Like

Disabling the cache completely lead to generally longer times for most or all of the 46 reports in the batch. We eventually settled on 128MB cache, with a the number of simultaneous reports capped to not overload the CPUs.

The problem report was still a problem in the final run. In the first run, that report only took 19 minutes, and it took over 2 hours in the other runs, including the last. In the last one, when that final report started, the query cache had about 160k low memory prunes for the other 45 reports total, and after that final report finished the prunes were up almost 600k, with 1.4 million cache hits.

In general, the queries using indexing pretty well, but there is so much variation in the number of reports and the different options people use in them. I’m going to investigate that problem report to see what it’s running though.

In total, the InnoDB status page tells me that a single report queue run without the cache required 25 billion record reads, and with the cache it was 21.1 billion, so it’s making a difference.

Yeah, we do have a lot of max connections. The resources are set high to support both a lot of simultaneous users doing hardly any work each, with a small number of people running reports, but I need the same resource limits for all of them. But now that we’ve got a cap on the maximum number of tasks running reports at one time (right now it’s 8), that should help with resource usage.

1 Like

Just back to the original post, after doubling the RAM and InnoDB pool size, the database never crashed. We’ve been hammering it pretty hard, our biggest client has performed 71 billion reads since MySQL was restarted (maybe a week, maybe 2?), and each batch of our 5 tests was over 20 billion reads in about 2 and a half hours, so I’m mindful that we might be over-optimizing this.

1 Like

Sounds to me like your queries are not properly indexed. Run EXPLAIN on some of your queries and paste the result here.

1 Like