Percona Server and Full Disk Encryption

Hi All,

I’ve hit a bit of a wall while benchmarking, and the results aren’t making much sense to me, so I’m hoping someone here can help.

My company is currently looking at the possibility of moving its primary databases into a full-disk encryption environment (likely AES-256 LUKS), and I’m trying to benchmark FDE vs. non-FDE to see what kind of performance hit we’ll be looking at. I was expecting some manner of hit, but what I’ve actually found is a hit so large as to be absurd, and I don’t know how to explain it.

I’m using Percona Server 5.5.27-rel28.0-291.squeeze on Debian Linux 2.6.32-5-amd64, on two identical hosts (somewhat beefy Dell servers: dual Xeon E5620, 24GB RAM, SAS RAID-1 system block device, and SAS RAID-10 data block device). The systems are configured entirely the same, with the sole difference being the use of FDE on the block devices:

On the encrypted server:

/dev/sda holds the primary root partition (RAID-1), the MySQL binary log files (in /var/lib/mysqllogs), etc. This is all within an EXT3 filesystem on top of LVM, which in turn sits inside an FDE block device.

/dev/sdc is the primary data block device (mounted at /var/lib/mysql). It is also formatted with an EXT3 filesystem on top of LVM, which in turn sits inside an FDE block device.

On the non-encrypted server:

/dev/sda holds the primary root partition (RAID-1), the MySQL binary log files (in /var/lib/mysqllogs), etc. This is all within an EXT3 filesystem on top of LVM (no FDE).

/dev/sdc is the primary data block device (mounted at /var/lib/mysql). It is also formatted with an EXT3 filesystem on top of LVM (no FDE).

As you can see, the only difference, configuration-wise, is that the encrypted server has an FDE block device between LVM and the RAID device (the hardware RAID device on these machines is a virtual drive provided by the PERC6/i controller), whereas the unencrypted server has LVM sitting directly on the RAID device.
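For completeness, the encrypted stack is assembled along these lines (a minimal sketch: /dev/sdc is real, but the mapping/volume names are placeholders, and the cipher flags shown are just one way to get AES-256 and may not match our setup exactly):

[CODE]
# LUKS directly on the PERC6/i virtual drive
# (a 512-bit XTS key gives AES-256):
cryptsetup luksFormat --cipher aes-xts-plain64 --key-size 512 /dev/sdc
cryptsetup luksOpen /dev/sdc data_crypt

# LVM inside the opened dm-crypt mapping:
pvcreate /dev/mapper/data_crypt
vgcreate vg_data /dev/mapper/data_crypt
lvcreate --name lv_mysql --extents 100%FREE vg_data
[/CODE]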

MySQL’s configuration is identical across the hosts, as are the filesystem creation and mount options:

[LIST]
[*]RAID strip size: default as per the PERC6/i, 64kB
[*]Filesystem creation options: -t ext3 -b 4096 -E stride=16 -E stripe-width=32 -j -O large_file (see the assembled commands below)
[*]Filesystem mount options: defaults,relatime,data=journal,barrier=0
[/LIST]
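For concreteness, those options assemble into the following commands (the LV device path here is a placeholder, but the options are exactly those listed above):

[CODE]
# Create the data filesystem (placeholder LV path):
mkfs -t ext3 -b 4096 -E stride=16 -E stripe-width=32 -j -O large_file /dev/vg_data/lv_mysql

# Corresponding /etc/fstab entry:
/dev/vg_data/lv_mysql  /var/lib/mysql  ext3  defaults,relatime,data=journal,barrier=0  0  2
[/CODE]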

To benchmark the systems, I loaded a copy of production-like data, and then used mysqlslap to hit each server, with fairly low concurrency (30) and 5 iterations (though in further benchmarking I’m lowering to 3 iterations to save time).
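In rough terms, the load and benchmark steps look like this (the dump filename, schema name, and query file are placeholders for the production-like workload):

[CODE]
# Load the production-like data (placeholder dump name):
zcat dump.sql.gz | time mysql --host=foo

# Hit the server with low concurrency over 5 iterations
# (placeholder schema and query file):
mysqlslap --host=foo --concurrency=30 --iterations=5 \
          --create-schema=bench --query=workload.sql
[/CODE]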

Note that because EXT3 is mounted with data=journal, MySQL’s O_DIRECT flush method isn’t available, so both hosts are using the O_DSYNC flush method.
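In my.cnf terms, both hosts are effectively running with:

[CODE]
[mysqld]
# O_DIRECT is not available on ext3 with data=journal:
innodb_flush_method = O_DSYNC
[/CODE]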

Now, what I am seeing is that:
[LIST]
[*]Doing the initial data load on each system (zcat .sql.gz | time mysql --host=foo) took 1h45m on the unencrypted server and 2h35m on the encrypted server.
[*]The average for the queries that I’m throwing at the unencrypted server is 2494 seconds per iteration (so a total of 2494 * 5 = 12,470 seconds for the run of 5 iterations, or ~3.5 hours).
[*]I aborted the benchmark run on the encrypted server before it completed, but only after it had been running for 7h29m48s (!!!).
[*]On the encrypted server, in SHOW PROCESSLIST, I can see that a lot (nearly all) of the write queries are in state “query end”.
[*]The above was a run with innodb_flush_log_at_trx_commit = 1.
[*]After some research, I am rerunning the benchmark with innodb_flush_log_at_trx_commit = 2 (see the config sketch after this list) in the hope that it will help; but:
[*]While the unencrypted server shows the load of being benchmarked (i.e. 50-60% CPU use, large amounts of data read from and written to disk, CPU I/O wait of ~2%), the encrypted server shows almost no load (~5-10% userspace CPU usage, ~5% CPU I/O wait, and very slow reading/writing from disk). This was the case with innodb_flush_log_at_trx_commit = 1 as well (and I think that innodb_flush_log_at_trx_commit = 2 has helped increase/even out the load on the encrypted database server), but it still isn’t anywhere near equivalent; just from the load numbers, I can tell that the encrypted server won’t finish this benchmark in anything close to an adequate time.
[*]Just for reference, the encrypted database server still shows many (… all?) writes stalling in state “query end” for significant amounts of time.
[/LIST]
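The config sketch mentioned in the list: the rerun differs from the first run only in this one setting.

[CODE]
[mysqld]
# 1 = flush and sync the InnoDB log on every commit (fsync-heavy);
# 2 = write on commit, sync roughly once per second:
innodb_flush_log_at_trx_commit = 2
[/CODE]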

There is clearly a massive bottleneck between MySQL and the disk on the encrypted system, but I don’t know where it is (and, more importantly, whether it can be worked around). Any help that can be given would be greatly appreciated!
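For what it’s worth, my next step is to take MySQL out of the picture entirely and compare raw synchronous write throughput through the full stack on both hosts, along these lines (a sketch; the test file path and sizes are arbitrary):

[CODE]
# Buffered sequential write:
dd if=/dev/zero of=/var/lib/mysql/ddtest bs=16k count=65536

# Synchronous writes, roughly comparable to InnoDB's O_DSYNC behaviour:
dd if=/dev/zero of=/var/lib/mysql/ddtest bs=16k count=4096 oflag=dsync

rm -f /var/lib/mysql/ddtest
[/CODE]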

Thanks,

Remi

As an example of the loads that I see on the hosts whilst benchmarking them, here is dstat output from the unencrypted database server during a benchmark run:

[CODE]
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 32   9  52   4   0   4|1712k   63M|6851k   58M|   0     0 | 82k  156k
 29  10  53   5   0   3|4956k   64M|7737k   63M|   0     0 | 81k  160k
 35   8  49   5   0   3|4800k   60M|7089k   59M|   0     0 | 77k  142k
 39   7  46   5   0   3|4308k   61M|6236k   59M|   0     0 | 74k  133k
 30   9  54   4   0   3|1692k   67M|6964k   58M|   0     0 | 80k  162k
 30   9  53   5   0   3|4248k   68M|7757k   52M|   0     0 | 79k  159k
 28   8  57   4   0   3|1712k   62M|7550k   57M|   0     0 | 74k  140k
 29   9  55   4   0   3|2072k   66M|6852k   58M|   0     0 | 79k  170k
[/CODE]

… and here is the dstat output from the encrypted database server during the same benchmark run:

[CODE]
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  2   1  93   4   0   0|  68k 1888k| 394k 2162k|   0     0 |4838  9689
  2   1  94   4   0   0|  72k 1840k| 367k 2071k|   0     0 |4634  9481
  2   1  94   4   0   0|  60k 1740k| 350k 1885k|   0     0 |4533  8730
  2   1  93   4   0   0|  64k 1928k| 315k 1705k|   0     0 |4499  9516
  2   1  91   7   0   0|  68k 2696k| 344k 1865k|   0     0 |4594  9086
  2   1  94   3   0   0|  68k 2072k| 349k 1792k|   0     0 |4860  9656
[/CODE]
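(For reference, the output above is from a plain dstat run; by default dstat shows these columns with 1-second samples, i.e. roughly:)

[CODE]
# cpu, disk, net, paging, system columns, sampled every second:
dstat -cdngy 1
[/CODE]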