Version 1.1.0
2024 eomanis
PGP signature
How to create an encrypted software RAID 6 on Linux
For cold storage (lots of data that, once written, does not change much, but may be read now and then)
A guide that gets shit done, along with some dos and don'ts with whys and whyn'ts
There are two steps that take a long time to complete – more than one day when using a similar number of disks of comparable capacity – during which the PC needs to be kept running; those steps are marked with an asterisk (*)
Here are our 6 20TB disks that will be assembled to a RAID 6, on top of which we'll put a dm-crypt volume, on top of which we'll create an XFS file system that is aligned to the RAID geometry for optimal performance:
[root@the-server ~]# ls -l /dev/disk/by-id | grep ata-TOSHIBA_
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXUXXXX -> ../../sde
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXOXXXX -> ../../sdb
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXFXXXX -> ../../sdf
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXAXXXX -> ../../sdc
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXIXXXX -> ../../sda
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXX6XXXX -> ../../sdd
We create a RAID 6 over the 6 disks, which yields the effective size of 4 disks with the remaining space of 2 disks being used for parity/recovery information
Important: Concerning the space used by the RAID on each member disk, we must make sure to stay below 20TB, so that we can be sure that any "20TB" disk can be used as a replacement for a broken one
This is necessary because the exact capacity of a hard disk model may vary slightly above its advertised size. If a replacement disk is of a different model that is just a tiny bit smaller than these ones, it cannot be added to an array that was created using all available disk space, and that would just ruin our day
Create the RAID on the raw, unpartitioned disks
Some guides recommend partitioning the disks for the sole reason of restricting the space that the RAID uses on each member disk, but that is bad practice; mdadm can restrict the used space directly with its --size= parameter, no partitions needed
Intention: We only want to use the first 19999GB of each disk, so that on a disk that is exactly 20TB a single GB remains unused at its end
But the --size= parameter assumes binary prefixes (KiB, MiB, GiB…), not decimal prefixes (kB, MB, GB…)
So, we need to convert those 19999GB to GiB. Full calculation as described:
((20 * 1000 * 1000 * 1000 * 1000) - (1 * 1000 * 1000 * 1000)) / 1024 / 1024 / 1024
((20TB in bytes) - (1GB in bytes)) -> KiB -> MiB -> GiB
= 18625GiB (rounded down to whole GiB)
Then again, who cares about another unused GiB per disk; better safe than sorry, so here we use 18624GiB
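The conversion is easy to double-check with shell arithmetic (a bash sketch; integer division rounds down for us):

```shell
# 20TB minus 1GB, in bytes, converted to whole GiB (integer division rounds down)
echo $(( (20 * 10**12 - 10**9) / 1024 / 1024 / 1024 ))   # prints 18625
```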
[root@the-server ~]# mdadm --create /dev/md/raid --verbose --homehost=the-server --name=raid --raid-devices=6 --size=18624G --level=6 --bitmap=internal /dev/sd[abcdef]
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/raid started.
In retrospect the order of disks could have been given according to ascending serial number
[root@the-server ~]# mdadm --detail /dev/md/raid
/dev/md/raid:
           Version : 1.2
     Creation Time : Fri Feb 21 11:04:42 2024
        Raid Level : raid6
        Array Size : 78114717696 (72.75 TiB 79.99 TB)
     Used Dev Size : 19528679424 (18.19 TiB 20.00 TB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent
     Intent Bitmap : Internal
       Update Time : Fri Feb 21 11:06:14 2024
             State : clean, resyncing
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0
            Layout : left-symmetric
        Chunk Size : 512K
Consistency Policy : bitmap
     Resync Status : 0% complete
              Name : the-server:raid
              UUID : 8b76d832:6ca64e3a:af5bc01e:1471551a
            Events : 20

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       80        5      active sync   /dev/sdf
For the subsequent steps the chunk size of 512KiB is of interest
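As a quick plausibility check, the reported Array Size follows from the per-disk --size= and the 4 effective disks; mdadm reports these sizes in 1KiB blocks:

```shell
per_disk_gib=18624   # the --size= we passed to mdadm --create
data_disks=4         # 6 disks in RAID 6 = 4 disks' worth of effective data
# Array Size in 1 KiB blocks, as printed by mdadm --detail
echo $(( per_disk_gib * data_disks * 1024 * 1024 ))   # prints 78114717696
```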
Check the sync progress by fetching the RAID status
[root@the-server ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
      78114717696 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  resync =  0.2% (54722848/19528679424) finish=1603.2min speed=202445K/sec
      bitmap: 146/146 pages [584KB], 65536KB chunk

unused devices: <none>
Here we can also see that the actual RAID block device is /dev/md127; we need this later
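The numbers in that progress line are consistent with each other; dividing the remaining 1KiB blocks by the current speed reproduces the finish estimate:

```shell
remaining_kib=$(( 19528679424 - 54722848 ))   # blocks (1 KiB) still to resync
echo $(( remaining_kib / 202445 / 60 ))       # minutes at 202445 K/sec; prints 1603
```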
We use dm-crypt for encryption and create a LUKS2 volume
This particular dm-crypt volume will be unlocked automatically on startup with the crypttab mechanism using a key file that we will create and add a slot for later
But we also want to be able to unlock the volume on its own in an emergency with a regular passphrase, and we create it with this passphrase now
Use key slot 1 for the passphrase because key slot 0 will be used for the key file, so that in the regular use case (automatic unlocking at system startup) unlocking is quicker because slot 0 will be tried first
Now this here is the first step where so-called "stripe alignment" must be considered
Each RAID member disk is sectioned into chunks of 512KiB; this is the "chunk size" we have seen above in the detailed RAID view
The RAID logic (parity information calculation and data recovery) operates on those chunks, always using 6 chunks, one from each member disk, from the same chunk index
This being a RAID 6, for the same chunk index 4 chunks contain effective data and the remaining 2 chunks hold the parity information that enables the RAID logic to recover any two lost chunks in case one or two disks fail
A RAID stripe is the effective data that can be stored on such a group of 6 chunks
Now, consider: Altering some small amount of data in a single chunk invalidates the parity information of the two parity chunks in the same stripe, which must then be recalculated and updated on-disk, and on top of that, for parity recalculation the RAID logic needs the data from the other 3 data chunks, which it therefore has to read…
If you think "that sounds like it could be slow" you'd be right. This is why we want to only write whole stripes to the RAID if at all possible: for a whole stripe, the parity chunks can be calculated up front in memory without having to read anything from the RAID. Arranging writes this way is called "stripe alignment"
So, stripe alignment: Stripe width is (512KiB chunk size * 4 effective disks) = 2MiB
The dm-crypt LUKS2 default data segment offset is 16MiB, which is an exact multiple of 2MiB, so we are good as-is and do not need to use --offset to move the data segment to the next whole-stripe boundary beyond 16MiB
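That divisibility claim is easy to verify with shell arithmetic:

```shell
chunk_kib=512                              # RAID chunk size
data_disks=4                               # effective disks in our 6-disk RAID 6
stripe_kib=$(( chunk_kib * data_disks ))   # stripe width: 2048 KiB = 2 MiB
offset_kib=$(( 16 * 1024 ))                # LUKS2 default data segment offset
echo $(( offset_kib % stripe_kib ))        # prints 0, i.e. stripe-aligned
```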
[root@the-server ~]# cryptsetup luksFormat --type luks2 --verify-passphrase --key-slot 1 --label srv.raid-80tb.encrypted /dev/md127
WARNING!
========
This will overwrite data on /dev/md127 irrevocably.

Are you sure? (Type 'yes' in capital letters): YES
Enter passphrase for /dev/md127:
Verify passphrase:
To give an example where we are not so lucky, let's say we had created the RAID 6 with one more disk, so 7 disks altogether, which yields the effective space of 5 disks
That would mean a stripe width of (512KiB chunk size * 5 effective disks) = 2560KiB = 2.5MiB, which 16MiB is not an exact multiple of: 16MiB / (512KiB * 5) = 6.4
Since we want the offset to be at least the default 16MiB our target offset for the data section would be at 7 times the stripe width, i.e. 7 * (512KiB * 5 effective disks) = 17920KiB
The --offset argument requires the offset to be supplied as number of 512B sectors, so we need to convert these KiB to sectors:
(17920 * 1024) / 512   (KiB -> B -> s)   = 35840s
Accordingly, we would add this argument to the command line: --offset 35840
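The whole hypothetical 7-disk calculation can be scripted, rounding 16MiB up to the next whole-stripe boundary:

```shell
chunk_kib=512
data_disks=5                               # hypothetical 7-disk RAID 6
stripe_kib=$(( chunk_kib * data_disks ))   # 2560 KiB
min_kib=$(( 16 * 1024 ))                   # offset must be at least the default 16 MiB
# round up to the next multiple of the stripe width
stripes=$(( (min_kib + stripe_kib - 1) / stripe_kib ))   # 7 stripes
offset_kib=$(( stripes * stripe_kib ))                   # 17920 KiB
echo $(( offset_kib * 1024 / 512 ))        # in 512B sectors; prints 35840
```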
[root@the-server ~]# cryptsetup luksDump /dev/md127
LUKS header information
Version:       	2
Epoch:         	3
Metadata area: 	16384 [bytes]
Keyslots area: 	16744448 [bytes]
UUID:          	3d1b3de7-cc0c-4fe4-81e2-270d38966ef7
Label:         	srv.raid-80tb.encrypted
Subsystem:     	(no subsystem)
Flags:       	(no flags)

Data segments:
  0: crypt
	offset: 16777216 [bytes]
	length: (whole device)
	cipher: aes-xts-plain64
	sector: 4096 [bytes]

Keyslots:
  1: luks2
	Key:        512 bits
	Priority:   normal
	Cipher:     aes-xts-plain64
	Cipher key: 512 bits
	PBKDF:      argon2id
	Time cost:  12
	Memory:     1048576
	Threads:    4
	Salt:       dc 6b 2b 54 94 63 3a e7 1b f1 c4 c3 5e 43 00 f6
	            fc 54 75 da f6 ba 7a 13 3e bb 72 b1 1d 7c 60 ba
	AF stripes: 4000
	AF hash:    sha256
	Area offset:32768 [bytes]
	Area length:258048 [bytes]
	Digest ID:  0
Tokens:
Digests:
  0: pbkdf2
	Hash:       sha256
	Iterations: 332998
	Salt:       ae 91 73 42 d5 d6 ed b7 83 d5 f2 43 3b 18 04 87
	            e2 40 26 23 80 e7 ae 7f a3 4f 20 d8 19 1c ab 9d
	Digest:     4f e6 a3 83 40 7a d4 65 24 84 dc 69 e4 f3 43 a7
	            c2 2e 28 ee e2 94 7b 9d 4d b8 4e 96 14 aa 46 6a
The data segment offset is 16777216 bytes, which is indeed 16MiB:
16777216B / 1024 / 1024 bytes -> KiB -> MiB = 16MiB
[root@the-server ~]# cryptsetup open /dev/md127 srv.raid-80tb
Enter passphrase for /dev/md127:
Before we start using it for real we are going to fill the whole RAID with encrypted zeros, and for that we want to ensure that the RAID can write with its maximum possible speed
We do this by increasing the size of the stripe cache, because its default size usually bottlenecks the RAID write speed
The stripe cache is a reserved area in RAM where writes to a specific RAID are collected before they are written to the disks, with the goal of accumulating whole stripes that can then be written faster
It is set to a default size of 256 when the RAID is assembled, and can be changed while the RAID is online
…Holup, 256 what? Chunks, stripes, stripes-with-parity, MiB, bananas? How much RAM is that? Unfortunately official documentation seems scarce, but hearsay has it it's (size * memory page size * total RAID disk count), so with the usual page size of 4KiB that would be (256 * 4KiB * 6) = 6144KiB = 6MiB, which does seem small indeed
We'll crank it to 8192 which according to that formula will use (8192 * 4KiB * 6) = 196608KiB = 192MiB of memory
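According to that (admittedly unofficial) formula, the default and our chosen value work out to:

```shell
page_kib=4   # usual memory page size in KiB
disks=6      # total RAID member disk count
for size in 256 8192; do
    echo "stripe_cache_size=$size -> $(( size * page_kib * disks )) KiB"
done
```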
[root@the-server ~]# echo 8192 > /sys/class/block/md127/md/stripe_cache_size
Important: This setting is not persisted into the RAID configuration and must be set again each time after the RAID has been assembled
Typically you write a udev rule that does this and is triggered after the RAID has been assembled
Example: Text file "/etc/udev/rules.d/60-md-stripe-cache-size.rules"
# Set the RAID stripe cache size to 8192 for any RAID that is assembled on this system
SUBSYSTEM=="block", KERNEL=="md*", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}!="8192", ATTR{md/stripe_cache_size}="8192"
Now we fill the opened dm-crypt volume with zeros. This causes the underlying RAID device to be filled with what looks like random data (the encrypted zeros), disguising how much space is actually used
[root@the-server ~]# dd if=/dev/zero iflag=fullblock of=/dev/mapper/srv.raid-80tb oflag=direct bs=128M status=progress
79989336702976 bytes (80 TB, 73 TiB) copied, 159750 s, 501 MB/s
dd: error writing '/dev/mapper/srv.raid-80tb': No space left on device
595968+0 records in
595967+0 records out
79989454143488 bytes (80 TB, 73 TiB) copied, 159751 s, 501 MB/s
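The final byte count reported by dd checks out: it is the RAID's size minus the 16MiB LUKS2 data segment offset:

```shell
array_bytes=$(( 78114717696 * 1024 ))   # Array Size from mdadm --detail is in 1 KiB blocks
luks_offset=$(( 16 * 1024 * 1024 ))     # LUKS2 data segment offset in bytes
echo $(( array_bytes - luks_offset ))   # prints 79989454143488, as reported by dd
```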
XFS (and other file systems too) can apply optimizations for underlying RAID chunks and stripes if they know about them, which is what we want
Fortunately for us, mkfs.xfs correctly detects the RAID's chunk size and stripe width automatically, even "through" the dm-crypt layer
[root@the-server ~]# mkfs.xfs -L raid-80tb -m bigtime=1,rmapbt=1 /dev/mapper/srv.raid-80tb
meta-data=/dev/mapper/srv.raid-80tb isize=512    agcount=73, agsize=268435328 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=19528675328, imaxpct=1
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
In the output, review the "data" section:
Does that match the RAID geometry, which is required for good performance?
The XFS "stripe unit" (sunit) is the RAID's "chunk size" as it is listed in the detailed RAID information (512KiB); they must be the same
Here it is expressed in blocks, so (128 * 4KiB block size) = 512KiB, same as the RAID chunk size, that tracks
As mentioned above, the stripe width is the stripe unit (RAID chunk size) multiplied by 4 effective disks, (512KiB * 4) = 2MiB
The mkfs.xfs output lists the stripe width (swidth) as 512 blocks, which is (512 * 4KiB block size) = 2MiB, so yes, we are good
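Both conversions in one place, from XFS blocks (4KiB each) to the RAID geometry:

```shell
block_kib=4                          # XFS block size: 4096 bytes
echo $(( 128 * block_kib ))          # sunit:  prints 512 (KiB), the RAID chunk size
echo $(( 512 * block_kib / 1024 ))   # swidth: prints 2 (MiB), the stripe width
```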
This does not require any fancy business with mount options, the defaults work fine
For example, to mount the XFS file system at "/mnt/raid-80tb" we'd do this:
[root@the-server ~]# mount /dev/mapper/srv.raid-80tb /mnt/raid-80tb
We can see the stripe unit (sunit) and stripe width (swidth) in the mount options when we look at the mounted file system:
[root@the-server ~]# mount | grep /mnt/raid-80tb
/dev/mapper/srv.raid-80tb on /mnt/raid-80tb type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=4096,noquota)
This time though they are expressed as a number of 512-byte blocks; this is documented in the "MOUNT OPTIONS" section of the xfs(5) manual page
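Converting those mount-option values from 512-byte units confirms they match the RAID geometry:

```shell
sector_b=512
echo $(( 1024 * sector_b / 1024 ))          # sunit:  prints 512 (KiB) = chunk size
echo $(( 4096 * sector_b / 1024 / 1024 ))   # swidth: prints 2 (MiB) = stripe width
```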