Version 1.1.0
2024 eomanis
PGP signature
How to create an encrypted software RAID 6 on Linux
For cold storage (lots of data that, once written, does not change much, but may be read now and then)
A guide that gets shit done, along with some dos and don'ts with whys and whyn'ts
There are two steps that take a long time to complete – more than one day when using a similar number of disks of comparable capacity – during which the PC needs to be kept running; those steps are marked with an asterisk (*)
Here are our 6 20TB disks that will be assembled to a RAID 6, on top of which we'll put a dm-crypt volume, on top of which we'll create an XFS file system that is aligned to the RAID geometry for optimal performance:
[root@the-server ~]# ls -l /dev/disk/by-id | grep ata-TOSHIBA_
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXUXXXX -> ../../sde
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXOXXXX -> ../../sdb
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXFXXXX -> ../../sdf
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXAXXXX -> ../../sdc
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXXIXXXX -> ../../sda
lrwxrwxrwx 1 root root 9 21. Feb 10:24 ata-TOSHIBA_MG10ACA20TE_XXXXXXX6XXXX -> ../../sdd
We create a RAID 6 over the 6 disks, which yields the effective size of 4 disks with the remaining space of 2 disks being used for parity/recovery information
Important: Concerning the space used by the RAID on each member disk, we must make sure to stay below 20TB, so that we can be sure that any "20TB" disk can be used as a replacement for a broken one
This is necessary because the exact capacity of a hard disk model may vary slightly above its advertised size. If a replacement disk is of a different model that is just a tiny bit smaller than these ones, it cannot be added to an array that was created using all available disk space, and that would just ruin our day
Create the RAID on the raw, unpartitioned disks
Some guides recommend partitioning the disks for the sole reason of restricting the space that the RAID uses on each member disk, but that is bad practice; mdadm can restrict the used space directly with its --size= parameter, no partitions needed
Intention: We only want to use the first 19999GB of each disk, so that on a disk that is exactly 20TB a single GB remains unused at its end
But the --size= parameter assumes binary prefixes (KiB, MiB, GiB…), not decimal prefixes (kB, MB, GB…)
So, we need to convert those 19999GB to GiB. Full calculation as described:
((20 * 1000 * 1000 * 1000 * 1000) - (1 * 1000 * 1000 * 1000)) / 1024 / 1024 / 1024
((20TB in bytes) - (1GB in bytes)) -> KiB -> MiB -> GiB
= 18625GiB (rounded down to whole GiB)
Then again, who cares about another unused GiB per disk; better safe than sorry, so here we use 18624GiB
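The conversion is easy to double-check with shell arithmetic (a bash sketch; integer division rounds down for us):

```shell
# 20TB minus 1GB, in bytes, converted to whole GiB (integer division rounds down)
echo $(( (20 * 10**12 - 10**9) / 1024 / 1024 / 1024 ))   # prints 18625
```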
[root@the-server ~]# mdadm --create /dev/md/raid --verbose --homehost=the-server --name=raid --raid-devices=6 --size=18624G --level=6 --bitmap=internal /dev/sd[abcdef]
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/raid started.
In retrospect the order of disks could have been given according to ascending serial number
[root@the-server ~]# mdadm --detail /dev/md/raid
/dev/md/raid:
           Version : 1.2
     Creation Time : Fri Feb 21 11:04:42 2024
        Raid Level : raid6
        Array Size : 78114717696 (72.75 TiB 79.99 TB)
     Used Dev Size : 19528679424 (18.19 TiB 20.00 TB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent
     Intent Bitmap : Internal
       Update Time : Fri Feb 21 11:06:14 2024
             State : clean, resyncing
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0
            Layout : left-symmetric
        Chunk Size : 512K
Consistency Policy : bitmap
     Resync Status : 0% complete
              Name : the-server:raid
              UUID : 8b76d832:6ca64e3a:af5bc01e:1471551a
            Events : 20

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       80        5      active sync   /dev/sdf
For the subsequent steps the chunk size of 512KiB is of interest
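As a quick plausibility check, the reported Array Size follows from the per-disk --size= and the 4 effective disks; mdadm reports these sizes in 1KiB blocks:

```shell
per_disk_gib=18624   # the --size= we passed to mdadm --create
data_disks=4         # 6 disks in RAID 6 = 4 disks' worth of effective data
# Array Size in 1 KiB blocks, as printed by mdadm --detail
echo $(( per_disk_gib * data_disks * 1024 * 1024 ))   # prints 78114717696
```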
Check the sync progress by fetching the RAID status
[root@the-server ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
      78114717696 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  resync =  0.2% (54722848/19528679424) finish=1603.2min speed=202445K/sec
      bitmap: 146/146 pages [584KB], 65536KB chunk

unused devices: <none>
Here we can also see that the actual RAID block device is /dev/md127; we need this later
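The numbers in that progress line are consistent with each other; dividing the remaining 1KiB blocks by the current speed reproduces the finish estimate:

```shell
remaining_kib=$(( 19528679424 - 54722848 ))   # blocks (1 KiB) still to resync
echo $(( remaining_kib / 202445 / 60 ))       # minutes at 202445 K/sec; prints 1603
```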
We use dm-crypt for encryption and create a LUKS2 volume
This particular dm-crypt volume will be unlocked automatically on startup with the crypttab mechanism using a key file that we will create and add a slot for later
But we also want to be able to unlock the volume on its own in an emergency with a regular passphrase, and we create it with this passphrase now
Use key slot 1 for the passphrase because key slot 0 will be used for the key file, so that in the regular use case (automatic unlocking at system startup) unlocking is quicker because slot 0 will be tried first
Now this here is the first step where so-called "stripe alignment" must be considered
Each RAID member disk is sectioned into chunks of 512KiB; this is the "chunk size" we have seen above in the detailed RAID view
The RAID logic (parity information calculation and data recovery) operates on those chunks, always using 6 chunks, one from each member disk, from the same chunk index
This being a RAID 6, for the same chunk index 4 chunks contain effective data and the remaining 2 chunks hold the parity information that enables the RAID logic to recover any two lost chunks in case one or two disks fail
A RAID stripe is the effective data that can be stored on such a group of 6 chunks
Now, consider: Altering some small amount of data in a single chunk invalidates the parity information of the two parity chunks in the same stripe, which must then be recalculated and updated on-disk, and on top of that, for parity recalculation the RAID logic needs the data from the other 3 data chunks, which it therefore has to read…
If you think "that sounds like it could be slow" you'd be right. This is why we want to only write whole stripes to the RAID if at all possible: for a whole stripe, the parity chunks can be calculated up front in memory without having to read anything from the RAID. Arranging writes this way is called "stripe alignment"
So, stripe alignment: Stripe width is (512KiB chunk size * 4 effective disks) = 2MiB
The dm-crypt LUKS2 default data segment offset is 16MiB, which is an exact multiple of 2MiB, so we are good as-is and do not need to use --offset to move the data segment to the next whole-stripe boundary beyond 16MiB
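That divisibility claim is easy to verify with shell arithmetic:

```shell
chunk_kib=512                              # RAID chunk size
data_disks=4                               # effective disks in our 6-disk RAID 6
stripe_kib=$(( chunk_kib * data_disks ))   # stripe width: 2048 KiB = 2 MiB
offset_kib=$(( 16 * 1024 ))                # LUKS2 default data segment offset
echo $(( offset_kib % stripe_kib ))        # prints 0, i.e. stripe-aligned
```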
[root@the-server ~]# cryptsetup luksFormat --type luks2 --verify-passphrase --key-slot 1 --label srv.raid-80tb.encrypted /dev/md127
WARNING!
========
This will overwrite data on /dev/md127 irrevocably.

Are you sure? (Type 'yes' in capital letters): YES
Enter passphrase for /dev/md127:
Verify passphrase:
To give an example where we are not so lucky, let's say we had created the RAID 6 with one more disk, so 7 disks altogether, which yields the effective space of 5 disks
That would mean a stripe width of (512KiB chunk size * 5 effective disks) = 2560KiB = 2.5MiB, which 16MiB is not an exact multiple of: 16MiB / (512KiB * 5) = 6.4
Since we want the offset to be at least the default 16MiB our target offset for the data section would be at 7 times the stripe width, i.e. 7 * (512KiB * 5 effective disks) = 17920KiB
The --offset argument requires the offset to be supplied as number of 512B sectors, so we need to convert these KiB to sectors:
(17920 * 1024) / 512   (KiB -> B -> s)   = 35840s
Accordingly, we would add this argument to the command line: --offset 35840
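The whole hypothetical 7-disk calculation can be scripted, rounding 16MiB up to the next whole-stripe boundary:

```shell
chunk_kib=512
data_disks=5                               # hypothetical 7-disk RAID 6
stripe_kib=$(( chunk_kib * data_disks ))   # 2560 KiB
min_kib=$(( 16 * 1024 ))                   # offset must be at least the default 16 MiB
# round up to the next multiple of the stripe width
stripes=$(( (min_kib + stripe_kib - 1) / stripe_kib ))   # 7 stripes
offset_kib=$(( stripes * stripe_kib ))                   # 17920 KiB
echo $(( offset_kib * 1024 / 512 ))        # in 512B sectors; prints 35840
```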
[root@the-server ~]# cryptsetup luksDump /dev/md127
LUKS header information
Version:       	2
Epoch:         	3
Metadata area: 	16384 [bytes]
Keyslots area: 	16744448 [bytes]
UUID:          	3d1b3de7-cc0c-4fe4-81e2-270d38966ef7
Label:         	srv.raid-80tb.encrypted
Subsystem:     	(no subsystem)
Flags:       	(no flags)

Data segments:
  0: crypt
	offset: 16777216 [bytes]
	length: (whole device)
	cipher: aes-xts-plain64
	sector: 4096 [bytes]

Keyslots:
  1: luks2
	Key:        512 bits
	Priority:   normal
	Cipher:     aes-xts-plain64
	Cipher key: 512 bits
	PBKDF:      argon2id
	Time cost:  12
	Memory:     1048576
	Threads:    4
	Salt:       dc 6b 2b 54 94 63 3a e7 1b f1 c4 c3 5e 43 00 f6
	            fc 54 75 da f6 ba 7a 13 3e bb 72 b1 1d 7c 60 ba
	AF stripes: 4000
	AF hash:    sha256
	Area offset:32768 [bytes]
	Area length:258048 [bytes]
	Digest ID:  0
Tokens:
Digests:
  0: pbkdf2
	Hash:       sha256
	Iterations: 332998
	Salt:       ae 91 73 42 d5 d6 ed b7 83 d5 f2 43 3b 18 04 87
	            e2 40 26 23 80 e7 ae 7f a3 4f 20 d8 19 1c ab 9d
	Digest:     4f e6 a3 83 40 7a d4 65 24 84 dc 69 e4 f3 43 a7
	            c2 2e 28 ee e2 94 7b 9d 4d b8 4e 96 14 aa 46 6a
The data segment offset is 16777216 bytes, which is indeed 16MiB:
16777216B / 1024 / 1024 bytes -> KiB -> MiB = 16MiB
[root@the-server ~]# cryptsetup open /dev/md127 srv.raid-80tb
Enter passphrase for /dev/md127:
Before we start using it for real we are going to fill the whole RAID with encrypted zeros, and for that we want to ensure that the RAID can write with its maximum possible speed
We do this by increasing the size of the stripe cache, because its default size usually bottlenecks the RAID write speed
The stripe cache is a reserved area in RAM where writes to a specific RAID are collected before they are written to the disks, with the goal of accumulating whole stripes that can then be written faster
It is set to a default size of 256 when the RAID is assembled, and can be changed while the RAID is online
…Holup, 256 what? Chunks, stripes, stripes-with-parity, MiB, bananas? How much RAM is that? Unfortunately official documentation seems scarce, but hearsay has it it's (size * memory page size * total RAID disk count), so with the usual page size of 4KiB that would be (256 * 4KiB * 6) = 6144KiB = 6MiB, which does seem small indeed
We'll crank it to 8192 which according to that formula will use (8192 * 4KiB * 6) = 196608KiB = 192MiB of memory
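According to that (admittedly unofficial) formula, the default and our chosen value work out to:

```shell
page_kib=4   # usual memory page size in KiB
disks=6      # total RAID member disk count
for size in 256 8192; do
    echo "stripe_cache_size=$size -> $(( size * page_kib * disks )) KiB"
done
```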
[root@the-server ~]# echo 8192 > /sys/class/block/md127/md/stripe_cache_size
Important: This setting is not persisted into the RAID configuration and must be set again each time after the RAID has been assembled
Typically you write a udev rule that does this and is triggered after the RAID has been assembled
Example: Text file "/etc/udev/rules.d/60-md-stripe-cache-size.rules"
# Set the RAID stripe cache size to 8192 for any RAID that is assembled on this system
SUBSYSTEM=="block", KERNEL=="md*", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}!="8192", ATTR{md/stripe_cache_size}="8192"
Now we fill the opened dm-crypt volume with zeros. This causes the underlying RAID device to be filled with what looks like random data (the encrypted zeros), disguising how much space is actually used
[root@the-server ~]# dd if=/dev/zero iflag=fullblock of=/dev/mapper/srv.raid-80tb oflag=direct bs=128M status=progress
79989336702976 bytes (80 TB, 73 TiB) copied, 159750 s, 501 MB/s
dd: error writing '/dev/mapper/srv.raid-80tb': No space left on device
595968+0 records in
595967+0 records out
79989454143488 bytes (80 TB, 73 TiB) copied, 159751 s, 501 MB/s
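The final byte count reported by dd checks out: it is the RAID's size minus the 16MiB LUKS2 data segment offset:

```shell
array_bytes=$(( 78114717696 * 1024 ))   # Array Size from mdadm --detail is in 1 KiB blocks
luks_offset=$(( 16 * 1024 * 1024 ))     # LUKS2 data segment offset in bytes
echo $(( array_bytes - luks_offset ))   # prints 79989454143488, as reported by dd
```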
XFS (and other file systems too) can apply optimizations for underlying RAID chunks and stripes if they know about them, which is what we want
Fortunately for us, mkfs.xfs correctly detects the RAID's chunk size and stripe width automatically, even "through" the dm-crypt layer
[root@the-server ~]# mkfs.xfs -L raid-80tb -m bigtime=1,rmapbt=1 /dev/mapper/srv.raid-80tb
meta-data=/dev/mapper/srv.raid-80tb isize=512    agcount=73, agsize=268435328 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=19528675328, imaxpct=1
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
In the output, review the "data" section:
Does that match the RAID geometry, which is required for good performance?
The XFS "stripe unit" (sunit) is the RAID's "chunk size" as it is listed in the detailed RAID information (512KiB); they must be the same
Here it is expressed in blocks, so (128 * 4KiB block size) = 512KiB, same as the RAID chunk size, that tracks
As mentioned above, the stripe width is the stripe unit (RAID chunk size) multiplied by 4 effective disks, (512KiB * 4) = 2MiB
The mkfs.xfs output lists the stripe width (swidth) as 512 blocks, which is (512 * 4KiB block size) = 2MiB, so yes, we are good
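Both conversions in one place, from XFS blocks (4KiB each) to the RAID geometry:

```shell
block_kib=4                          # XFS block size: 4096 bytes
echo $(( 128 * block_kib ))          # sunit:  prints 512 (KiB), the RAID chunk size
echo $(( 512 * block_kib / 1024 ))   # swidth: prints 2 (MiB), the stripe width
```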
This does not require any fancy business with mount options, the defaults work fine
For example, to mount the XFS file system at "/mnt/raid-80tb" we'd do this:
[root@the-server ~]# mount /dev/mapper/srv.raid-80tb /mnt/raid-80tb
We can see the stripe unit (sunit) and stripe width (swidth) in the mount options when we look at the mounted file system:
[root@the-server ~]# mount | grep /mnt/raid-80tb
/dev/mapper/srv.raid-80tb on /mnt/raid-80tb type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=4096,noquota)
This time though they are expressed as a number of 512-byte blocks; this is documented in the "MOUNT OPTIONS" section of the xfs(5) manual page
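Converting those mount-option values from 512-byte units confirms they match the RAID geometry:

```shell
sector_b=512
echo $(( 1024 * sector_b / 1024 ))          # sunit:  prints 512 (KiB) = chunk size
echo $(( 4096 * sector_b / 1024 / 1024 ))   # swidth: prints 2 (MiB) = stripe width
```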