After writing about decentralized data storage and torrents, I’ve had data storage on my mind. I drafted this post while setting up a RAID for the MeLa research group today, and built RAIDs for DDoSecrets a few weeks ago, so there’s a ton of “data storage” in my life right now and it only seems natural to quickly write up how large data storage on a centralized server works.
Hard drives get more expensive as storage capacity increases, and traditional spinning-plate hard drives have a limited lifespan, because spinning pieces and motors wear out relatively quickly from friction. Hard drives can read and write data more quickly when they spin faster, but this also wears out the drive more quickly. Therefore, getting a very big hard drive with a long lifespan that’s also fast becomes prohibitively expensive.
What if instead we could slap together several cheap and unreliable hard drives, and out of the harmony of garbage get high-capacity, reliable, high-speed data storage? A Redundant Array of Inexpensive Disks, or RAID, does exactly this. But how?
The most obvious way to “combine” two hard drives is to (metaphorically) glue one to the end of another. When the first hard drive is filled, start writing data to the second. We can create a virtual hard drive consisting of two or more physical hard drives glued together in this way, and write data to the virtual drive as if it’s one big drive.
Alright, that’s easy, but it’s lacking in a few ways. First, there’s no redundancy: If one of the cheap hard drives in our stack of three fails, we’ve just lost a third of our data. Second, it seems slow: Accessing most files will only use one hard drive at a time, and couldn’t we get more performance by using multiple drives at once?
A striped RAID works the same way as a linear RAID, but it splits data across all drives equally. If you had two drives then you’d put the even bytes on the first drive and the odd bytes on the second drive. Then when reading or writing data you use both drives at once, for half as long, and so in theory get twice the speed!
In the real world we don’t write “even and odd bytes” but rather “even and odd chunks of bytes” called stripes, because that’s faster in practice. Same idea.
Redundancy is still a problem with stripes, perhaps even more than with linear RAIDs: If a hard drive dies we now lose “all the even chunks of every file”, which makes our remaining data just about worthless.
Mirroring creates a perfect backup of a drive. Every time we write data to one drive, we also write the data to all the backup drives. If one of the drives dies, we seamlessly continue using the rest. When we replace a dead drive, we copy all the data from the other drives in the mirror.
When reading data we can actually get performance similar to striping, by reading some chunks from one drive while reading other chunks from a second drive. Only data writes need to be synchronized across all the drives. Mirrors limit you to the size of a single drive (if you have three 1-TB drives that are all perfect copies of one another, you only get one terabyte of “real storage”), and the write speed of a single drive, but the combined read speed of all your drives.
If we have four hard-drives we can easily combine last two strategies: Create two sets of mirrors, and stripe data across them.
We get the storage capacity of two out of the four drives, the write speed of two drives, the read speed of four drives, and redundancy. The redundancy is slightly unintuitive: We lose nothing if any one drive fails, and we lose nothing if a second drive fails and it wasn’t the mirrored copy of the first drive that failed. In other words, as long as we’re lucky and we still have a full copy of the data across some combination of drives, then we’re okay.
With more hard drives comes more flexibility. Six hard drives can be organized as three mirrors of two drives each, or two mirrors of three drives each. The administrator chooses a trade off between more storage, more speed, and more redundancy.
Don’t do this. If we reverse the order, striping two drives together and then mirroring the data to another two drives, we conceptually get the same result as a RAID 10. In practice however, a RAID 01 is more fragile. In most implementations, when one half of a stripe fails, the other half is disabled, too. Critical metadata, which tracks what stripes were allocated and placed on which drives, was kept on the now-dead drive, shattering the puzzle. Therefore when one drive in a RAID 01 fails, its striped partner also shuts down, reducing the 01 to a RAID 0. Don’t use 01, use 10.
A RAID 5 distributes chunks across three or more drives so that any one drive can be lost without losing data. For example, assuming we have three drives, we can store even chunks on drive 1, odd chunks on drive 2, and the XOR of the two chunks on drive 3. Given any two drives, the information on the third drive can be re-created.
This lets us keep two thirds of the storage from our three drives, along with the read and write speed of two drives. Mostly a better trade off than a striped mirror! With four or more drives the dividends are even better, since we’ll get the storage capacity and read and write speed of three drives, then four, etc.
Same idea as RAID 5, but the parity blocks (the XOR of the data blocks) are written to at least two drives. This means two drives can fail without losing data, and a RAID 6 is only possible with at least four drives.
Alright, we covered RAID 0, 1, 5, and 6, what happened to 2, 3, and 4? They’re all poor designs that have been retired in favor of 5 and 6. Here’s a brief run-down:
Same idea as a RAID 5, except information is striped at the bit-level across all the drives, and the drives use Hamming codes to provide redundancy and error correction. This means that all the drives must spin in sync, so you can only access one file at a time, and reading and writing at a bit level makes the configuration slower.
Same as RAID 2, but stripes at a byte-level instead of bit, and stores XORs of the bytes on the last drive, which is a dedicated “parity drive”. Again requires that all drives spin in sync.
Same as RAID 3, but stripes at a block-level instead of byte. This means read performance is much better (you can frequently read blocks for one file from two drives while reading blocks for another file from a third drive), but write performance is still poor, since parity blocks are all stored on a single drive, which must be used for all writes.
Traditionally to build a RAID you need a special “RAID card” on the computer, which connects to all the relevant drives and implements the RAID, presenting a single “virtual hard drive” to the motherboard and operating system. On more modern systems you can produce a “software RAID” where the operating system has access to the individual drives and produces a RAID on its own, using tools like mdadm or ZFS. This is sometimes more efficient, especially with ZFS, where the filesystem and RAID software are integrated and can read and write more efficiently than with a virtual disk.
Choosing the type of RAID you want is a decision about how much redundancy you need, versus capacity and speed. Many servers have multiple RAIDs for different purposes. One additional consideration is that most computers can only boot off of mirrored RAIDs. This is because the BIOS, the code burned into the motherboard that initializes enough of the hardware to find the operating system and start it, is very small and not so clever. Stripes and RAID 5 clusters are complicated, but a drive from a mirror can be treated like a single independent drive. The BIOS finds one drive in the mirror and uses it to start the operating system, which then realizes it’s on a mirrored RAID and picks up the other drives.
Therefore, one common server configuration is to use two or more SSDs in a mirrored RAID for booting. These drives contain the operating system and all software, can be read at an absolutely blazing speed, and have redundancy because of the mirror. Then additional conventional drives are placed in a RAID 5 or 6 for a decent trade on performance and capacity, creating a larger pool of drives for data.