When RAID is barely AID
It has been my general experience in years past that RAID arrays seem to have a particular mystique about them. Most admins rely on them heavily, assuming that they are protected from the evils of hard drive failure. I used to trust RAID hardware controllers blindly, particularly when using RAID level 5. I’ve come to the sad realization, however, that RAID is a mis-defined acronym. First, a little history…
RAID, as most would agree, stands for “Redundant Array of Inexpensive Disks”. I’ve heard some complaints from drive manufacturers recently about the use of the word “inexpensive”, as you can purchase very expensive drives to be placed in RAID arrays. The whole concept behind RAID was that we could store data on more than one drive so that we could sustain a drive failure and, hopefully, not lose any data. We used inexpensive disks purely because they were cheap, and now that we had redundancy, we didn’t have to bet our jobs on any one of them. These multiple drives showed up as an “array”, or a group of drives, to the host operating system, so they were managed as a single unit within the operating system. Cool…now I can buy a bunch of cheap disks, I don’t have to do anything fancy to get them to show up as one logical unit to my operating system, and a drive can go bad without causing data loss.
Admins ecstatically implemented RAID on many, if not all, of their servers. After all, why would you rely on a single copy of the component that is most likely to fail in your computer? For those who are new to that concept, you should look at your hard drive’s yearbook and see that it was indeed voted “Most likely to fail” by the rest of its class. The biggest question in most IT admin brains was, “What RAID level should I use?” Those already familiar with RAID levels will know that the most widely used RAID levels are probably 0, 1, 5, and 10, all for different reasons. To understand the failure of RAID in general, we have to understand a little about those levels.
RAID 0, at the bottom of the totem pole, shouldn’t be called RAID at all. Why? Simple: there is no redundancy of data. Data is split across all disks in the array with no regard for integrity. If one sector of one of the disks is unreadable, any file with data in that sector is unreadable. If one disk fails or goes offline for any reason, the entire array is lost. For obvious reasons, this isn’t a good choice for business use. RAID 0 is, and should be, primarily limited to media editing systems where speed is paramount and all actual data is stored outside of the array (on stock video tapes, for example).
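To make the "one disk down, everything down" point concrete, here is a minimal Python sketch of striping. It is only a toy model, not how any real controller is written: the function names `stripe` and `read_back` and the 4-byte chunk size are my own assumptions for illustration.

```python
from typing import List, Optional


def stripe(data: bytes, num_disks: int, chunk_size: int = 4) -> List[List[bytes]]:
    """Deal fixed-size chunks of data round-robin across num_disks disks."""
    disks: List[List[bytes]] = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), chunk_size):
        chunk_index = offset // chunk_size
        disks[chunk_index % num_disks].append(data[offset:offset + chunk_size])
    return disks


def read_back(disks: List[Optional[List[bytes]]]) -> bytes:
    """Reassemble the original data; a missing disk (None) is fatal."""
    if any(disk is None for disk in disks):
        raise IOError("RAID 0 array lost: a member disk is missing")
    out = []
    for row in range(max(len(disk) for disk in disks)):
        for disk in disks:
            if row < len(disk):
                out.append(disk[row])
    return b"".join(out)


data = b"ABCDEFGHIJKLMNOPQRST"
disks = stripe(data, num_disks=3)
print(read_back(disks) == data)  # all three disks present, data comes back intact
```

Set any one entry of `disks` to `None` and `read_back` raises, because every third chunk of the file simply no longer exists anywhere.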
RAID 1 was devised as the first real RAID level. True redundancy is achieved by mirroring your data across two drives simultaneously. Now, without any hesitancy, the IT brave will yank a hard drive out of an array and gleam with pride as their server continues to run. I’ll discuss what happens to the IT brave if it doesn’t run a little later on…
RAID 5 is next in numeric order. Here, the definition of how this level is achieved is somewhat complex. There is a 3 drive minimum for a RAID 5 array, and theoretically no maximum. The reason for the minimum is that data is written across two of the drives in the same way it would be if they were RAID 0′ed together. The special sauce comes into play on the third disk, to which “parity” information is written. (In practice, RAID 5 rotates which drive holds the parity from one stripe to the next, but the idea is the same.) No, parity is not a Catholic term, nor is it a politically correct term for “union”. Parity is a mathematically processed bit that can be thought of as the difference between the bits on the first two disks. Remember those Algebra days?
1 + 2 = 3
If I didn’t know the value of any one of those digits, I could calculate it mathematically. For example:
x + 2 = 3
Subtract 2 from both sides of the equation to give you x = 1
RAID 5 works in much the same way, running a simple equation (an XOR, in practice) to come up with a parity bit that is based on the two bits written to the first two drives. If I lose one of the drives in that array, for whatever reason, I can calculate the bit values for anything that was on that drive based on the parity and the remaining readable bit.
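For the curious, the algebra above can be sketched in a few lines of Python. Real RAID 5 parity is an XOR rather than a subtraction, so that is what this sketch uses; the helper name `xor_blocks` and the four-byte sample blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two data blocks, one per data drive (sample values, purely illustrative).
drive1 = b"RAID"
drive2 = b"five"

# The parity block is what gets written to the third drive for this stripe.
parity = xor_blocks(drive1, drive2)

# Pretend drive1 died: rebuild its block from the survivor and the parity.
rebuilt = xor_blocks(drive2, parity)
print(rebuilt == drive1)
```

XOR is its own inverse, which is why the same function both creates the parity and solves for the missing "x".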
RAID 10, thought to be the only “enterprise-grade” RAID level for such a long time, creates its redundancy a bit differently. First, there is a 4 disk minimum for RAID 10. These disks are split into two groups. Each of the two disks in each group is RAID 1′ed together with the other disk in its group. After that is done, both groups of RAID 1 are RAID 0′ed together for greater speed and capacity. See, RAID 1 + RAID 0 = RAID 10. That’s not very funny…especially from someone who’s supposed to be good at math.
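One consequence of that layout is that whether a 4 disk RAID 10 survives a second failure depends on which disk goes. The Python sketch below counts the cases; the disk names and the `raid10_survives` helper are hypothetical, chosen only to mirror the two-groups-of-two description above.

```python
from itertools import combinations

# Hypothetical 4 disk layout: two mirrored pairs, striped together.
MIRROR_PAIRS = [("disk0", "disk1"), ("disk2", "disk3")]


def raid10_survives(failed: set) -> bool:
    """The array survives as long as every mirror pair keeps at least one disk."""
    return all(any(disk not in failed for disk in pair) for pair in MIRROR_PAIRS)


all_disks = [disk for pair in MIRROR_PAIRS for disk in pair]
two_disk_failures = list(combinations(all_disks, 2))
survivable = [f for f in two_disk_failures if raid10_survives(set(f))]

print(f"{len(survivable)} of {len(two_disk_failures)} two-disk failures are survivable")
# prints "4 of 6 two-disk failures are survivable"
```

Lose one disk from each pair and the array keeps running; lose both halves of the same mirror and the stripe is gone, exactly as with RAID 0.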
I’ve only covered the absolute basics in my discussion of RAID levels. There are many more details that completely explain how RAID controllers are able to achieve these bits of wizardry. My reason for explaining them is to introduce my beef with the actual implementation of RAID by our beloved hardware manufacturers.
Experience is a hard teacher. It rarely brings us “ahhhhh” moments without a preceding “ouch” moment. To date, I have run into 4 separate experiences where a RAID array I managed was anything but Redundant. In each of these instances, one drive of a RAID 5 array failed completely. Most whined, some kicked, but all were completely dysfunctional. In each case, given my explanation of RAID 5 above, my data shouldn’t have been harmed, right? If only the world ran at theoretical maximums all the time…
In each of those 4 experiences, the complete failure of one disk led to either the immediate failure of another or the failure of another within several seconds. Since these arrays were all RAID 5, two drives lost equals complete array failure. There simply aren’t enough known bits to re-calculate what the missing bits are. Two of those experiences I was able to recover data from, with one of them having to go to a data recovery specialist to revive. In the other two experiences, I was not successful in recovering any data at all, period.
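To see why two lost drives is one too many, here is a small Python sketch using XOR parity on a single byte from each drive. The byte values are arbitrary samples; the point is only how many answers fit once two of the three values are unknown.

```python
# One byte from each data drive, plus the XOR parity byte (sample values).
d1, d2 = 0x52, 0x66
parity = d1 ^ d2

# One drive lost: the equation has a single unknown, so there is one answer.
recovered_d1 = d2 ^ parity
print(recovered_d1 == d1)

# Both data drives lost: every pair (a, b) with a ^ b == parity fits equally well.
candidates = [(a, b) for a in range(256) for b in range(256) if a ^ b == parity]
print(len(candidates))  # 256 equally plausible pairs, only one of them is right
```

With one unknown the parity pins down the answer; with two unknowns it narrows nothing, and that holds for every byte on the array.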
Some may think it ironic that I would lose two drives basically simultaneously. I would as well, but it has happened to me now 4 times with different sets of hard drives (Seagate, Maxtor, Western Digital, Fujitsu), RAID controllers, and host systems. There is no commonality amongst the experiences…different physical locations, different hardware manufacturers (LSI Logic, Dell, HP, Adaptec, Broadcom/Ciprico), different operating systems (Suse Linux 9.0, OpenSuse 10.2, Windows Server 2003, Windows 2000 Server). The situations could hardly be more different than they are, but nonetheless, all failed in the same way. Irony? No, I don’t think so. Curious? Yes, definitely. Conspiracy? Sure, if you’re into that sort of thing.
Maybe some hardware vendor who takes pity on me can explain to me why this is even possible. Back in the days when we were creating RAID arrays using IDE or SCSI disks, it would be more believable, as multiple disks shared one or more communications buses to the disk controller. It’s understandable that a physical or electrical interruption caused by one disk could hamper a disk controller from communicating with the other disks on the bus. In today’s SATA world, however, where each disk has its own bus, there should be no possibility for that kind of interruption.
Nevertheless, all 4 of my experience-building failures were on SATA/SAS-based systems. Something tells me the buses are not as discrete as the manufacturers claim. Even in situations where one or more disks could be brought back online, the RAID controllers all had to be coerced into bringing the entire array online so it could be used again. How can we, as IT professionals, be expected to believe any product is not capable of identical failures? If you are like me, you are suspicious of all RAID controllers and hard drives in general. You assume that you’re going to lose data so you create and maintain two backup copies of everything.
How about you? Can your disks kill bugs, or do they test HIV+?