Researching Hardware

We’re beginning to work with a local hardware vendor to put together a basic computer to serve as the go-to replacement for any business workstation. The machine will have an Intel Atom processor, 4GB of memory, a 60GB SSD, and run Windows 7 Pro 64-bit. The best news about this machine is that it has NO moving parts. With an entirely solid-state machine, there will be far less heat generated and absolutely no noise during any mode of operation. Also, since there are no moving parts, the useful life of the machine may be pushed out beyond 4 years, depending on the use.

No more worrying about fans stopping, hard drives failing, or systems overheating. This box should let you simply do your work.

SSDs are the Future

Solid State Drives aren’t a new technology. What is new is their recent acceptance into the enterprise market. Gone are the days when SSDs were limited to experimental or military applications. Savvy IT admins are starting to use them in laptops, desktops, and even their beloved servers. What brought about this change?

Two key elements of SSD technology had to change radically from what they were a few years ago: price and write I/O performance. We’ll discuss the latter first, for absolutely no reason at all.

Most solid state storage technologies have traditionally had poor write I/O performance. The already low performance dropped even lower over time and use, as sectors were abandoned but still held old data. Those sectors required a “wipe” to clean them of any old data before new data could be written. Since operating systems still aren’t very intelligent about where on the drive they store data, this could lead to several wipe/verify/write operations any time data needed to be written to the storage device. This, coupled with write throughput that was already poor compared to spinning magnetic media, left the user on a strict I/O diet. SSDs needed to leave South Beach and head for a land that allows for larger appetites.

In the past two years, several major advances have come forth in the write I/O department. First, there is wear leveling, in which the drive’s storage controller spreads writes across sectors on a roughly equal basis. This should cause every sector to be written a similar number of times, which extends the life of each sector. In addition, we now have on-controller garbage collection, which searches for abandoned sectors and performs a wipe/verify cycle on them. Doing this in advance of the need to write to the sector reduces the number of I/Os per write from 3 to 1, which has an obvious impact when compared to a storage device with no garbage collection. As well, many manufacturers are, in essence, splitting up flash memory into several independent groups and internally RAID’ing them together in a RAID 0 fashion. RAID 0 doesn’t take much calculation or processing power, so it is easy for the storage controller to handle this task.
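To make the wear leveling and garbage collection ideas a little more concrete, here is a toy sketch in Python. It is not how any real SSD controller is implemented; the class name, the block counts, and the "always pick the least-worn free block" policy are all assumptions made purely for illustration.

```python
class WearLevelingController:
    """Toy flash controller (hypothetical): writes go to the least-worn free block."""

    def __init__(self, block_count):
        self.erase_counts = [0] * block_count  # wipes performed on each physical block
        self.mapping = {}                      # logical block -> physical block
        self.free = set(range(block_count))    # physical blocks already wiped and ready

    def write(self, logical_block):
        # Wear leveling: choose the free block with the fewest wipes so far.
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        # Garbage collection: the block holding the old copy is now abandoned,
        # so wipe it ahead of time and return it to the free pool.
        old = self.mapping.get(logical_block)
        if old is not None:
            self.erase_counts[old] += 1
            self.free.add(old)
        self.mapping[logical_block] = target
        return target


controller = WearLevelingController(block_count=8)
for _ in range(80):
    controller.write(logical_block=0)  # hammer one logical block over and over

wear_spread = max(controller.erase_counts) - min(controller.erase_counts)
```

Even though all 80 writes target the same logical block, the wipes are spread over all eight physical blocks, and the most-worn and least-worn blocks end up within one wipe of each other.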

Pricing for solid state storage needed to drop as well. There was a reason I mentioned military applications previously: the military was generally the only entity that could afford to use solid state storage. That’s not the case today, thankfully, as manufacturers have streamlined and refined their manufacturing process to yield more output with less input. We now see even largish storage devices (250GB and up) available for under $1,000.

What, you say? That’s too expensive still? Consider this, then. If one 250GB SSD can deliver the same I/O performance as an 8-drive RAID 10 array, does that change your tune? How about if the same SSD could outperform a 10- to 16-drive RAID array? What then?

The reality is that SSD performance and price have reached a point where it is no longer financially justifiable to spend money on spinning magnetic media and all of the trimmings it requires. OK, plain old hard drives are still best in the quantity department, but that’s where they are destined to retire…just a lot of dumb, slow storage that people don’t even want to touch. Sounds like a DLT tape to me.

On a more personal level, we use SSDs in many of our workstation systems already. Our first production server to use SSDs is going into service this summer. Plans are in place to upgrade additional servers with SSD-based storage arrays. SSDs are the future, so get on board.

When RAID is barely AID

It has been my general experience in years past that RAID arrays seem to have a particular mystique about them. Firstly, most admins rely on them heavily, assuming that they are protected from the evils of hard drive failure. I used to be blindly trusting in RAID hardware controllers, particularly when using RAID level 5. I’ve come to the sad realization, however, that RAID is a mis-defined acronym. First, a little history…

RAID, as most would agree, stands for “Redundant Array of Inexpensive Disks”. I’ve heard some complaints from drive manufacturers recently about the use of the word “inexpensive”, as you can purchase very expensive drives to be placed in RAID arrays. The whole concept behind RAID was that we could store data on more than one drive so that we could sustain drive failure and, hopefully, not lose any data. We used inexpensive disks purely because they were cheap, and now that we had redundancy, we didn’t have to bet our jobs on them. These multiple drives showed up as an “array”, or a group of drives, to the host operating system, so they were managed as a single unit within the operating system. Cool…now I can buy a bunch of cheap disks, I don’t have to do anything fancy to get them to show up as one logical unit to my operating system, and a drive can go bad without causing data loss.

Admins ecstatically implemented RAID on many, if not all, of their servers. After all, why would you rely on a single copy of the component that is most likely to fail in your computer? For those who are new to that concept, you should look at your hard drive’s yearbook and see that it was indeed voted “Most likely to fail” by the rest of its class. The biggest question in most IT admin brains was, “What RAID level should I use?” Those already familiar with RAID levels will know that the most widely used RAID levels are probably 0, 1, 5, and 10, all for different reasons. To understand the failure of RAID in general, we have to understand a little about those levels.

RAID 0, at the bottom of the totem pole, shouldn’t be called RAID at all. Why? Simple: there is no redundancy of data. Data is split across all disks in the array with no regard for integrity. If one sector of one of the disks is unreadable, any file with data in that sector is unreadable. If one disk fails or goes offline for any reason, the entire array is lost. For obvious reasons, this isn’t a good choice for business use. RAID 0 is, and should be, primarily limited to media editing systems where speed is paramount and all actual data is stored outside of the array (on stock video tapes, for example).
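A rough sketch of that striping behavior, in Python, may help. The function names and the 4-byte chunk size are made up for this illustration; real controllers stripe in much larger chunks and work on disk blocks rather than byte strings.

```python
def stripe(data, disk_count, chunk_size=4):
    """Split data into fixed-size chunks and deal them round-robin onto the disks."""
    disks = [[] for _ in range(disk_count)]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for index, chunk in enumerate(chunks):
        disks[index % disk_count].append(chunk)
    return disks


def read_back(disks):
    """Reassemble the data, or return None if any disk is gone (marked as None)."""
    if any(disk is None for disk in disks):
        return None  # no redundancy: one missing disk takes the whole array with it
    result = b""
    longest = max(len(disk) for disk in disks)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                result += disk[row]
    return result


disks = stripe(b"ABCDEFGHIJKLMNOPQRSTUVWX", disk_count=3)
intact = read_back(disks)          # all three disks present
disks[1] = None                    # simulate one failed disk
after_failure = read_back(disks)   # nothing usable comes back
```

Every disk holds a slice of every large file, which is where the speed comes from and also why losing any single disk leaves nothing readable.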

RAID 1 was devised as the first real RAID level. True redundancy is achieved by mirroring your data across two drives simultaneously. Now, without any hesitancy, the IT brave will yank a hard drive out of an array and beam with pride as their server continues to run. I’ll discuss what happens to the IT brave if it doesn’t run a little later on…

RAID 5 is next in numeric order. Here, the definition of how this level is achieved is somewhat complex. There is a 3 drive minimum for a RAID 5 array, and theoretically no maximum. The reason for the minimum is that data is written across two of the drives in the same way it would be if they were RAID 0’ed together. The special sauce comes into play on the third disk, to which “parity” information is written. (In practice the parity is rotated across all of the drives rather than kept on one dedicated disk, but the idea is the same.) No, parity is not a Catholic term, nor is it a politically correct term for “union”. Parity is a mathematically derived bit that can be thought of as the difference between the bits on the first two disks. Remember those Algebra days?

1 + 2 = 3

If I didn’t know the value of any one of those digits, I could calculate it mathematically. For example:

x + 2 = 3
Subtract 2 from both sides of the equation to give you x = 1

RAID 5 works in exactly the same way, running a simple equation to come up with a parity bit that is based on the two bits written to the first two drives. If I lose one of the drives in that array, for whatever reason, I can calculate the bit values for anything that was on that drive based on the parity and the remaining readable bit.
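The algebra above is an analogy. RAID 5 parity is normally computed with XOR rather than addition and subtraction, but the recovery trick is the same: any one missing value can be worked out from the other two. Here is a minimal Python sketch, where the helper name and the sample bytes are invented for illustration.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings together, byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


disk1 = b"\x01\x02\x03\x04"          # data on the first drive
disk2 = b"\x10\x20\x30\x40"          # data on the second drive
parity = xor_bytes(disk1, disk2)     # what gets written as parity

# Lose either data drive: XOR the survivor with the parity to get it back.
rebuilt_disk1 = xor_bytes(disk2, parity)
rebuilt_disk2 = xor_bytes(disk1, parity)
```

With two of the three pieces gone there is only one known value left in the equation, so nothing can be rebuilt. That is the single-failure limit of RAID 5.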

RAID 10, thought to be the only “enterprise-grade” RAID level for such a long time, creates its redundancy a bit differently. First, there is a 4 disk minimum for RAID 10. These disks are split into two groups. The two disks in each group are RAID 1’ed together. After that is done, both groups of RAID 1 are RAID 0’ed together for greater speed and capacity. See, RAID 1 + RAID 0 = RAID 10. That’s not very funny…especially from someone who’s supposed to be good at math.
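The mirror-then-stripe layout can be sketched in a few lines of Python. The function names are hypothetical, and pairing adjacent disks (0 with 1, 2 with 3) is an assumption for the example; a real controller may group its disks differently.

```python
def raid10_targets(logical_chunk, disk_count=4):
    """Return the two disk numbers that hold a copy of a given logical chunk."""
    if disk_count < 4 or disk_count % 2:
        raise ValueError("RAID 10 needs an even number of disks, at least 4")
    pair_count = disk_count // 2
    pair = logical_chunk % pair_count   # RAID 0 step: pick which mirrored pair
    return (2 * pair, 2 * pair + 1)     # RAID 1 step: write to both disks in it


def array_survives(failed_disks, disk_count=4):
    """The array keeps running unless both disks of some mirrored pair are lost."""
    failed = set(failed_disks)
    for pair in range(disk_count // 2):
        if {2 * pair, 2 * pair + 1} <= failed:
            return False
    return True


first = raid10_targets(0)               # chunk 0 lands on the first pair
second = raid10_targets(1)              # chunk 1 lands on the second pair
one_per_pair = array_survives([0, 2])   # one disk lost from each pair
whole_pair = array_survives([0, 1])     # both disks of one pair lost
```

On paper a 4 disk RAID 10 can lose two disks and keep going, as long as they are not the two halves of the same mirror.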

I’ve only covered the absolute basics in my discussion of RAID levels. There are many more details that completely explain how RAID controllers are able to achieve these bits of wizardry. My reason for explaining them is to introduce my beef with the actual implementation of RAID by our beloved hardware manufacturers.

Experience is a hard teacher. It rarely brings us “ahhhhh” moments without a preceding “ouch” moment. To date, I have run into 4 separate experiences where a RAID array I managed was anything but Redundant. In each of these instances, one drive of a RAID 5 array failed completely. Most whined, some kicked, but all were completely dysfunctional. In each case, given my explanation of RAID 5 above, my data shouldn’t have been harmed, right? If only the world ran at theoretical maximums all the time…

In each of those 4 experiences, the complete failure of one disk led to the failure of another, either immediately or within several seconds. Since these arrays were all RAID 5, two drives lost equals complete array failure. There simply aren’t enough known bits to re-calculate what the missing bits are. In two of those cases I was able to recover data, though one of the arrays had to go to a data recovery specialist to be revived. In the other two, I was not successful in recovering any data at all, period.

Some may think it ironic that I would lose two drives basically simultaneously. I would as well, but it has happened to me now 4 times with different sets of hard drives (Seagate, Maxtor, Western Digital, Fujitsu), RAID controllers, and host systems. There is no commonality amongst the experiences…different physical locations, different hardware manufacturers (LSI Logic, Dell, HP, Adaptec, Broadcom/Ciprico), different operating systems (SUSE Linux 9.0, openSUSE 10.2, Windows Server 2003, Windows 2000 Server). The situations couldn’t be more unique than they are, but nonetheless, all failed in the same way. Irony? No, I don’t think so. Curious? Yes, definitely. Conspiracy? Sure, if you’re into that sort of thing.

Maybe some hardware vendor who takes pity on me can explain to me why this is even possible. Back in the days when we were creating RAID arrays using IDE or SCSI disks, it would be more believable, as multiple disks shared one or more communications buses to the disk controller. It’s understandable that a physical or electrical interruption caused by one disk could hamper a disk controller from communicating with the other disks on the bus. In today’s SATA world, however, where each disk has its own bus, there should be no possibility for that kind of interruption.

Nevertheless, all 4 of my experience-building failures were on SATA/SAS-based systems. Something tells me the buses are not as discrete as the manufacturers claim. Even in situations where one or more disks could be brought back online, the RAID controllers all had to be coerced into bringing the entire array online so it could be used again. How can we, as IT professionals, be expected to believe any product is not capable of identical failures? If you are like me, you are suspicious of all RAID controllers and hard drives in general. You assume that you’re going to lose data, so you create and maintain two backup copies of everything.

How about you? Can your disks kill bugs, or do they test HIV+?

The little battery backup that couldn't

APC, close your eyes and cover your ears. I’m about to tell a little story about your products.

Once upon a time, there was a cute little battery backup unit. This little unit was rated at 750 watts worth of backup power. Its owner was practically religious when it came to having backup power, as power “events” happened frequently where he lived. The owner happily took the battery backup unit home and plugged it into the wall. Into the “battery protected outlets”, he plugged one computer and one LCD flat-panel monitor. He was so happy because he had picked out a battery backup unit with a higher power rating than the rated draw of the devices he plugged into it. He thought, “There now, I’m safe from all the bad little power events.”

One day, when the owner was working on his computer, a bad power event hit his home. Instead of the comforting “I think I can, I think I can” coming from his battery backup for those 4 seconds, it died. The owner was perplexed. He had heard no warning toots of the horn, no flashing lights indicating overload, and the management software happily reported 9 minutes of runtime. The owner was not amused, for he received not even 9 seconds of runtime. Now, the owner was faced with paying the recycling company money to take his battery backup unit to the scrap yard.

Unfortunately, this story is not the first of its kind for the “owner”. I’ve had roughly a half-dozen of these things die on me in similar, if not identical, ways. Would it kill the manufacturers to put in a simple warning circuit that would beep when the load is too high for the battery to handle? How much, in component costs, could that possibly add to their cost in building their devices? How much would it add to consumer confidence in their products?

I wish I had a good alternative to purchasing these units, but I don’t want to buy the $200+ models to provide backup power to my cable modem. They’re too big and way too overpriced. I have about a dozen of the smaller units around my house and three of the 2200s in my basement for test systems. It almost makes me want to buy a Liebert unit to power my whole house, but that’d be a tad expensive.

I’ve tried manufacturers other than APC as well: Tripp Lite, Connext, and probably a few other no-namers. All of them suffered the same fate. I have a feeling they’re all made by the same company anyway.

It seems to me that this shouldn’t be that complicated a problem. After all, I bought my little battery backup unit to get rid of my power problems, not introduce some.
