See all this in action and calculate your own?
Please visit Storage Availability Calculator

Technical White Papers

Maximum Achievable Storage Array System Availability

by Nash R. Radovanovic

 

 

Are You Available?

 

What is availability? Or, an even harder question, what is High Availability (HA)?

Try searching the web and you will get my point pretty quickly. Millions of web pages mention the phrase, or so it appears. Vendors that have paid for web advertising appear at the top of the list. Nonetheless, you may have gained more than you bargained for: information clutter. So, what can you unearth about HA in the end? What is "High Availability"? The most patient of us may find out, but the majority will certainly give up before learning the true meaning of the phrase.

 

So many people use the HA words nowadays that it sometimes makes you wonder if anyone actually knows what they mean. A quick look in the dictionary reveals that the word available means "present and ready for use". Furthermore, though not quite along the same research path, the word availability appears to be a member of the "ilities" family. No matter who you talk to, vendors and the like will tell you about their latest and greatest technological silver bullet for all your "ilities": Scalability, Availability, Serviceability, an ility for this and that. Very few will mention any numbers to quantify the claim. Very few customers will know how to define their "Highly Available" requirements anyhow. It is not uncommon to hear veterans of the IT field refer to their solution as "Highly Available" and then expand on it: "from 9 to 5, Monday to Friday". High Availability (HA) as a concept is extremely simple but, for some reason, hard for most people to truly understand.

Highly Available or Hardly Available

 

So what does it really mean? The phrase High Availability should be used to describe any system or technological solution that can successfully service its business tasks more than 99% of the time over the course of a year. No ifs, buts or elses. To be mathematically correct and quantifiable, this translates into a maximum of 3.65 days (or 87.6 hours) of accumulated system downtime or unavailability, for any reason, during one year of service.

1 year = 365 days; 365 days x 0.01 (1%) = 3.65 days; 3.65 days x 24 h = 87.6 h

So, to reflect on the previously mentioned example, "9 to 5, Monday to Friday" does not qualify as anything remotely HA. Such a system can be characterized as "Business Critical" for a given time frame, but definitely not as Highly Available. The methods used to ensure the availability of a Business Critical system during that time frame may be similar to the methods used to achieve HA, but that is no justification for calling the solution Highly Available. Achieving availability during core business hours is much easier and requires far fewer resources than achieving a truly HA system. Use the HA attribute very carefully in your organization, as inappropriately used HA wording will generate some dangerous misconceptions. First, it may mislead management into believing that a particular system is truly HA when in fact it is not. Second, it may mislead the whole organization into believing that achieving HA is easy and possible with limited resources.

Both misconceptions are dangerous and can have negative consequences for both your organization and yourself when the first unplanned downtime hits. To find out what your allowable cumulative annual downtime is, here is a handy table.

 

| Availability | Allowable downtime (on annual basis) |
| --- | --- |
| 99% (two nines) | 3.65 days / 87.6 hours / 5,256 minutes |
| 99.9% (three nines) | 0.365 days / 8.76 hours / 525.6 minutes |
| 99.99% (four nines) | 0.876 hours / 52.56 minutes |
| 99.999% (five nines) | 5.256 minutes / 315.36 seconds |
| 99.9999% (six nines) | 0.5256 minutes / 31.536 seconds |
| 99.99999% (seven nines) | 3.1536 seconds |
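The figures in the table follow from multiplying the hours in a year by the permitted unavailability. A minimal sketch of that arithmetic, assuming a 365-day year as the table does (the function name `allowable_downtime_hours` is illustrative, not part of any tool mentioned here):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored, as in the table


def allowable_downtime_hours(availability_percent: float) -> float:
    """Maximum cumulative downtime per year, in hours, for a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Print the downtime budget for each availability level in hours, minutes, seconds.
for pct in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    hours = allowable_downtime_hours(pct)
    print(f"{pct}%: {hours:.4f} h / {hours * 60:.3f} min / {hours * 3600:.2f} s")
```

The same function can be used to check a service-level target before it is promised: a requirement of 99.95%, for instance, leaves a budget of about 4.4 hours per year.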

 

Storage Arrays


 


Why storage arrays? Because storage is at the heart of any Information Technology (IT) project. That's where the things that matter live. That's where you store your data and logic. That's what makes or breaks any solution. With the amount of data in the enterprise increasing almost exponentially, the amount of required storage has climbed to new highs as well. What used to be 1 MB is now 1 GB. What used to fit on a diskette now cannot fit on a CD. The whole data center capacity of 20 years ago now fits into an MP3 player or a memory stick for your camera. With such capacity requirements came the need for huge quantities of hard disks. As with anything that comes in large numbers, both the good and the bad became more visible.

One inevitable thing that happens to any disk device is the eventual death of the head or read/write mechanism. Generally speaking, the failure rate of a disk drive over its life can be plotted as a "bath tub" diagram. It got its name, as you may have guessed, from its bath tub shape (see figure below). The majority of mass-produced devices fail at a constant rate during their expected useful lifetime. However, their reliability is significantly lower during the so-called "burn-in" period, as well as when they reach the end of their expected lifetime. During those periods the failure rate changes exponentially with time.

MTBF - May The Best Fail?

 

One significant and commonly quoted number reflects the failure rate of devices during their expected life span. It is called Mean Time Between Failures, or MTBF. Over the useful life this number is treated as a constant (the straight horizontal line in our "bath tub" diagram). It does not mean that an individual drive will run for that many hours. Rather, it describes a large population: the total number of operating hours accumulated by all devices, divided by the number of failures observed. For example, in a group of 100,000 drives with an MTBF of 1,000,000 hours operating at the same time, a failure can be expected roughly every 10 hours. MTBF is expressed in hours and therefore feeds directly into any further calculation of the reliability or availability of a single device or a group of devices. It is the number usually quoted by hard disk manufacturers and should guide you, as the consumer, when making an acquisition decision.

Recently, the MTBF figures quoted by disk manufacturers have been around 500,000 hours for IDE drives and around 1,000,000 hours for SCSI drives. These numbers are a huge improvement over the 100,000 to 150,000 hours quoted for the same classes of disks as recently as 10 years ago. Only with these technological improvements did it become possible to even think of building storage arrays of the size we see today with reasonable reliability and an expectation that such a system be highly available.
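To put these MTBF figures in perspective, the sketch below estimates how many drive failures a fleet might see per year. It assumes the constant failure rate of the flat part of the bath tub curve. The fleet size of 1,000 drives and the function names are hypothetical, chosen only for illustration.

```python
HOURS_PER_YEAR = 8760


def expected_failures_per_year(drive_count: int, mtbf_hours: float) -> float:
    """Expected number of drive failures per year at a constant failure rate."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours


def hours_between_failures(drive_count: int, mtbf_hours: float) -> float:
    """Average time between failures anywhere in the fleet."""
    return mtbf_hours / drive_count


FLEET_SIZE = 1_000  # hypothetical fleet, not a figure from the paper
for label, mtbf in (("IDE", 500_000), ("SCSI", 1_000_000)):
    per_year = expected_failures_per_year(FLEET_SIZE, mtbf)
    gap = hours_between_failures(FLEET_SIZE, mtbf)
    print(f"{label}: about {per_year:.2f} failures/year, one every {gap:.0f} h")
```

Even with a million-hour MTBF, a large enough population of drives makes failures a routine event, which is why the repair time discussed next matters so much.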

MTTR - More Than True Revival!

 

The other number that will come up later is MTTR. It is as important for calculating overall system availability as MTBF, if not more so. Mean Time To Repair represents the number of hours your organization will require to replace a failed device. Achieving true High Availability depends heavily on the quality and timing of hardware support services. The number of hours required to notice the failure, notify support, dispatch a technician and perform the hardware replacement is so critical that your business depends on it, even if you do not know it. Does this sound like news to you? Well, the worst part is yet to come. Failures are expected, and the probability of one can be calculated with a certain confidence. Even if your calculations prove to be wrong, it will be a non-event: if nothing failed, will anyone notice? Probably not. Predicting the MTTR, on the other hand, depends on many factors, more than half of them human. Hence the unpredictability.

And, to make things worse, if you have miscalculated the MTTR, a service person not fixing the failed drive in time can have serious consequences. Not replacing a failed member of a RAID set leaves you without a safety net, and another disk failure within the MTTR window will bring your organization, or at least the affected system, to its knees. You will still have to replace the failed drives and then go through the long and painful process of restoring your data from backups. You do have backups, don't you?
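How likely is that second failure? A rough sketch, assuming independent drives with exponentially distributed lifetimes (a common simplification, and not the model used for the calculations later in this paper), estimates the chance that at least one surviving drive fails before the repair is complete. The function name and the 7-drive example are illustrative.

```python
import math


def prob_second_failure(surviving_drives: int, mttr_hours: float,
                        mtbf_hours: float = 1_000_000) -> float:
    """Chance that at least one surviving drive fails within the repair window.

    Assumes independent drives with exponentially distributed lifetimes.
    """
    expected_failures = surviving_drives * mttr_hours / mtbf_hours
    return -math.expm1(-expected_failures)  # 1 - exp(-x), stable for tiny x


# An 8-drive array with one drive already failed leaves 7 survivors.
for mttr in (24, 12, 8):
    p = prob_second_failure(surviving_drives=7, mttr_hours=mttr)
    print(f"MTTR {mttr:>2} h: {p:.4%} chance of a second failure")
```

The probability grows almost linearly with the repair window, so halving the MTTR roughly halves the exposure.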

Availability of the Storage Array

 

So, after all that has been said, how can one determine the true maximum availability of a given storage array? Let us try to discover the truth through an example. Let us examine an imaginary storage array consisting of anywhere between 2 and 8 individual disks. In addition, let us assume that we have room for an optional ninth drive to act as a hardware-based hot standby. The drives within the array will be arranged in a hardware-based RAID level. So as not to exceed the scope of this article, here is a brief description of the RAID levels that will be discussed and used in further calculations.

 

| RAID Level | Description | Valid Drive Number Combinations |
| --- | --- | --- |
| RAID-0 | Striping across all drives in array | 2, 3, 4, 5, 6, 7, 8 |
| RAID-1 | Mirroring of drives in pairs | 2, 4, 6, 8 |
| RAID-0+1 | Striping, then mirroring in pairs | 4, 6, 8 |
| RAID-5 | Striping with parity | 3, 4, 5, 6, 7, 8 |
| RAID-10 | Mirroring in pairs, then striping | 4, 6, 8 |
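The valid drive counts above can be captured in a small lookup, which is handy for rejecting impossible configurations before any availability is calculated. This is only a sketch: the dictionary and the function name `is_valid_configuration` are illustrative, and the limits apply to the 2-to-8-disk example array, not to RAID in general.

```python
# Valid drive counts per RAID level for the example array of 2 to 8 disks.
VALID_DRIVE_COUNTS = {
    "RAID-0": range(2, 9),        # 2, 3, 4, 5, 6, 7, 8
    "RAID-1": range(2, 9, 2),     # 2, 4, 6, 8
    "RAID-0+1": range(4, 9, 2),   # 4, 6, 8
    "RAID-5": range(3, 9),        # 3, 4, 5, 6, 7, 8
    "RAID-10": range(4, 9, 2),    # 4, 6, 8
}


def is_valid_configuration(level: str, drive_count: int) -> bool:
    """Return True if the drive count is allowed for the given RAID level."""
    return drive_count in VALID_DRIVE_COUNTS[level]


print(is_valid_configuration("RAID-5", 3))   # True
print(is_valid_configuration("RAID-1", 3))   # False
```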

Description of the methods used

 

Two general scenarios are considered: "Without hot spare" and "With hot spare", indicating whether or not the array has a hot spare drive available to substitute on the fly for a failed one.

 

IMPORTANT NOTE: Please bear in mind that these calculations do not take into account the recoverability of data. In the case of RAID-0, the data in the array will be lost: due to the nature of data striping, there is no means of recovering the data on the lost disk. After a single disk failure, from the mathematical perspective the number of active disks is going to be x% lower, but from the business perspective the loss is surely going to be 100%.

The general assumption is that RAID is handled in hardware, hence there is no software reliability to take into account. The actual overall availability of the storage system will also be influenced by many more factors, such as controllers, power supplies, data links, switches, etc. These calculations consider storage arrays from the disk drive perspective only.

 

The three tables that follow are based on the same disk assumptions (see assumptions below), but vary the Mean Time To Repair (MTTR) to emulate real-life data centre circumstances (24, 12 and 8 hours).

 

Mathematically speaking, a storage array consists of a number of devices, each having a known probability of an observed event (failure). Depending on the RAID level used to tie the devices together for a single purpose, the overall probability of a system failure is a combination of dependent and/or independent events involving individual devices or groups of devices. The formula for each RAID level is given in the following paragraphs. Now, let's do some math.

Assumptions:

 

1.       Disk Mean Time Between Failures (MTBF) to be 1,000,000 hours (as per Seagate and IBM specifications for 18.2GB and 36.4GB SCSI or FC disk drives)

2.       Mean Time To Repair (MTTR) to be 24, 12 and 8 hours respectively, based on approximately 8/4/2.5 hours to notice the disk failure, 8/4/2.5 hours to dispatch the service technician to the site and 8/4/2.5 hours to replace, test and restore the disk to working condition.

3.       ND (# of drives) is between the lowest possible for the given RAID and 8 + hot spare.

 

Formulae:

 

1.       Annual Failure Rate:

 

 

2.       Reliability:

 

 

3.       Annual Repair Rate:

 

 

4.       Availability:
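The equations for formulae 1 to 4 are not reproduced in this text. The forms below are a reconstruction, inferred from the Disk Data values tabulated for the 24-hour MTTR case (AFR = 0.00876, R = 99.12400%, ARR = 0.01148, A = 98.85243%). They reproduce those values, but the original notation may have differed. A year is taken as 8,760 hours, and MTBF and MTTR are in hours.

```latex
\begin{align*}
\mathrm{AFR} &= \frac{8760}{\mathrm{MTBF}} \\
R &= 1 - \mathrm{AFR} \\
\mathrm{ARR} &= \mathrm{AFR} + R \cdot \frac{\mathrm{MTTR}}{8760} \\
A &= 1 - \mathrm{ARR} = R \left( 1 - \frac{\mathrm{MTTR}}{8760} \right)
\end{align*}
```

With MTBF = 1,000,000 hours and MTTR = 24 hours this gives AFR = 0.00876, R = 0.99124, ARR ≈ 0.011476 and A ≈ 0.988524.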

 

 

5.       Maximum Achievable Array System Availability

 

·         RAID 0

 

 

All drives are represented as independent (non-redundant) events.

 

·         RAID 1

 

       

 

All drives are represented as drive groups (independent events) consisting of 2 drives each (dependent events). The number of drive groups is equal to half of all drives (ND / 2), as each drive is mirrored.

 

·         RAID 0+1

 

       

 

All drives are represented as two drive groups (dependent events), each consisting of half of all drives (ND / 2) (independent events within the group), as the active group is mirrored to the passive group.

 

·         RAID 5

 

 

All drives are represented as two drive groups (dependent events): one group of all but one drive (ND - 1) (independent events within the group) and one group consisting of the single remaining (parity) drive.

 

 

·         RAID 10

 

 

All drives are represented as ND / 2 drive groups (independent events) consisting of two drives each (source + mirror) (dependent events within the group).
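The per-level equations are not reproduced in this text. The forms below are a reconstruction inferred from the "Without hot spare" values tabulated for the 24-hour MTTR case, following the grouping described for each level: independent events add, dependent events multiply. ARR is the per-disk value from formula 3 and ND is the number of drives. The original notation may have differed.

```latex
\begin{align*}
A_{\text{RAID 0}}   &= 1 - ND \cdot \mathrm{ARR} \\
A_{\text{RAID 1}}   &= 1 - \frac{ND}{2} \cdot \mathrm{ARR}^{2} \\
A_{\text{RAID 0+1}} &= 1 - \left( \frac{ND}{2} \cdot \mathrm{ARR} \right)^{2} \\
A_{\text{RAID 5}}   &= 1 - (ND - 1) \cdot \mathrm{ARR} \cdot \mathrm{ARR} \\
A_{\text{RAID 10}}  &= 1 - \frac{ND}{2} \cdot \mathrm{ARR}^{2}
\end{align*}
```

As a check, with ARR ≈ 0.011476 a 4-drive RAID 0+1 gives 1 - (2 x 0.011476)^2 ≈ 99.94732%, and a 4-drive RAID 5 gives 1 - 3 x 0.011476^2 ≈ 99.96049%, matching the table.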

 

6.       Maximum Achievable Array System Availability calculations taking into account the hot spare drive
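The hot spare equation is not reproduced in this text. For the redundant RAID levels, the "With hot spare" values tabulated for the 24-hour MTTR case are consistent with treating the spare as one more dependent event, that is, multiplying the array's unavailability without a spare by one further ARR factor. This is an inferred form, not necessarily the original notation:

```latex
A_{\text{with spare}} = 1 - \left( 1 - A_{\text{RAID}} \right) \cdot \mathrm{ARR}
```

For a 2-drive RAID 1 this gives 1 - ARR^3 ≈ 99.99985%, as tabulated. The RAID 0 column of the 24-hour table is identical with and without the spare, so this form does not appear to have been applied to RAID 0 there.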

 

 

Overall Conclusions

 

The maximum achievable availability for a storage array system with an MTTR of 8 hours is 99.99991%, for a 2-disk RAID-1 configuration with hot spare (effectively having two extra disks for one active one).

As this is probably not acceptable in most real-life scenarios, we will consider a more likely scenario here.

 

The best results with a full array (8 disks + spare, 8-hour MTTR) are achieved by using RAID-10 or RAID-1 (99.99964%), then RAID-5 (99.99937%), then RAID-0+1 (99.99856%), and finally RAID-0(*) (99.92527%).

Two major conclusions can be drawn from this data:

 

1)       It is not possible to achieve 99.9999% (six nines) availability with a full storage array as described, regardless of the RAID level. To do so, it would be necessary to introduce more than one hot standby drive and to decrease the MTTR by integrating system monitoring tools, keeping spare parts on site and having a technician on call 24x7 with a response time of 1 hour or less. Under such conditions, any RAID level other than RAID-0 will achieve 99.99999% or higher availability.

2)       Decreasing the number of drives and increasing the number of hot standby drives in a storage array increases availability, all other factors remaining the same. The more redundancy you build in, the more available the system you are going to achieve.

 

A popular 8-disk storage array on the market, in its out-of-the-box configuration (RAID-5, 7+1+1 hot standby), can be rated at approximately 99.999% available with an MTTR between 8 and 24 hours.
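The figures quoted in these conclusions can be checked with a short sketch. The formulas in it are inferred from the tabulated values rather than taken from the original equations, and the function names are illustrative. The hot spare is modelled as one extra ARR factor on the array's unavailability. The RAID-0(*) figure quoted above matches applying that factor to RAID-0 as well, whereas the 24-hour table leaves its RAID 0 column unchanged; this sketch follows the quoted figure.

```python
HOURS_PER_YEAR = 8760


def disk_arr(mttr_hours: float, mtbf_hours: float = 1_000_000) -> float:
    """Per-disk unavailability (the paper's ARR), inferred from the Disk Data."""
    afr = HOURS_PER_YEAR / mtbf_hours
    return afr + (1 - afr) * mttr_hours / HOURS_PER_YEAR


def array_unavailability(level: str, nd: int, arr: float) -> float:
    """Unavailability without a hot spare (inferred from the 24-hour table)."""
    if level == "RAID-0":
        return nd * arr
    if level in ("RAID-1", "RAID-10"):
        return (nd / 2) * arr ** 2
    if level == "RAID-0+1":
        return ((nd / 2) * arr) ** 2
    if level == "RAID-5":
        return (nd - 1) * arr ** 2
    raise ValueError(f"unknown RAID level: {level}")


def availability_with_spare(level: str, nd: int, arr: float) -> float:
    """Assumption: one hot spare multiplies the unavailability by ARR."""
    return 1 - array_unavailability(level, nd, arr) * arr


arr_8h = disk_arr(mttr_hours=8)
print(f"2-disk RAID-1 + spare: {availability_with_spare('RAID-1', 2, arr_8h):.5%}")
for level in ("RAID-10", "RAID-1", "RAID-5", "RAID-0+1", "RAID-0"):
    print(f"8-disk {level} + spare: {availability_with_spare(level, 8, arr_8h):.5%}")
```

Running the same function with `disk_arr(mttr_hours=24)` for the 8-disk RAID-5 case gives about 99.99894%, so the out-of-the-box configuration stays close to 99.999% across the 8-to-24-hour range.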

 

 


 

 

Nash R. Radovanovic is an IT consultant working in Toronto, Ontario. Nash can be reached at nash at the domain bgdsoftware.com

 


Calculating the Storage Array System Availability for the 24 hour MTTR

 

Maximum Achievable Storage Array System Availability

Disk Data

| Parameter | Value |
| --- | --- |
| Disk MTBF | 1,000,000 h |
| Annual Failure Rate (AFR) | 0.00876 |
| Reliability (R) | 99.12400% |
| Disk MTTR | 24 h |
| Annual Repair Rate (ARR) | 0.01148 |
| Availability (A) | 98.85243% |
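The Disk Data values can be reproduced with a few lines. This is a sketch under an assumption: the relationships between AFR, R, ARR and A are inferred from the numbers above, since the original equations are not reproduced in this text. The function name `disk_figures` is illustrative.

```python
HOURS_PER_YEAR = 8760


def disk_figures(mtbf_hours: float, mttr_hours: float) -> dict:
    """Per-disk figures, using relationships inferred from the tabulated values."""
    afr = HOURS_PER_YEAR / mtbf_hours                       # Annual Failure Rate
    reliability = 1 - afr                                   # Reliability (R)
    arr = afr + reliability * mttr_hours / HOURS_PER_YEAR   # Annual Repair Rate
    availability = 1 - arr                                  # Availability (A)
    return {"AFR": afr, "R": reliability, "ARR": arr, "A": availability}


figures = disk_figures(mtbf_hours=1_000_000, mttr_hours=24)
print(f"AFR = {figures['AFR']:.5f}")
print(f"R   = {figures['R']:.5%}")
print(f"ARR = {figures['ARR']:.5f}")
print(f"A   = {figures['A']:.5%}")
```

Changing `mttr_hours` to 12 or 8 gives the per-disk inputs for the other two MTTR scenarios.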

 

 

 

 

 

 

 

 

 

 

 

 

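Without hot spare

| # of drives | RAID 0 | RAID 1 | RAID 0+1 | RAID 5 | RAID 10 |
| --- | --- | --- | --- | --- | --- |
| 2 | 97.70485% | 99.98683% | - | - | - |
| 3 | 96.55728% | - | - | 99.97366% | - |
| 4 | 95.40971% | 99.97366% | 99.94732% | 99.96049% | 99.97366% |
| 5 | 94.26214% | - | - | 99.94732% | - |
| 6 | 93.11456% | 99.96049% | 99.88148% | 99.93415% | 99.96049% |
| 7 | 91.96699% | - | - | 99.92098% | - |
| 8 | 90.81942% | 99.94732% | 99.78929% | 99.90782% | 99.94732% |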
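The "Without hot spare" values for the 24-hour MTTR case can be regenerated with the sketch below. The formulas are inferred from the tabulated numbers, not copied from the original equations, and the function names are illustrative. Drive counts that are not valid for a level produce no value.

```python
HOURS_PER_YEAR = 8760
LEVELS = ("RAID 0", "RAID 1", "RAID 0+1", "RAID 5", "RAID 10")


def disk_arr(mtbf_hours: float = 1_000_000, mttr_hours: float = 24) -> float:
    """Per-disk unavailability (the paper's ARR), inferred from the Disk Data."""
    afr = HOURS_PER_YEAR / mtbf_hours
    return afr + (1 - afr) * mttr_hours / HOURS_PER_YEAR


def array_availability(level: str, nd: int, arr: float):
    """Array availability without a hot spare, or None for an invalid drive count."""
    even = nd % 2 == 0
    if level == "RAID 0":
        return 1 - nd * arr
    if level in ("RAID 1", "RAID 10"):
        minimum = 2 if level == "RAID 1" else 4
        return 1 - (nd / 2) * arr ** 2 if even and nd >= minimum else None
    if level == "RAID 0+1":
        return 1 - ((nd / 2) * arr) ** 2 if even and nd >= 4 else None
    if level == "RAID 5":
        return 1 - (nd - 1) * arr ** 2 if nd >= 3 else None
    raise ValueError(f"unknown RAID level: {level}")


arr = disk_arr()
print("drives", *LEVELS, sep="\t")
for nd in range(2, 9):
    cells = []
    for level in LEVELS:
        value = array_availability(level, nd, arr)
        cells.append(f"{value:.5%}" if value is not None else "-")
    print(nd, *cells, sep="\t")
```

Passing `mttr_hours=12` or `mttr_hours=8` to `disk_arr` produces the corresponding rows for the shorter repair times.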

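With hot spare

| # of drives | RAID 0 | RAID 1 | RAID 0+1 | RAID 5 | RAID 10 |
| --- | --- | --- | --- | --- | --- |
| 2 | 97.70485% | 99.99985% | - | - | - |
| 3 | 96.55728% | - | - | 99.99970% | - |
| 4 | 95.40971% | 99.99970% | 99.99940% | 99.99955% | 99.99970% |
| 5 | 94.26214% | - | - | 99.99940% | - |
| 6 | 93.11456% | 99.99955% | 99.99864% | 99.99924% | 99.99955% |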

 

 

7