by Nash R. Radovanovic
Are You Available?
What is availability? Or, a harder question still: what is High Availability (HA)?
Try searching the web and you will get my point pretty quickly. Millions of pages match the phrase, or so it appears, and vendors that have paid for advertising sit at the top of the list. What you end up with is more than you bargained for: information clutter. So what can you unearth about HA in the end? What is "High Availability"? The most patient of us may find out, but the majority will certainly give up before learning the true meaning of the phrase.
So many people use the words "High Availability" nowadays that it makes you wonder whether anyone actually knows what they mean. A quick check of a dictionary reveals that the word available means "present and ready for use". Furthermore, though not quite along the same research path, the word availability appears to be a member of the "ilities" family. No matter whom you talk to, vendors and the like will tell you about their latest and greatest technological silver bullet for all your "ilities": Scalability, Availability, Serviceability, an "ility" for this and that. Very few will mention any numbers to quantify the claim, and very few customers know how to define their "Highly Available" requirements anyhow. It is not uncommon to hear veterans of the IT field refer to their solution as "Highly Available" and then expand on it: "from 9 to 5, Monday to Friday". High Availability as a concept is extremely simple but, for some reason, hard for most people to truly understand.
So what does it really mean? The phrase "High Availability" should be used to describe any system or technological solution that can successfully service its business tasks more than 99% of the time over the course of a year. No ifs, buts or elses. To be mathematically correct and quantifiable, this translates into a maximum of 3.65 days (87.6 hours) of accumulated system downtime or unavailability, for any reason, during one year of service.
So, to reflect on the previously mentioned example, "9 to 5, Monday to Friday" does not qualify as anything remotely HA. Such a system can be characterized as "Business Critical" for a given time frame, but definitely not as Highly Available. The methods used to ensure the availability of a Business Critical system during that time frame may be similar to the methods used to achieve HA, but that is no justification for calling the solution Highly Available. Keeping a system available during core business hours is much easier, and requires far fewer resources, than achieving a truly HA system. Use the HA attribute very carefully in your organization, as inappropriate use of it will generate some dangerous misconceptions. First, it may mislead management into believing that a particular system is truly HA when in fact it is not. Second, it may mislead the whole organization into believing that achieving HA is easy and possible with limited resources.
Both misconceptions are dangerous and can have negative consequences for both your organization and yourself when the first unplanned downtime hits. To find out what your allowable cumulative annual downtime is, here is a handy table.
| Availability | Allowable downtime (on annual basis) |
|---|---|
| 99% (two nines) | 3.65 days / 87.6 hours / 5,256 minutes |
| 99.9% (three nines) | 0.365 days / 8.76 hours / 525.6 minutes |
| 99.99% (four nines) | 0.876 hours / 52.56 minutes |
| 99.999% (five nines) | 5.256 minutes / 315.36 seconds |
| 99.9999% (six nines) | 0.5256 minutes / 31.536 seconds |
| 99.99999% (seven nines) | 3.1536 seconds |

To see all this in action and calculate your own figures, please visit the Storage Availability Calculator.
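The arithmetic behind the table is simple enough to sketch in a few lines. The snippet below is only an illustration, assuming a non-leap year of 8,760 hours; the function name `allowable_downtime_hours` is hypothetical and is not taken from the calculator.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def allowable_downtime_hours(availability_pct: float) -> float:
    """Return the maximum cumulative downtime per year, in hours."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


# Print the yearly downtime budget for a few availability targets
for pct in (99, 99.9, 99.99, 99.999):
    hours = allowable_downtime_hours(pct)
    print(f"{pct}% -> {hours:.4f} hours ({hours * 60:.3f} minutes)")
```

Feeding in your own target percentage gives the yearly downtime budget that has to cover every outage, planned or not.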

Why storage arrays? Because storage is at the heart of any Information Technology (IT) project. That is where the things that matter live: it is where you store your data and logic, and it is what makes or breaks any solution. With the amount of data in the enterprise increasing almost exponentially, the amount of required storage has been catapulted to new highs as well. What used to be 1 MB is now 1 GB. What used to fit on a diskette now cannot fit on a CD. The whole data center capacity of 20 years ago now fits into an MP3 player or a memory stick for your camera. With such capacity requirements came the need for huge quantities of hard disks. As with anything that comes in large numbers, both the good and the bad became more visible.
One inevitable thing that happens to any disk device is the eventual death of the head or the read/write mechanism. Generally speaking, the failure rate of a single disk drive over its life can be plotted as a "bath tub" diagram, which gets its name, as you may have guessed, from its bath tub shape (see figure below). The majority of mass-produced devices fail at a constant rate during their expected useful lifetime. However, their reliability is significantly lower during the so-called "burn-in" period, as well as when they reach the end of their expected lifetime. During those periods the failure rate changes exponentially with time.
One significant, commonly mentioned number reflects the failure rate of devices during their expected life span. It is called Mean Time Between Failures, or MTBF. During the useful life this number is a constant (depicted in our "bath tub" diagram by the straight horizontal line). It is a statistical average over a large population of devices, not the expected life of any single one: in a group of 100,000 devices operating at the same time, one failure can be expected roughly every MTBF / 100,000 hours. MTBF is expressed in hours and therefore enters directly into any further calculation of the reliability or availability of a single device or a group of devices. It is usually quoted by the manufacturers of hard disks and should guide you, as the consumer, when making an acquisition decision.
Recently, the MTBF quoted by disk manufacturers has been around 500,000 hours for IDE drives and around 1,000,000 hours for SCSI drives. These numbers are a huge improvement over the 100,000 to 150,000 hours quoted for the same classes of disk as recently as 10 years ago. Only with these technological improvements did it become possible to even think of building storage arrays of the size we see today with reasonable reliability, and to expect such systems to be highly available.
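To make the MTBF figures more tangible, here is a small sketch of what they imply for a group of drives, assuming the constant failure rate of the flat part of the "bath tub" curve. The function names and the fleet size of 100 drives are illustrative assumptions only.

```python
HOURS_PER_YEAR = 8760  # non-leap year (assumption)


def hours_between_fleet_failures(mtbf_hours: float, drives: int) -> float:
    """Average time between failures anywhere in a group of drives."""
    return mtbf_hours / drives


def expected_failures_per_year(mtbf_hours: float, drives: int) -> float:
    """Expected number of drive failures per year in the group."""
    return drives * HOURS_PER_YEAR / mtbf_hours


# 100 SCSI drives at 1,000,000 hours MTBF versus 100 IDE drives at 500,000
print(expected_failures_per_year(1_000_000, 100))  # 0.876
print(expected_failures_per_year(500_000, 100))    # 1.752
```

Even with a million-hour MTBF, a modest fleet should expect close to one failed drive per year, which is why the repair process matters as much as the drive specification.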
The other number that will be mentioned later on is MTTR. It is as important for calculating overall system availability as the MTBF, if not more so. Mean Time To Repair represents the number of hours your organization requires to replace a failed device. Achieving true High Availability depends critically on the quality and timing of hardware support services. The number of hours required to notice the failure, notify support, dispatch a technician and perform the hardware replacement is so critical that your business in fact depends on it, even if you do not know it. Does this sound like news to you? Well, the worst part is yet to come.
Failures are expected, and the probability of one can be calculated with a certain confidence. Even if your failure calculations prove to be wrong, it will most likely be a non-event: nothing failed, so will anyone notice? Probably not. Predicting the MTTR, on the other hand, depends on many factors, more than half of them human. Hence the unpredictability. And, to make things even worse, if you have miscalculated the MTTR, a service person not fixing the failed drive in time can have serious consequences. Not replacing a failed member of a RAID set leaves you without a safety net, and another disk failure within the MTTR time frame will bring your organization, or at least the affected system, to its knees. You will still have to replace the failed drives, and then go through the long and painful process of restoring your data from backups. You do have backups, don't you?
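The danger of a long repair window can be illustrated with a rough estimate of the chance that a second drive fails before the first one is replaced. This is a sketch only: it assumes independent drives with a constant failure rate (an exponential model, which is an assumption and not a formula from this article), and the function name and the 72-hour case are hypothetical.

```python
import math


def second_failure_probability(remaining_drives: int,
                               mttr_hours: float,
                               mtbf_hours: float) -> float:
    """Chance that at least one surviving drive fails within the repair window."""
    expected_failures = remaining_drives * mttr_hours / mtbf_hours
    # 1 - exp(-x), written with expm1 to keep precision for very small x
    return -math.expm1(-expected_failures)


# 8-drive array with one drive already failed, 1,000,000 hours MTBF
for mttr in (8, 24, 72):
    p = second_failure_probability(7, mttr, 1_000_000)
    print(f"MTTR {mttr:>2} h: {p:.6%}")
```

The probability grows almost linearly with the repair time, so every hour shaved off the MTTR shrinks the exposure by the same proportion.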
So, after all that has been said, how can one determine the true maximum availability of a given storage array? Let us try to discover the truth through an example. Consider an imaginary storage array consisting of anywhere between 2 and 8 individual disks. In addition, let us assume that we have room for an optional ninth drive to act as a hardware-based hot standby. The drives within the array are arranged in a hardware-based RAID configuration. So as not to exceed the scope of this article, here is only a brief description of the RAID levels that will be discussed and used in the calculations.
| RAID Level | Description | Valid Drive Number Combinations |
|---|---|---|
| RAID-0 | Striping across all drives in array | 2, 3, 4, 5, 6, 7, 8 |
| RAID-1 | Mirroring of drives in pairs | 2, 4, 6, 8 |
| RAID-0+1 | Striping, then mirroring in pairs | 4, 6, 8 |
| RAID-5 | Striping with parity | 3, 4, 5, 6, 7, 8 |
| RAID-10 | Mirroring in pairs, then striping | 4, 6, 8 |

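The drive-count rules in the table can be captured in a small lookup, which is handy when checking a proposed configuration. This is a sketch for illustration; the dictionary and function names are hypothetical.

```python
# Valid drive counts per RAID level for the 2-to-8-disk example array
VALID_DRIVE_COUNTS = {
    "RAID-0": range(2, 9),       # any count from 2 to 8
    "RAID-1": range(2, 9, 2),    # mirrored pairs: 2, 4, 6, 8
    "RAID-0+1": range(4, 9, 2),  # two equal striped halves: 4, 6, 8
    "RAID-5": range(3, 9),       # at least 3 drives
    "RAID-10": range(4, 9, 2),   # at least two mirrored pairs: 4, 6, 8
}


def is_valid_configuration(level: str, drives: int) -> bool:
    """Return True if the drive count is allowed for the given RAID level."""
    return drives in VALID_DRIVE_COUNTS[level]


print(is_valid_configuration("RAID-5", 3))  # True
print(is_valid_configuration("RAID-1", 3))  # False
```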
Two general scenarios are considered, "Without hot spare" and "With hot spare", indicating whether or not the array has a hot spare drive available to substitute on the fly for a failed one.
The general assumption is that RAID is handled by hardware, hence there is no software reliability to take into account. The actual overall availability of a storage system is also influenced by many more factors, such as controllers, power supplies, data links, switches, etc. These calculations take into account storage arrays from the disk drive perspective only.
The three tables that follow are based on the same disk assumptions (see assumptions below), but vary the Mean Time To Repair (MTTR) to emulate real-life data centre circumstances (24, 12 and 8 hours).
Mathematically speaking, a storage array consists of a number of devices, each having a known probability of an observed event (a failure). Depending on the RAID level used to tie all the devices together for a single purpose, the overall probability of a system failure is a combination of dependent and/or independent events involving individual devices or groups of devices. The formula for each RAID level is derived in the following paragraphs. Now, let's do some math.
1. Disk Mean Time Between Failures (MTBF) is 1,000,000 hours (as per Seagate and IBM specifications for 18.2 GB and 36.4 GB SCSI or FC disk drives).
2. Mean Time To Repair (MTTR) is 24, 12 and 8 hours respectively, based on 8/4/2.5 hours to notice the disk failure, 8/4/2.5 hours to dispatch the service technician to the site and 8/4/2.5 hours to replace, test and restore the disk to working condition.
3. ND (number of drives) ranges from the lowest possible for the given RAID level up to 8, plus a hot spare.
1. Annual Failure Rate:
![]()
2. Reliability:
![]()
3. Annual Repair Rate:
![]()
4. Availability:
![]()
5. Maximum Achievable Array System Availability
· RAID 0
![]()
All drives represented as independent (non-redundant) events.
· RAID 1
![]()
All drives represented as drive groups (independent events) consisting of 2 drives each (dependent events). Number of drive groups is equal to half of all drives (ND / 2) as each drive is mirrored.
· RAID 0+1

All drives represented as two drive groups (dependent events) consisting of half of all drives each (ND / 2) (independent events within group) as active group is mirrored to passive group.
· RAID 5
![]()
All drives represented as two drive groups (dependent events): one consisting of all drives but one (ND - 1, independent events within the group) and one consisting of the single (parity) drive.
· RAID 10
![]()
All drives represented as ND / 2 drive groups (independent events) consisting of two drives each (source+mirror) (dependent events within group).
6. Maximum Achievable Array System Availability calculations taking into account the hot spare drive
![]()
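The formula images for the items above are not legible in this copy, so what follows is a reconstruction rather than the original notation: a set of expressions, inferred from the disk data and array figures tabulated in this article, that reproduces those figures to the printed precision. The symbol U (the unavailability of a single disk) and the subscripts are introduced here for convenience, and a year is assumed to have 8,760 hours.

```latex
\begin{align*}
AFR &= \frac{8760}{MTBF} \\
R &= 1 - AFR \\
ARR &= AFR + R \cdot \frac{MTTR}{8760} \\
A &= 1 - ARR, \qquad U = 1 - A \\[6pt]
A_{\text{RAID 0}} &= 1 - ND \cdot U \\
A_{\text{RAID 1}} = A_{\text{RAID 10}} &= 1 - \frac{ND}{2}\, U^{2} \\
A_{\text{RAID 0+1}} &= 1 - \left(\frac{ND}{2}\, U\right)^{2} \\
A_{\text{RAID 5}} &= 1 - (ND - 1)\, U^{2} \\[6pt]
A_{\text{with hot spare}} &= 1 - \left(1 - A_{\text{array}}\right) U
  \qquad \text{(redundant levels only; RAID 0 is unchanged)}
\end{align*}
```

For example, with MTBF = 1,000,000 h and MTTR = 24 h these give U of about 0.0114757, so a 4-drive RAID 5 without a hot spare comes out at 1 - 3U^2, or about 99.96049%, which matches the tabulated value.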
The maximum achievable availability for a storage array system with an MTTR of 8 hours is 99.99991%, for a 2-disk RAID-1 configuration with a hot spare (effectively having two extra disks for one active one).
As this is probably not acceptable in most real-life scenarios, we will consider a more likely one.
The best results with a full array (8 disks + spare, with an 8-hour MTTR) are achieved by RAID-10 or RAID-1 (99.99964%), then RAID-5 (99.99937%), then RAID-0+1 (99.99856%), and finally RAID-0 (92.26781%).
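The ranking can be recomputed with a short script. This is a minimal sketch, not the author's calculator: the formulas in it are inferred from the tabulated values in this article (they reproduce them to the printed precision), a year is assumed to have 8,760 hours, and all function names are hypothetical.

```python
HOURS_PER_YEAR = 8760  # non-leap year (assumption)


def disk_unavailability(mtbf_h: float, mttr_h: float) -> float:
    """Fraction of the year a single disk is failed or under repair."""
    afr = HOURS_PER_YEAR / mtbf_h  # annual failure rate
    reliability = 1 - afr
    return 1 - reliability * (1 - mttr_h / HOURS_PER_YEAR)


def array_availability(level: str, nd: int, mtbf_h: float, mttr_h: float,
                       hot_spare: bool = False) -> float:
    """Maximum array availability, from the disk drive perspective only."""
    u = disk_unavailability(mtbf_h, mttr_h)
    if level == "RAID 0":
        return 1 - nd * u  # no redundancy, so a spare changes nothing
    down = {
        "RAID 1": (nd / 2) * u ** 2,      # nd/2 mirrored pairs
        "RAID 10": (nd / 2) * u ** 2,     # same pairing as RAID 1
        "RAID 0+1": ((nd / 2) * u) ** 2,  # two striped halves, mirrored
        "RAID 5": (nd - 1) * u ** 2,      # nd-1 drives and the remaining one
    }[level]
    if hot_spare:
        down *= u  # the spare has to be unavailable as well
    return 1 - down


# Full 8-disk array with hot spare and an 8-hour MTTR
for level in ("RAID 10", "RAID 5", "RAID 0+1", "RAID 0"):
    a = array_availability(level, 8, 1_000_000, 8, hot_spare=True)
    print(f"{level:8s} {a:.5%}")
```

Changing the `mttr_h` argument to 12 or 24 reproduces the other two tables, which makes it easy to see how much of the final figure is owed to the repair process rather than to the drives.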
Two major conclusions can be drawn from this data:
1) It is not possible to achieve 99.9999% (six nines) availability with a fully populated storage array as described, regardless of the RAID level. To do so, it would be necessary to introduce more than one hot standby drive and to decrease the MTTR by integrating system monitoring tools, keeping spare parts on site and having a technician on call 24x7 with a response time of 1 hour or less. Under such conditions, any RAID level other than RAID-0 will achieve 99.99999% or higher availability.
2) Decreasing the number of drives and increasing the number of hot standby drives in a storage array increases availability, all other factors remaining the same. The more redundancy you build in, the more available the system will be.
A popular 8-disk storage array on the market, in its out-of-the-box configuration (RAID-5, 7+1+1 hot standby), can be rated at about 99.999% available with an MTTR between 8 and 24 hours.
Nash R. Radovanovic is an IT consultant working in Toronto, Ontario. Nash can be reached at nash at the domain bgdsoftware.com
Maximum Achievable Storage Array System Availability (disk MTTR 24 h)

| Disk Data | Value | Unit |
|---|---|---|
| Disk MTBF | 1000000 | h |
| Annual Failure Rate (AFR) | 0.00876 | |
| Reliability (R) | 99.12400% | |
| Disk MTTR | 24 | h |
| Annual Repair Rate (ARR) | 0.01148 | |
| Availability (A) | 98.85243% | |

Without hot spare

| # of drives | RAID 0 | RAID 1 | RAID 0+1 | RAID 5 | RAID 10 |
|---|---|---|---|---|---|
| 2 | 97.70485% | 99.98683% | | | |
| 3 | 96.55728% | | | 99.97366% | |
| 4 | 95.40971% | 99.97366% | 99.94732% | 99.96049% | 99.97366% |
| 5 | 94.26214% | | | 99.94732% | |
| 6 | 93.11456% | 99.96049% | 99.88148% | 99.93415% | 99.96049% |
| 7 | 91.96699% | | | 99.92098% | |
| 8 | 90.81942% | 99.94732% | 99.78929% | 99.90782% | 99.94732% |

With hot spare

| # of drives | RAID 0 | RAID 1 | RAID 0+1 | RAID 5 | RAID 10 |
|---|---|---|---|---|---|
| 2 | 97.70485% | 99.99985% | | | |
| 3 | 96.55728% | | | 99.99970% | |
| 4 | 95.40971% | 99.99970% | 99.99940% | 99.99955% | 99.99970% |
| 5 | 94.26214% | | | 99.99940% | |
| 6 | 93.11456% | 99.99955% | 99.99864% | 99.99924% | 99.99955% |
| 7 | 91.96699% | | | 99.99909% | |
| 8 | 90.81942% | 99.99940% | 99.99758% | 99.99894% | 99.99940% |

Maximum Achievable Storage Array System Availability (disk MTTR 12 h)

| Disk Data | Value | Unit |
|---|---|---|
| Disk MTBF | 1000000 | h |
| Annual Failure Rate (AFR) | 0.00876 | |
| Reliability (R) | 99.12400% | |
| Disk MTTR | 12 | h |
| Annual Repair Rate (ARR) | 0.01012 | |
| Availability (A) | 98.98821% | |

Without hot spare

| # of drives | RAID 0 | RAID 1 | RAID 0+1 | RAID 5 | RAID 10 |
|---|---|---|---|---|---|
| 2 | 97.97643% | 99.98976% | | | |
| 3 | 96.96464% | | | 99.97953% | |
| 4 | 95.95285% | 99.97953% | 99.95905% | 99.96929% | 99.97953% |
| 5 | 94.94107% | | | 99.95905% | |
| 6 | 93.92928% | 99.96929% | 99.90787% | 99.94881% | 99.96929% |
| 7 | 92.91750% | | | 99.93858% | |
| 8 | 91.90571% | 99.95905% | 99.83621% | 99.92834% | 99.95905% |

With hot spare

| # of drives | RAID 0 | RAID 1 | RAID 0+1 | RAID 5 | RAID 10 |
|---|---|---|---|---|---|
| 2 | 97.97643% | 99.99990% | | | |
| 3 | 96.96464% | | | 99.99979% | |
| 4 | 95.95285% | 99.99979% | 99.99959% | 99.99969% | 99.99979% |
| 5 | 94.94107% | | | 99.99959% | |
| 6 | 93.92928% | 99.99969% | 99.99907% | 99.99948% | 99.99969% |
| 7 | 92.91750% | | | 99.99938% | |
| 8 | 91.90571% | 99.99959% | 99.99834% | 99.99927% | 99.99959% |

Maximum Achievable Storage Array System Availability (disk MTTR 8 h)

| Disk Data | Value | Unit |
|---|---|---|
| Disk MTBF | 1000000 | h |
| Annual Failure Rate (AFR) | 0.00876 | |
| Reliability (R) | 99.12400% | |
| Disk MTTR | 8 | h |
| Annual Repair Rate (ARR) | 0.00967 | |
| Availability (A) | 99.03348% | |

Without hot spare

| # of drives | RAID 0 | RAID 1 | RAID 0+1 | RAID 5 | RAID 10 |
|---|---|---|---|---|---|
| 2 | 98.06695% | 99.99066% | | | |
| 3 | 97.10043% | | | 99.98132% | |
| 4 | 96.13390% | 99.98132% | 99.96263% | 99.97197% | 99.98132% |
| 5 | 95.16738% | | | 99.96263% | |
| 6 | 94.20085% | 99.97197% | 99.91592% | 99.95329% | 99.97197% |
| 7 | 93.23433% | | | 99.94395% | |
| 8 | 92.26781% | 99.96263% | 99.85053% | 99.93461% | 99.96263% |

With hot spare

| # of drives | RAID 0 | RAID 1 | RAID 0+1 | RAID 5 | RAID 10 |
|---|---|---|---|---|---|
| 2 | 98.06695% | 99.99991% | | | |
| 3 | 97.10043% | | | 99.99982% | |
| 4 | 96.13390% | 99.99982% | 99.99964% | 99.99973% | 99.99982% |
| 5 | 95.16738% | | | 99.99964% | |
| 6 | 94.20085% | 99.99973% | 99.99919% | 99.99955% | 99.99973% |
| 7 | 93.23433% | | | 99.99946% | |
| 8 | 92.26781% | 99.99964% | 99.99856% | 99.99937% | 99.99964% |




Last update on February 5, 2007. © 1996-2007 Nash R. Radovanovic for BGD Software Inc. All Rights Reserved.