Skip to main content

High Availability

High availability is a characteristic of a system, which describes the duration (length of time) for which the system is operational. So when it is said that some service has 99% availability across a year, it means that from across entire duration in a year (24*365=8760 hours, not accounting for leap years), the service would be operational for 8672.4 hours. A highly available system will be one which is operational for minimum duration within the specified length of time (usually a year).

The literal definition of availability is

Ao = up time / total time.

This equation is not practically useful, but if (total time - down time) is substituted for up time then you have

Ao = (total time - down time) / total time.

Determining tolerable down time is practical. From that, the required availability may be easily calculated.

High availability system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.

There are three principles of high availability engineering. They are:

  1. Elimination of single points of failure. This means adding redundancy to the system so that failure of a component does not mean failure of the entire system.
  2. Reliable crossover. In multithreaded systems, the crossover point itself tends to become a single point of failure. High availability engineering must provide for reliable crossover.
  3. Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure. But the maintenance activity must.

Modernization has resulted in an increased reliance on these systems. For example, hospitals and data centers require high availability of their systems to perform routine daily activities. Availability refers to the ability of the user community to obtain a service or good, access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is - from the users point of view - unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.

Scheduled and unscheduled downtime

A distinction can be made between scheduled and unscheduled downtime. Typically, scheduled downtime is a result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to system software that require a reboot or system configuration changes that only take effect upon a reboot. In general, scheduled downtime is usually the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as a hardware or software failure or environmental anomaly. Examples of unscheduled downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature related shutdown, logically or physically severed network connections, security breaches, or various application, middleware, and operating system failures.

If users can be warned away from scheduled downtimes, then the distinction is useful. But if the requirement is for true high availability, then downtime is downtime whether or not it is scheduled.

Many computing sites exclude scheduled downtime from availability calculations, assuming that it has little or no impact upon the computing user community. By doing this, they can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements. For certain systems, scheduled downtime does not matter, for example system downtime at an office building after everybody has gone home for the night.

Percentage calculation

Availability is usually expressed as a percentage of uptime in a given year. The following table shows the downtime that will be allowed for a particular percentage of availability, presuming that the system is required to operate continuously. Service level agreements often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The following table shows the translation from a given availability percentage to the corresponding amount of time a system would be unavailable.

 

Availability % Downtime per year Downtime per month Downtime per week Downtime per day
90% ("one nine") 36.5 days 72 hours 16.8 hours 2.4 hours
95% 18.25 days 36 hours 8.4 hours 1.2 hours
97% 10.96 days 21.6 hours 5.04 hours 43.2 minutes
98% 7.30 days 14.4 hours 3.36 hours 28.8 minutes
99% ("two nines") 3.65 days 7.20 hours 1.68 hours 14.4 minutes
99.5% 1.83 days 3.60 hours 50.4 minutes 7.2 minutes
99.8% 17.52 hours 86.23 minutes 20.16 minutes 2.88 minutes
99.9% ("three nines") 8.76 hours 43.8 minutes 10.1 minutes 1.44 minutes
99.95% 4.38 hours 21.56 minutes 5.04 minutes 43.2 seconds
99.99% ("four nines") 52.56 minutes 4.38 minutes 1.01 minutes 8.66 seconds
99.995% 26.28 minutes 2.16 minutes 30.24 seconds 4.32 seconds
99.999% ("five nines") 5.26 minutes 25.9 seconds 6.05 seconds 864.3 milliseconds
99.9999% ("six nines") 31.5 seconds 2.59 seconds 604.8 milliseconds 86.4 milliseconds
99.99999% ("seven nines") 3.15 seconds 262.97 milliseconds 60.48 milliseconds 8.64 milliseconds
99.999999% ("eight nines") 315.569 milliseconds 26.297 milliseconds 6.048 milliseconds 0.864 milliseconds
99.9999999% ("nine nines") 31.5569 milliseconds 2.6297 milliseconds 0.6048 milliseconds 0.0864 milliseconds

Uptime and availability can be used synonymously, as long as the item being discussed are kept consistent. That is, a system can be up, but its services are not available, as in the case of a network outage. This can also be viewed as, a system can be available to work on, but its services are not up from a functional perspective (as oppose to software service/process perspective). The perspective is important here, whether the item being discussed is the server hardware, server OS, functional service, software service/process...etc. Keep the perspective consistent throughout a discussion, then uptime and availability can be used synonymously.

Percentages of a particular order of magnitude are sometimes referred to by the number of nines or "class of nines" in the digits. For example, electricity that is delivered without interruptions (blackouts, brownouts or surges) 99.999% of the time would have 5 nines reliability, or class five. In particular, the term is used in connection with mainframes or enterprise computing.

In general, the number of nines is not often used by a network engineer when modeling and measuring availability because it is hard to apply in formula. More often, the unavailability expressed as a probability (like 0.00001), or a downtime per year is quoted. Availability specified as a number of nines is often seen in marketing documents.

The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence.

For large amounts of 9s, the "unavailability" index (measure of downtime rather than uptime) is easier to handle. For example, this is why an "unavailability" rather than availability metric is used in hard disk or data link bit error rates.

A formulation of the class of 9s based on a system's unavailability would be

 

 

Source: Wikipedia