High availability
High availability is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period.
There is now more dependence on these systems as a result of modernization. For example, to carry out their regular daily tasks, hospitals and data centers need their systems to be highly available. Availability refers to the ability of the user to access a service or system, whether to submit new work, update or modify existing work, or retrieve the results of previous work. If a user cannot access the system, it is considered unavailable from the user's perspective. The term downtime is generally used to describe periods when a system is unavailable.
Resilience
High availability is a property of network resilience, the ability to "provide and maintain an acceptable level of service in the face of faults and challenges to normal operation." Threats and challenges for services can range from simple misconfiguration through large-scale natural disasters to targeted attacks. As such, network resilience touches a very wide range of topics. In order to increase the resilience of a given communication network, the probable challenges and risks have to be identified, and appropriate resilience metrics have to be defined for the service to be protected.
The importance of network resilience is continuously increasing, as communication networks are becoming a fundamental component in the operation of critical infrastructures. Consequently, recent efforts focus on interpreting and improving network and computing resilience with applications to critical infrastructures. As an example, one can consider as a resilience objective the provisioning of services over the network, rather than the services of the network itself. This may require a coordinated response from both the network and the services running on top of the network.
These services include:
- supporting distributed processing
- supporting network storage
- maintaining communication services such as:
  * video conferencing
  * instant messaging
  * online collaboration
- access to applications and data as needed
Principles
There are three principles of systems design in reliability engineering that can help achieve high availability.
- Elimination of single points of failure. This means adding or building redundancy into the system so that failure of a component does not mean failure of the entire system.
- Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover.
- Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure – but the maintenance activity must.
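As an illustration, the three principles can be sketched as a toy redundant service in Python; the `Replica` class, its methods, and the `serve` helper are hypothetical names invented for this example, not part of any real library:

```python
class Replica:
    """One redundant instance of a service (hypothetical example)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve(replicas, request):
    """Reliable crossover: try each replica in turn, so a single failed
    component does not fail the entire system. Failures are recorded as
    they occur, so maintenance sees them even if the user never does."""
    detected = []
    for replica in replicas:
        try:
            return replica.handle(request), detected
        except RuntimeError:
            detected.append(replica.name)  # failure detected; cross over to the next replica
    raise RuntimeError("all replicas failed")  # no redundancy left


replicas = [Replica("a"), Replica("b")]
replicas[0].healthy = False          # simulate a component failure
result, failures = serve(replicas, "req-1")
print(result, failures)              # the user still gets a response; maintenance sees "a" failed
```

This is only a sketch of the design idea: redundancy hides the failure from the user, while the `detected` list stands in for the monitoring that real systems need so that failed components get repaired.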
Scheduled and unscheduled downtime
A distinction can be drawn between scheduled and unscheduled downtime. If users can be warned away from scheduled downtimes, then the distinction is useful. But if the requirement is for true high availability, then downtime is downtime whether or not it is scheduled.
Many computing sites exclude scheduled downtime from availability calculations, assuming that it has little or no impact upon the computing user community. By doing this, they can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements. For certain systems, scheduled downtime does not matter, for example, system downtime at an office building after everybody has gone home for the night.
Percentage calculation
Availability is usually expressed as a percentage of uptime in a given year. Service level agreements often refer instead to monthly downtime or availability, in order to calculate service credits that match monthly billing cycles. The following table shows the translation from a given availability percentage to the corresponding amount of time a system would be allowed to be unavailable, presuming that the system is required to operate continuously.
| Availability % | Downtime per year | Downtime per quarter | Downtime per month | Downtime per week | Downtime per day |
| --- | --- | --- | --- | --- | --- |
| 90% | 36.53 days | 9.13 days | 73.05 hours | 16.80 hours | 2.40 hours |
| 95% | 18.26 days | 4.56 days | 36.53 hours | 8.40 hours | 1.20 hours |
| 97% | 10.96 days | 2.74 days | 21.92 hours | 5.04 hours | 43.20 minutes |
| 98% | 7.31 days | 43.86 hours | 14.61 hours | 3.36 hours | 28.80 minutes |
| 99% | 3.65 days | 21.9 hours | 7.31 hours | 1.68 hours | 14.40 minutes |
| 99.5% | 1.83 days | 10.98 hours | 3.65 hours | 50.40 minutes | 7.20 minutes |
| 99.8% | 17.53 hours | 4.38 hours | 87.66 minutes | 20.16 minutes | 2.88 minutes |
| 99.9% | 8.77 hours | 2.19 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
| 99.95% | 4.38 hours | 65.7 minutes | 21.92 minutes | 5.04 minutes | 43.20 seconds |
| 99.99% | 52.60 minutes | 13.15 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| 99.995% | 26.30 minutes | 6.57 minutes | 2.19 minutes | 30.24 seconds | 4.32 seconds |
| 99.999% | 5.26 minutes | 1.31 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
| 99.9999% | 31.56 seconds | 7.89 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
| 99.99999% | 3.16 seconds | 0.79 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
| 99.999999% | 315.58 milliseconds | 78.89 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
| 99.9999999% | 31.56 milliseconds | 7.89 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |
| 99.99999999% | 3.16 milliseconds | 788.40 microseconds | 262.80 microseconds | 60.48 microseconds | 8.64 microseconds |
| 99.999999999% | 315.58 microseconds | 78.84 microseconds | 26.28 microseconds | 6.05 microseconds | 864.00 nanoseconds |
| 99.9999999999% | 31.56 microseconds | 7.88 microseconds | 2.63 microseconds | 604.81 nanoseconds | 86.40 nanoseconds |
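The table values follow from multiplying the unavailable fraction by the length of each period; a minimal sketch (the 365.25-day average year is an assumption, but it matches the table's figures):

```python
def downtime_seconds(availability_pct, period_seconds):
    """Allowed downtime within a period for a given availability percentage."""
    return (1 - availability_pct / 100) * period_seconds

DAY = 24 * 3600       # 86,400 seconds
YEAR = 365.25 * DAY   # average year, as the table assumes

# 99.99% ("four nines"): about 52.6 minutes per year, 8.64 seconds per day
minutes_per_year = downtime_seconds(99.99, YEAR) / 60
seconds_per_day = downtime_seconds(99.99, DAY)
print(round(minutes_per_year, 2), round(seconds_per_day, 2))
```

The same function reproduces any cell of the table by swapping in the appropriate period length.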
The terms uptime and availability are often used interchangeably but do not always refer to the same thing. For example, a system can be "up" with its services not "available" in the case of a network outage. Or a system undergoing software maintenance can be "available" to be worked on by a system administrator, but its services do not appear "up" to the end user or customer. The subject of the terms is thus important here: whether the focus of a discussion is the server hardware, server OS, functional service, software service/process, or similar, it is only if there is a single, consistent subject of the discussion that the words uptime and availability can be used synonymously.
Five-by-five mnemonic
A simple mnemonic rule states that 5 nines allows approximately 5 minutes of downtime per year. Variants can be derived by multiplying or dividing by 10: 4 nines is 50 minutes and 3 nines is 500 minutes. In the opposite direction, 6 nines is 0.5 minutes and 7 nines is 3 seconds.
"Powers of 10" trick
Another memory trick to calculate the allowed downtime duration for an n-nines availability percentage is to use the formula 8.64 × 10^(4−n) seconds per day. For example, 90% (n = 1) yields the exponent 4 − 1 = 3, and therefore the allowed downtime is 8.64 × 10^3 = 8,640 seconds per day. Also, 99.999% (n = 5) gives the exponent 4 − 5 = −1, and therefore the allowed downtime is 8.64 × 10^−1 = 0.864 seconds per day.
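Assuming the trick is the 8.64 × 10^(4−n)-seconds-per-day rule (which agrees with the per-day column of the table above), it can be sketched as:

```python
def allowed_seconds_per_day(n_nines):
    """Powers-of-10 trick: 8.64 * 10**(4 - n) seconds of downtime per day."""
    return 8.64 * 10 ** (4 - n_nines)

print(allowed_seconds_per_day(1))  # 90%     -> 8,640 seconds per day (10% of 86,400)
print(allowed_seconds_per_day(5))  # 99.999% -> 0.864 seconds per day
```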
"Nines"
Percentages of a particular order of magnitude are sometimes referred to by the number of nines or "class of nines" in the digits. For example, electricity that is delivered without interruptions 99.999% of the time would have 5 nines reliability, or class five. In particular, the term is used in connection with mainframes or enterprise computing, often as part of a service-level agreement.
Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then "five", so 99.95% is "three nines five", abbreviated 3N5. This is casually referred to as "three and a half nines", but this is incorrect: a 5 is only a factor of 2, while a 9 is a factor of 10, so a 5 is worth only about 0.3 nines: 99.95% availability is 3.3 nines, not 3.5 nines. More simply, going from 99.9% availability to 99.95% availability is a factor of 2, but going from 99.95% to 99.99% availability is a factor of 5, over twice as much.
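The factor-of-2 versus factor-of-5 point can be checked by comparing unavailability ratios; a quick sketch:

```python
import math

def unavailability(availability_pct):
    """Fraction of time the system is down."""
    return 1 - availability_pct / 100

# 99.9% -> 99.95% halves the downtime (a factor of 2) ...
print(unavailability(99.9) / unavailability(99.95))    # ~2.0
# ... but 99.95% -> 99.99% cuts it by a factor of 5
print(unavailability(99.95) / unavailability(99.99))   # ~5.0
# so 99.95% corresponds to about 3.3 nines, not "three and a half"
print(round(-math.log10(unavailability(99.95)), 1))    # 3.3
```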
A formulation of the class of 9s c based on a system's unavailability x would be
c = ⌊−log₁₀ x⌋.
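In code, with the unavailability x given as a probability, the class of 9s works out to the floor of −log₁₀(x); a minimal sketch:

```python
import math

def class_of_nines(unavailability):
    """Class c = floor(-log10(x)) for unavailability x (a probability)."""
    return math.floor(-math.log10(unavailability))

print(class_of_nines(0.0005))  # 99.95% available  -> class 3
print(class_of_nines(2e-5))    # 99.998% available -> class 4
```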
A similar measurement is sometimes used to describe the purity of substances.
In general, the number of nines is not often used by a network engineer when modeling and measuring availability, because it is hard to apply in formulas. More often, the unavailability expressed as a probability, or a downtime per year, is quoted. Availability specified as a number of nines is often seen in marketing documents. The use of the "nines" has also been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence. For large numbers of 9s, the "unavailability" index is easier to handle; this is why an unavailability metric rather than an availability metric is used for hard-disk and data-link bit error rates.
Sometimes the humorous term "nine fives" is used to contrast with "five nines", though this is not an actual goal, but rather a sarcastic reference to something totally failing to meet any reasonable target.