Fault tolerance
Fault tolerance is the ability of a system to contain the propagation of faults. Faults may manifest as errors causing incorrect system state. Failure tolerant systems mask errors and maintain failure-free operation in the presence of one or more faulty components. This capability is essential for high-availability, mission-critical, or even life-critical systems.
Fault tolerance specifically refers to a system's capability to handle faults without any degradation or downtime. In the event of an error, end-users remain unaware of any issues. Conversely, a system that experiences errors with some interruption in service or graceful degradation of performance is termed 'resilient'. In resilience, the system adapts to the error, maintaining service but acknowledging a certain impact on performance.
Typically, fault tolerance describes computer systems, ensuring the overall system remains functional despite hardware or software issues. Non-computing examples include structures that retain their integrity despite damage from fatigue, corrosion or impact.
History
The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by Antonín Svoboda. Its basic design was magnetic drums connected via relays, with a voting method of memory error detection. Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories:- Machines that would last a long time without any maintenance, such as the ones used on NASA space probes and satellites;
- Computers that were very dependable but required constant monitoring, such as those used to monitor and control nuclear power plants or supercollider experiments; and
- Computers with a high amount of runtime that would be under heavy use, such as many of the supercomputers used by insurance companies for their probability monitoring.
Hyper-dependable computers were pioneered mostly by aircraft manufacturers, nuclear power companies, and the railroad industry in the United States. These entities needed computers with massive amounts of uptime that would fail gracefully enough during a fault to allow continued operation, while relying on constant human monitoring of computer output to detect faults. IBM developed the first computer of this kind for NASA for guidance of Saturn V rockets. Later, BNSF, Unisys, and General Electric built their own.
In the 1970s, much work happened in the field. For instance, F14 CADC had built-in self-test and redundancy.
In general, the early efforts at fault-tolerant designs were focused mainly on internal diagnosis, where a fault would indicate something was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure. Later efforts showed that to be fully effective, the system had to be self-repairing and diagnosing – isolating a fault and then implementing a redundant backup while alerting a need for repair. This is known as N-model redundancy, where faults cause automatic fail-safes and a warning to the operator, and it is still the most common form of level one fault-tolerant design in use today.
Voting was another initial method, as discussed above, with multiple redundant backups operating constantly and checking each other's results. For example, if four components reported an answer of 5 and one component reported an answer of 6, the other four would "vote" that the fifth component was faulty and have it taken out of service. This is called M out of N majority voting.
Historically, the trend has been to move away from N-model and toward M out of N, as the complexity of systems and the difficulty of ensuring the transitive state from fault-negative to fault-positive did not disrupt operations.
Tandem Computers, in 1976 and Stratus Technologies were among the first companies specializing in the design of fault-tolerant computer systems for online transaction processing.
Examples
Hardware fault tolerance sometimes requires that broken parts be removed and replaced with new parts while the system is still operational. Such a system implemented with a single backup is known as single point tolerant and represents the vast majority of fault-tolerant systems. In such systems the mean time between failures should be long enough for the operators to have sufficient time to fix the broken devices before the backup also fails. It is helpful if the time between failures is as long as possible, but this is not specifically required in a fault-tolerant system.Fault tolerance is notably successful in computer applications. Tandem Computers built their entire business on such machines, which used single-point tolerance to create their NonStop systems with uptimes measured in years.
Fail-safe architectures may encompass also the computer software, for example by process replication.
Data formats may also be designed to degrade gracefully. Hyper Text Markup Language for example, is designed to be forward compatible, allowing Web browsers to ignore new and unsupported HTML entities without causing the document to be unusable. Additionally, proper user interface design mandates that websites provide an optional lightweight front end that does not rely on JavaScript and has a minimal layout, to ensure wide accessibility and outreach.
Terminology
A highly fault-tolerant system might continue at the same level of performance even though one or more components have failed. For example, a building with a backup electrical generator will provide the same voltage to wall outlets even if the grid power fails.A system that is designed to fail safe, or fail-secure, or fail gracefully, whether it functions at a reduced level or fails completely, does so in a way that protects people, property, or data from injury, damage, intrusion, or disclosure. In computers, a program might fail-safe by executing a graceful exit to prevent data corruption after an error occurs. A similar distinction is made between "failing well" and "failing badly".
A system designed to experience graceful degradation, or to fail soft operates at a reduced level of performance after some component fails. For example, if grid power fails, a building may operate lighting at reduced levels or elevators at reduced speeds. In computing, if insufficient network bandwidth is available to stream an online video, a lower-resolution version might be streamed in place of the high-resolution version. Progressive enhancement is another example, where web pages are available in a basic functional format for older, small-screen, or limited-capability web browsers, but in an enhanced version for browsers capable of handling additional technologies or that have a larger display.
In fault-tolerant computer systems, programs that are considered robust are designed to continue operation despite an error, exception, or invalid input, instead of failing completely. Software brittleness is the opposite of robustness. Resilient networks continue to transmit data despite the failure of some links or nodes. Resilient buildings and infrastructure are likewise expected to prevent complete failure in situations like earthquakes, floods, or collisions.
A system with high failure transparency will alert users that a component failure has occurred, even if it continues to operate with full performance, so that failure can be repaired or imminent complete failure anticipated. Likewise, a fail-fast component is designed to report at the first point of failure, rather than generating reports when downstream components fail. This allows easier diagnosis of the underlying problem, and may prevent improper operation in a broken state.
A single fault condition is a situation where one means for protection against a hazard is defective. If a single fault condition results unavoidably in another single fault condition, the two failures are considered one single fault condition. A source offers the following example:
Criteria
Providing fault-tolerant design for every component is normally not an option. Associated redundancy brings a number of penalties: increase in weight, size, power consumption, cost, as well as time to design, verify, and test. Therefore, a number of choices have to be examined to determine which components should be fault tolerant:- How critical is the component? In a car, the radio is not critical, so this component has less need for fault tolerance.
- How likely is the component to fail? Some components, like the drive shaft in a car, are not likely to fail, so no fault tolerance is needed.
- How expensive is it to make the component fault tolerant? Requiring a redundant car engine, for example, would likely be too expensive both economically and in terms of weight and space, to be considered.
Another excellent and long-term example of this principle being put into practice is the braking system: whilst the actual brake mechanisms are critical, they are not particularly prone to sudden failure, and are in any case necessarily duplicated to allow even and balanced application of brake force to all wheels. It would also be prohibitively costly to further double-up the main components and they would add considerable weight. However, the similarly critical systems for actuating the brakes under driver control are inherently less robust, generally using a cable or hydraulic fluid. Thus in most modern cars the footbrake hydraulic brake circuit is diagonally divided to give two smaller points of failure, the loss of either only reducing brake power by 50% and not causing as much dangerous brakeforce imbalance as a straight front-back or left-right split, and should the hydraulic circuit fail completely, there is a failsafe in the form of the cable-actuated parking brake that operates the otherwise relatively weak rear brakes, but can still bring the vehicle to a halt in conjunction with transmission/engine braking so long as the demands on it are in line with normal traffic flow. The cumulatively unlikely combination of total foot brake failure with the need for harsh braking in an emergency will likely result in a collision, but still one at lower speed than would otherwise have been the case.
In comparison with the foot pedal activated service brake, the parking brake itself is a less critical item, and unless it is being used as a one-time backup for the footbrake, will not cause immediate danger if it is found to be nonfunctional at the moment of application. Therefore, no redundancy is built into it per se, and it can suffice, if this happens on a hill, to use the footbrake to momentarily hold the vehicle still, before driving off to find a flat piece of road on which to stop. Alternatively, on shallow gradients, the transmission can be shifted into Park, Reverse or First gear, and the transmission lock / engine compression used to hold it stationary, as there is no need for them to include the sophistication to first bring the vehicle to a halt.
On motorcycles, a similar level of fail-safety is provided by simpler methods; first, the front and rear brake systems are entirely separate, regardless of their method of activation, allowing one system to fail entirely while leaving the other unaffected. Second, the rear brake is relatively strong compared to its automotive cousin, being a powerful disc on some sports models, even though the usual intent is for the front system to provide the vast majority of braking force; as the overall vehicle weight is more central, the rear tire is generally larger and has better traction, so that the rider can lean back to put more weight on it, therefore allowing more brake force to be applied before the wheel locks. On cheaper, slower utility-class machines, even if the front wheel should use a hydraulic disc for extra brake force and easier packaging, the rear will usually be a primitive, somewhat inefficient, but exceptionally robust rod-actuated drum, thanks to the ease of connecting the foot pedal to the wheel in this way and, more importantly, the near impossibility of catastrophic failure even if the rest of the machine, like many low-priced bikes after their first few years of use, is on the point of collapse from neglected maintenance.