Tandem Computers


Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, 911 systems, and other similar commercial transaction processing applications requiring maximum uptime and no data loss. The company was founded by Jimmy Treybig in 1974 in Cupertino, California. It remained independent until 1997, when it became a server division within Compaq. It is now a server division within Hewlett Packard Enterprise, following Hewlett-Packard's 2002 acquisition of Compaq and its 2015 split into HP Inc. and Hewlett Packard Enterprise.
Tandem's NonStop systems use a number of independent identical processors, redundant storage devices, and redundant controllers to provide automatic high-speed "failover" in the case of a hardware or software failure. To contain the scope of failures and of corrupted data, these multi-computer systems have no shared central components, not even main memory. Conventional multi-computer systems all use shared memories and work directly on shared data objects. Instead, NonStop processors cooperate by exchanging messages across a reliable fabric, and software takes periodic snapshots for possible rollback of program memory state.
Besides masking failures, this "shared-nothing" messaging system design also scales to the largest commercial workloads. Each doubling of the total number of processors doubles system throughput, up to the maximum configuration of 4000 processors. In contrast, the performance of conventional multiprocessor systems is limited by the speed of some shared memory, bus, or switch. Adding more than 4–8 processors in that manner gives no further system speedup. NonStop systems have more often been bought to meet scaling requirements than for extreme fault tolerance. They compete against IBM's largest mainframes, despite being built from simpler minicomputer technology.
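The scaling contrast above can be illustrated with a toy model. This is only a sketch: the function names, the per-processor throughput of 1.0, and the shared-resource ceiling of 6.0 are assumptions chosen to mirror the "4–8 processors" figure, not measurements of any real system.

```python
def shared_nothing_throughput(cpus, per_cpu=1.0):
    """Idealized shared-nothing system: throughput grows linearly with CPUs."""
    return cpus * per_cpu


def shared_resource_throughput(cpus, per_cpu=1.0, shared_limit=6.0):
    """Idealized shared-memory system: throughput is capped by one shared
    resource (memory, bus, or switch). shared_limit is a hypothetical value."""
    return min(cpus * per_cpu, shared_limit)


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, shared_nothing_throughput(n), shared_resource_throughput(n))
```

In this model, doubling from 8 to 16 processors doubles the first figure and leaves the second unchanged.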

Founding

Tandem Computers was founded in 1974 by James Treybig. Treybig first saw the market need for fault tolerance in online transaction processing (OLTP) systems while running a marketing team for Hewlett-Packard's HP 3000 computer division, but HP was not interested in developing for this niche. He then joined the venture capital firm Kleiner Perkins and developed the Tandem business plan there. Treybig pulled together a core engineering team hired away from the HP 3000 division: Mike Green, Jim Katzman, Dave Mackie and Jack Loustaunou. Their business plan called for ultra-reliable systems that never had outages and never lost or corrupted data. These systems would be modular in a new way that was safe from all "single-point failures", yet would be only marginally more expensive than conventional non-fault-tolerant systems. They would be less expensive and support more throughput than some existing ad-hoc toughened systems that used redundant but usually idle "hot spares".
Each engineer was confident they could quickly pull off their own part of this complex new design but doubted that others' areas could be worked out. The parts of the hardware and software design that did not have to be different were largely based on incremental improvements to the familiar hardware and software designs of the HP 3000. Many subsequent engineers and programmers also came from HP. Tandem headquarters in Cupertino, California, were a quarter mile away from the HP offices. Initial venture capital investment in Tandem Computers came from Tom Perkins, who was formerly a general manager of the HP 3000 division.
The business plan included detailed ideas for building a unique corporate culture reflecting Treybig's values.
The design of the initial Tandem/16 hardware was completed in 1975, and the first system shipped to Citibank in May 1976.
The company enjoyed uninterrupted exponential growth through 1983. Inc. magazine ranked Tandem as the fastest-growing public company in America. By 1996, Tandem was a $2.3 billion company employing approximately 8,000 people worldwide.

Tandem NonStop (TNS) stack machines

Over 40 years, Tandem's main NonStop product line grew and evolved in an upward-compatible way from the initial T/16 fault-tolerant system, with three major changes to its top-level modular architecture or its programming-level instruction set architecture. Within each series, there have been several major re-implementations as chip technology progressed.
While conventional systems of the era, including large mainframes, had a mean time between failures on the order of a few days, the NonStop system was designed for failure intervals 100 times longer, with uptimes measured in years. Nevertheless, the NonStop was designed to be price-competitive with conventional systems: a simple 2-CPU system was priced at just over twice the cost of a competing single-processor mainframe, whereas other fault-tolerant solutions cost four or more times as much.

NonStop I

The first system was the Tandem/16 or T/16, later re-branded NonStop I. The machine consisted of between two and 16 CPUs, organized as a fault-tolerant computer cluster packaged in a single rack. Each CPU had its own private, unshared memory, its own I/O processor, its own private I/O bus to connect to I/O controllers, and dual connections to all the other CPUs over a custom inter-CPU backplane bus called Dynabus. Each disk controller or network controller was duplicated and had dual connections to both CPUs and devices. Each disk was mirrored, with separate connections to two independent disk controllers. If a disk failed, its data was still available from its mirrored copy. If a CPU, controller, or bus failed, the disk was still reachable through an alternative CPU, controller, or bus. Each disk or network controller was connected to two independent CPUs. Power supplies were each wired to only one side of a pair of CPUs, controllers, or buses, so that the system would keep running without loss of connections if one power supply failed. The careful, complex arrangement of parts and connections in customers' larger configurations was documented in a Mackie diagram, named after lead salesman David Mackie, who invented the notation. None of these duplicated parts were wasted "hot spares"; everything added to system throughput during normal operations.
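The dual-path arrangement described above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical smallest configuration of two CPUs, two disk controllers, and one mirrored disk pair; the component names are invented for illustration, and the buses and power supplies are left out of the model.

```python
from itertools import product

# Hypothetical component names for a minimal two-CPU configuration.
CPUS = ["cpu0", "cpu1"]
CONTROLLERS = ["ctl_a", "ctl_b"]
DISKS = ["disk_primary", "disk_mirror"]


def paths():
    """Every CPU reaches every controller, and every controller reaches
    both halves of the mirrored pair, so each combination is a path."""
    return list(product(CPUS, CONTROLLERS, DISKS))


def data_reachable(failed):
    """True if at least one path to the data avoids all failed components."""
    return any(not (set(path) & set(failed)) for path in paths())


def single_points_of_failure():
    """Components whose failure alone would cut off the data."""
    return [c for c in CPUS + CONTROLLERS + DISKS if not data_reachable({c})]


if __name__ == "__main__":
    print(single_points_of_failure())
```

In this model no single component appears in every path, so the list of single points of failure is empty; the data becomes unreachable only when both members of some pair fail together.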
Besides recovering well from failed parts, the T/16 was also designed to detect as many kinds of intermittent failures as possible, as soon as possible. This prompt detection is called "fail fast". The point was to find and isolate corrupted data before it was permanently written into databases and other disk files. In the T/16, error detection was by added custom circuits that added little cost to the total design; no major parts were duplicated to get error detection.
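The "fail fast" idea can be sketched in software. The example below uses a single parity bit per 16-bit word as a stand-in check; this is an assumption for illustration only, since the text does not say which checks the T/16 circuits performed. The names `store`, `load`, and `CorruptDataError` are hypothetical.

```python
class CorruptDataError(Exception):
    """Raised as soon as a stored word fails its check."""


def parity(word):
    """Even/odd parity of the low 16 bits of a word (0 or 1)."""
    return bin(word & 0xFFFF).count("1") % 2


def store(word):
    """Keep the word together with its parity bit."""
    word &= 0xFFFF
    return (word, parity(word))


def load(cell):
    """Return the word, or stop immediately if it no longer matches its
    parity bit, so corrupted data is never passed on to later steps."""
    word, expected = cell
    if parity(word) != expected:
        raise CorruptDataError(f"parity mismatch on word {word:#06x}")
    return word
```

A single flipped bit makes `load` raise instead of returning a wrong value, which is the point of the technique: the fault is surfaced before the data can reach a disk file.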
The T/16 CPU was a proprietary design. It was greatly influenced by the HP 3000 minicomputer. They were both microprogrammed, 16-bit, stack-based machines with segmented, 16-bit virtual addressing. Both were intended to be programmed exclusively in high-level languages, with no use of assembler. Both were initially implemented via standard low-density TTL chips, each holding a 4-bit slice of the 16-bit ALU. Both had a small number of top-of-stack, 16-bit data registers plus some extra address registers for accessing the memory stack. Both used Huffman encoding of operand address offsets, to fit a large variety of address modes and offset sizes into the 16-bit instruction format with good code density. Both relied heavily on pools of indirect addresses to overcome the short instruction format. Both supported larger 32- and 64-bit operands via multiple ALU cycles, and memory-to-memory string operations. Both used "big-endian" addressing of long versus short memory operands. These features had all been inspired by Burroughs B5500–B6800 mainframe stack machines.
The T/16 instruction set changed several features from the HP 3000 design. The T/16 supported paged virtual memory from the beginning. The HP 3000 series did not add paging until the PA-RISC generation, 10 years later. Tandem added support for 32-bit addressing in its second machine; HP 3000 lacked this until its PA-RISC generation. Paging and long addresses were critical for supporting complex system software and large applications. The T/16 treated its top-of-stack registers in a novel way; the compiler, not the microcode, was responsible for deciding when full registers were spilled to the memory stack and when empty registers were re-filled from the memory stack. On the HP 3000, this decision took extra microcode cycles in every instruction. The HP 3000 supported COBOL with several instructions for calculating directly on arbitrary-length BCD strings of digits. The T/16 simplified this to single instructions for converting between BCD strings and 64-bit binary integers.
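The BCD conversion mentioned above can be sketched as follows. This assumes packed BCD (two decimal digits per byte) and non-negative values only; the actual operand format and sign handling of the T/16 instructions are not described here, and the function names are hypothetical.

```python
INT64_MAX = 2**63 - 1  # largest value of a signed 64-bit integer


def bcd_to_int(packed: bytes) -> int:
    """Convert packed BCD (assumed: two digits per byte) to an integer."""
    value = 0
    for byte in packed:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError(f"byte {byte:#04x} is not valid BCD")
        value = value * 100 + high * 10 + low
    if value > INT64_MAX:
        raise OverflowError("value does not fit in 64 bits")
    return value


def int_to_bcd(value: int, num_bytes: int) -> bytes:
    """Convert a non-negative integer to packed BCD of a fixed length."""
    if not 0 <= value <= INT64_MAX:
        raise OverflowError("value out of range for this sketch")
    out = bytearray(num_bytes)
    for i in range(num_bytes - 1, -1, -1):
        value, pair = divmod(value, 100)
        out[i] = ((pair // 10) << 4) | (pair % 10)
    if value:
        raise OverflowError("value needs more BCD digits than num_bytes allows")
    return bytes(out)
```

Once a decimal field has been converted this way, ordinary binary arithmetic can be applied to it, and the result converted back for storage.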
In the T/16, each CPU consisted of two boards of TTL logic and SRAMs, and ran at about 0.7 MIPS. At any instant, it could access only four virtual memory segments, each limited to 128 KB in size. The 16-bit address spaces were already small for major applications when it shipped.
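The segment size follows from the address width. As a worked check, assuming each 16-bit address names one 16-bit (2-byte) word, which is consistent with the 128 KB figure above:

```python
ADDRESS_BITS = 16        # width of a virtual address within a segment
BYTES_PER_WORD = 2       # assumption: addresses name 16-bit words
SEGMENTS_VISIBLE = 4     # segments accessible at any instant

segment_bytes = (1 << ADDRESS_BITS) * BYTES_PER_WORD
visible_bytes = segment_bytes * SEGMENTS_VISIBLE

if __name__ == "__main__":
    print(segment_bytes // 1024, "KB per segment")
    print(visible_bytes // 1024, "KB addressable at any instant")
```

Under that assumption a program could see at most 512 KB of code and data at once.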
The first release of T/16 had only a single programming language, Transaction Application Language (TAL). This was an efficient machine-dependent systems programming language but could also be used for non-portable applications. It was derived from HP 3000's System Programming Language. Both had semantics similar to C but a syntax based on Burroughs' ALGOL. Subsequent releases added support for COBOL 74, BASIC, Fortran, Java, C, C++, and MUMPS.
The Tandem NonStop series ran a custom operating system which was significantly different from Unix or HP 3000's MPE. It was initially called T/TOS but soon named Guardian for its ability to protect all data from machine faults and software faults. In contrast to all other commercial operating systems, Guardian was based on message passing as the basic way for all processes to interact, without shared memory, regardless of where the processes were running. This approach easily scaled to multiple-computer clusters and helped isolate corrupted data before it propagated.
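The message-passing style can be illustrated with a small model. This is not Guardian's interface; the `Cluster` class and its methods are hypothetical. The sketch shows two properties named above: a sender addresses a process by name without knowing which CPU it runs on, and the receiver gets its own copy of the message rather than a reference into the sender's memory.

```python
import copy
from collections import deque


class Cluster:
    """Toy model of processes that interact only through messages."""

    def __init__(self):
        self.mailboxes = {}  # process name -> queue of pending messages
        self.location = {}   # process name -> CPU number (hidden from senders)

    def spawn(self, name, cpu):
        """Register a process and the CPU it runs on."""
        self.mailboxes[name] = deque()
        self.location[name] = cpu

    def send(self, to, message):
        """Deliver a private copy; nothing is shared between processes."""
        self.mailboxes[to].append(copy.deepcopy(message))

    def receive(self, name):
        """Take the oldest pending message for a process."""
        return self.mailboxes[name].popleft()
```

Because each message is copied, a later fault that corrupts the sender's memory cannot reach data the receiver already holds.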
All file system processes and all transactional application processes were structured as master/slave pairs of processes running in separate CPUs. The slave process periodically took snapshots of the master's memory state and took over the workload if and when the master process ran into trouble. This allowed the application to survive failures in any CPU or its associated devices, without data loss. It further allowed recovery from some intermittent-style software failures. Between failures, the monitoring by the slave process added some performance overhead but this was far less than the 100% duplication in other system designs. Some major early applications were directly coded in this checkpoint style, but most instead used various Tandem software layers which hid the details of this in a semi-portable way.
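The process-pair idea can be sketched as follows. This is a simplified model, not Tandem's implementation: the two roles are called primary and backup here, the class and method names are hypothetical, and a snapshot is taken after every request rather than periodically so that the takeover point is easy to see.

```python
import copy


class ProcessPair:
    """Toy model of a primary process whose state is mirrored to a backup."""

    def __init__(self):
        self.primary_state = {"balance": 0}
        self.backup_snapshot = copy.deepcopy(self.primary_state)
        self.primary_alive = True

    def handle(self, amount):
        """Apply a request on the primary, then checkpoint to the backup."""
        if not self.primary_alive:
            raise RuntimeError("primary is down; backup must take over first")
        self.primary_state["balance"] += amount
        # Simplification: snapshot on every request instead of periodically.
        self.backup_snapshot = copy.deepcopy(self.primary_state)
        return self.primary_state["balance"]

    def fail_primary(self):
        """Simulate loss of the CPU running the primary; its memory is gone."""
        self.primary_alive = False
        self.primary_state = None

    def take_over(self):
        """The backup resumes service from the last snapshot it received."""
        self.primary_state = copy.deepcopy(self.backup_snapshot)
        self.primary_alive = True
```

After `fail_primary()` and `take_over()`, later requests continue from the last checkpointed balance, which is how the workload survives the loss of one CPU in this model.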