Cray-1
The Cray-1 was a supercomputer designed, manufactured and marketed by Cray Research. Announced in 1975, the first Cray-1 system was installed at Los Alamos National Laboratory in 1976. Eventually, eighty Cray-1s were sold, making it one of the most successful supercomputers in history. It is perhaps best known for its unique shape, a relatively small C-shaped cabinet with a ring of benches around the outside covering the power supplies and the cooling system.
The Cray-1 was the first supercomputer to successfully implement the vector processor design. These systems improve the performance of math operations by arranging memory and registers to quickly perform a single operation on a large set of data. Previous systems like the CDC STAR-100 and ASC had implemented these concepts but did so in a way that seriously limited their performance. The Cray-1 addressed these problems and produced a machine that ran several times faster than any similar design.
The Cray-1's architect was Seymour Cray; the chief engineer was Cray Research co-founder Lester Davis. They went on to design several new machines using the same basic concepts, and these retained the performance crown into the 1990s.
History
From 1968 to 1972, Seymour Cray of Control Data Corporation worked on the CDC 8600, the successor to his earlier CDC 6600 and CDC 7600 designs. The 8600 was essentially made up of four 7600s in a box with an additional special mode that allowed them to operate lock-step in a SIMD fashion.

Jim Thornton, formerly Cray's engineering partner on earlier designs, had started a more radical project known as the CDC STAR-100. Unlike the 8600's brute-force approach to performance, the STAR took an entirely different route. The main processor of the STAR had lower performance than the 7600, but added hardware and instructions to speed up particularly common supercomputer tasks.
By 1972, the 8600 had reached a dead end; the machine was so incredibly complex that it was impossible to get one working properly. Even a single faulty component would render the machine non-operational. Cray went to William Norris, Control Data's CEO, saying that a redesign from scratch was needed. At the time, the company was in serious financial trouble, and with the STAR in the pipeline as well, Norris could not invest the money.
As a result, Cray left CDC and started Cray Research very close to the CDC lab. In the back yard of the land he purchased in Chippewa Falls, Wisconsin, Cray and a group of former CDC employees started looking for ideas. At first, the concept of building another supercomputer seemed impossible, but after Cray Research's Chief Technology Officer travelled to Wall Street and found a lineup of investors willing to back Cray, all that was needed was a design.
For four years Cray Research designed its first computer. In 1975 the 80 MHz Cray-1 was announced. The excitement was so high that a bidding war for the first machine broke out between Lawrence Livermore National Laboratory and Los Alamos National Laboratory, the latter eventually winning and receiving serial number 001 in 1976 for a six-month trial. The National Center for Atmospheric Research was the first official customer of Cray Research in 1977, paying US$8.86 million for serial number 3. The NCAR machine was decommissioned in 1989. The company expected to sell perhaps a dozen of the machines, and set the selling price accordingly, but ultimately over 80 Cray-1s of all types were sold, priced from $5M to $8M. The machine made Seymour Cray a celebrity and his company a success, lasting until the supercomputer crash in the early 1990s.
Based on a recommendation by William Perry's study, the NSA purchased a Cray-1 for theoretical research in cryptanalysis. According to Budiansky, "Though standard histories of Cray Research would persist for decades in stating that the company's first customer was Los Alamos National Laboratory, in fact it was NSA..."
The 160 MFLOPS Cray-1 was succeeded in 1982 by the 800 MFLOPS Cray X-MP, the first Cray multi-processing computer. In 1985, the very advanced Cray-2, capable of 1.9 GFLOPS peak performance, succeeded the first two models but met with somewhat limited commercial success because of problems achieving sustained performance in real-world applications. A more conservatively designed evolutionary successor of the Cray-1 and X-MP models, the Cray Y-MP, was therefore developed and launched in 1988.
By comparison, the processor in a typical 2013 smart device, such as a Google Nexus 10 or HTC One, performs at roughly 1 GFLOPS, while the A13 processor in a 2019 iPhone 11 performs at 154.9 GFLOPS, a mark supercomputers succeeding the Cray-1 would not reach until 1994.
Background
Typical scientific workloads consist of reading in large data sets, transforming them in some way and then writing them back out again. Normally the transformations being applied are identical across all of the data points in the set. For instance, the program might add 5 to every number in a set of a million numbers.

In simple computers the program would loop over all million numbers, adding five, thereby executing a million instructions saying

    a = add b, c

Internally the computer solves this instruction in several steps. First it reads the instruction from memory and decodes it, then it collects any additional information it needs, in this case the numbers b and c, and then finally runs the operation and stores the results. The end result is that the computer requires tens or hundreds of millions of cycles to carry out these operations.

Vector machines
In the STAR, new instructions essentially wrote the loops for the user. The user told the machine where in memory the list of numbers was stored, then fed in a single instruction

    a = addv b, c

At first glance it appears the savings are limited; in this case the machine fetches and decodes only a single instruction instead of 1,000,000, thereby saving nearly 1,000,000 fetches and decodes, perhaps one-fourth of the overall time.

The real savings are not so obvious. Internally, the CPU of the computer is built up from a number of separate parts dedicated to a single task, for instance, adding a number, or fetching from memory. Normally, as the instruction flows through the machine, only one part is active at any given time. This means that each sequential step of the entire process must complete before a result can be saved. The addition of an instruction pipeline changes this. In such machines the CPU will "look ahead" and begin fetching succeeding instructions while the current instruction is still being processed. In this assembly-line fashion any one instruction still requires as long to complete, but as soon as it finishes executing, the next instruction is right behind it, with most of the steps required for its execution already completed.
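In rough C terms (a hypothetical sketch; these machines were actually programmed in Fortran and assembly), the scalar loop below is what the single vector instruction replaces:

    /* Scalar form: one add instruction is fetched, decoded and
       executed per element, a million times over. */
    void add_arrays(double *a, const double *b, const double *c, long n) {
        for (long i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }
    /* A STAR-style vector add ("a = addv b, c") expresses this whole
       loop as a single instruction: the hardware is given the operand
       addresses and the length, then streams through the data. */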
Vector processors use this technique with one additional trick. Because the data layout is in a known format — a set of numbers arranged sequentially in memory — the pipelines can be tuned to improve the performance of fetches. On the receipt of a vector instruction, special hardware sets up the memory access for the arrays and stuffs the data into the processor as fast as possible.
CDC's approach in the STAR used what is today known as a memory-memory architecture. This referred to the way the machine gathered data. It set up its pipeline to read from and write to memory directly. This allowed the STAR to use vectors of length not limited by the length of registers, making it highly flexible. Unfortunately, the pipeline had to be very long in order to allow it to have enough instructions in flight to make up for the slow memory. That meant the machine incurred a high cost when switching from processing vectors to performing operations on non-vector operands. Additionally, the low scalar performance of the machine meant that after the switch had taken place and the machine was running scalar instructions, the performance was quite poor. The result was rather disappointing real-world performance, something that could, perhaps, have been forecast by Amdahl's law.
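Amdahl's law makes the risk concrete: if a fraction f of the run time benefits from a speedup s (here, the vectorized part) while the remainder runs at the old scalar speed, the overall speedup is at most

    S = 1 / ((1 - f) + f / s)

so, to take an illustrative figure, a program that still spends 20% of its time in slow scalar code can never run more than five times faster overall, no matter how fast the vector hardware is.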
Cray's approach
Cray studied the failure of the STAR and learned from it. He decided that in addition to fast vector processing, his design would also require excellent all-around scalar performance. That way when the machine switched modes, it would still provide superior performance. Additionally, he noticed that performance on these workloads could be dramatically improved in most cases through the use of registers.

Just as earlier machines had ignored the fact that most operations were being applied to many data points, the STAR ignored the fact that those same data points would be repeatedly operated on. Whereas the STAR would read and process the same memory five times to apply five vector operations on a set of data, it would be much faster to read the data into the CPU's registers once, and then apply the five operations. However, there were limitations with this approach. Registers were significantly more expensive in terms of circuitry, so only a limited number could be provided. This implied that Cray's design would have less flexibility in terms of vector sizes. Instead of reading any sized vector several times as in the STAR, the Cray-1 would have to read only a portion of the vector at a time, but it could then run several operations on that data prior to writing the results back to memory. Given typical workloads, Cray felt that the small cost incurred by being required to break large sequential memory accesses into segments was a cost well worth paying.
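The Cray-1's vector registers each held 64 elements, so in practice a long vector was processed one register-load at a time. The C sketch below (illustrative only; the constants and the operations applied are placeholders) shows this strip-mined, register-based style, where each 64-element segment is loaded once, operated on several times, and written back once:

    #define VLEN 64   /* matches the 64-word Cray-1 vector registers */

    void process(double *data, long n) {
        for (long i = 0; i < n; i += VLEN) {
            long len = (n - i < VLEN) ? n - i : VLEN;   /* last segment may be short */
            double v[VLEN];                             /* stands in for a vector register */

            for (long j = 0; j < len; j++) v[j] = data[i + j];  /* one load from memory */
            for (long j = 0; j < len; j++) v[j] += 5.0;         /* several operations run on the */
            for (long j = 0; j < len; j++) v[j] *= 2.0;         /* register-resident data ...    */
            for (long j = 0; j < len; j++) data[i + j] = v[j];  /* one store back to memory */
        }
    }

A STAR-style memory-to-memory machine would instead stream the full array through memory once per operation.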
Since the typical vector operation would involve loading a small set of data into the vector registers and then running several operations on it, the vector system of the new design had its own separate pipeline. For instance, the multiplication and addition units were implemented as separate hardware, so the results of one could be internally pipelined into the next, the instruction decode having already been handled in the machine's main pipeline. Cray referred to this concept as chaining, as it allowed programmers to "chain together" several instructions and extract higher performance.
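In C-like terms (a conceptual sketch, not actual Cray code), chaining has the effect of fusing a vector multiply and a vector add so that each product flows straight into the adder instead of being written out as an intermediate vector:

    /* Without chaining this would be two separate passes, with the
       intermediate products stored and re-read in between. With
       chaining, the multiply unit's output for each element feeds the
       add unit directly. */
    void chained_madd(double *a, const double *b, const double *c,
                      const double *d) {
        for (int i = 0; i < 64; i++)      /* one 64-element vector register's worth */
            a[i] = b[i] * c[i] + d[i];    /* multiply result chains into the add */
    }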