Vector processor


In computing, a vector processor is a central processing unit that implements an instruction set where its instructions are designed to operate efficiently and architecturally sequentially on large one-dimensional arrays of data called vectors. When integrated as a hardware component the vector processor is often called a vector processing unit. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data or SIMD within a register Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation, compression and similar tasks.
Vector processing techniques also operate in video-game console hardware and in graphics accelerators but these are invariably Single instruction, multiple threads and occasionally Single instruction, multiple data.
Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to a decline in vector supercomputers during the 1990s.

History

Early research and development

Array processing development began in the early 1960s at the Westinghouse Electric Corporation in their Solomon project. Solomon's goal was to dramatically increase math performance by using a large number of simple coprocessors under the control of a single master Central processing unit. The CPU fed a single common instruction to all of the arithmetic logic units, one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array, leading it to be cited as an example Array processor in Flynn's taxonomy.
In 1962, Westinghouse cancelled the project, but the effort was restarted by the University of Illinois at Urbana–Champaign as the ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as computational fluid dynamics, the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category of massively parallel computing: around 1972 Flynn categorized this type of processing as an early form of single instruction, multiple threads.
International Computers Limited sought to avoid many of the difficulties with the ILLIAC concept with its own Distributed Array Processor design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray 1.

Computer for operations with functions

A computer for operations with functions was presented and developed by Kartsev in 1967.

Supercomputers

The first vector supercomputers are the Control Data Corporation STAR-100 and Texas Instruments Advanced Scientific Computer, which were introduced in 1974 and 1972, respectively.
The basic ASC ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes.
The STAR-100 was otherwise slower than CDC's own supercomputers like the CDC 7600, but at data-related tasks they could keep up while being much smaller and less expensive. However the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up.
The vector technique was first fully exploited in 1976 by the famous Cray-1. Instead of leaving the data in memory like the STAR-100 and ASC, the Cray design had eight vector registers, which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to the next operation, the Cray design would load a smaller section of the vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations.
The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions, for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined into each of the ALU subunits, a technique they called vector chaining. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 – far faster than any machine of the era.
Other examples followed. Control Data Corporation tried to re-enter the high-end market again with its ETA-10 machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies Fujitsu, Hitachi and Nippon Electric Corporation introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller: the Fujitsu VP series culminated in the VP2600 holding the world record fastest supercomputer in 1990-1991. Oregon-based Floating Point Systems built add-on array processors for minicomputers, later building their own minisupercomputers.
Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as a vector processor. IBM also implemented Cray-style Vector processing in the IBM 3090 with an optional vector facility.
Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their SX series of computers. Most recently, the SX-Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM 2 module within a card that physically resembles a graphics coprocessor, but instead of serving as a co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions.

GPU

Modern graphics processing units include an array of shader pipelines which may be driven by compute kernels, and being usually SIMT are frequently miscategorised as vector processors, the two being very similar categorisation of SIMD in Flynn's 1972 SIMD taxonomy. GPUs use a strategy for hiding memory latencies as they run into an extreme form of the memory wall. As shown in Flynn's 1972 paper the key distinguishing factor of SIMT-based GPUs is that it has a single instruction decoder-broadcaster but that the cores receiving and executing that same instruction are otherwise reasonably normal: their own ALUs, their own register files, their own Load/Store units and their own independent L1 data caches. Thus although all cores simultaneously execute the exact same instruction in lock-step with each other they do so with completely different data from completely different memory locations. This is significantly more complex and involved than "Packed SIMD", which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture.

Recent development

Several modern CPU architectures are being designed as vector processors. The RISC-V vector extension follows similar principles as the early vector processors, and is being implemented in commercial products such as the Andes Technology AX45MPV. There are also several open source vector processor architectures being developed, including ForwardCom and Libre-SOC.

Comparison with modern architectures

most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple data sets, and borrow features from vector processors. However, by definition, the addition of SIMD cannot, by itself, qualify a processor as an actual vector processor, because SIMD is, and vectors are. The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing.
Other CPU designs include some multiple instructions for vector processing on multiple data sets, typically known as MIMD and realized with VLIW and EPIC. The Fujitsu FR-V low-power multimedia processor combines VLIW with predication and Packed SIMD.