University of Illinois Center for Supercomputing Research and Development
The Center for Supercomputing Research and Development at the University of Illinois was a research center funded from 1984 to 1993. It built the shared memory Cedar computer system, which included four hardware multiprocessor clusters, as well as parallel system and applications software. It was distinguished from the four earlier UIUC Illiac systems by starting with commercial shared memory subsystems that were based on an earlier paper published by the CSRD founders. Thus CSRD was able to avoid many of the hardware design issues that slowed the Illiac series work. Over its 9 years of major funding, plus follow-on work by many of its participants, CSRD pioneered many of the shared memory architectural and software technologies upon which all 21st century computation is based.
History
UIUC began computer research in the 1950s, initially for civil engineering problems, and eventually succeeded by cooperative activities among the Math, Physics, and Electrical Engineering Departments to build the Illiac computer series. This led to founding the Computer Science Department in 1965.By the early 1980s, a time of world-wide HPC expansion arrived, including the race with the Japanese 5th generation system targeting innovative parallel applications in AI. HPC/supercomputing had emerged as a field, commercial supercomputers were in use by industry and labs, and academic architecture and compiler research were expanding. This led to formation of the Lax committee. to study the academic needs of focused HPC research, and to provide commercial HPC systems for university research. When HPC practitioner Ken Wilson won the Nobel physics prize in 1982, he expanded his already strong advocacy of both, and soon several government agencies introduced HPC R&D programs.
As a result, the UIUC Center for Supercomputing R&D was formed in 1984, under the leadership of three CS professors who had worked together since the Illiac 4 project – David Kuck, Duncan Lawrie and Ahmed Sameh, plus Ed Davidson who joined from ECE. Many graduate students and post-docs were already contributing to constituent efforts; full time academic professionals were hired, and other faculty cooperated. A total of up to 125 people were involved at the peak, over the nine years of full CSRD operation
The UIUC administration responded to the computing and scientific times. CSRD was set up as a Graduate College unit, with space in Talbot Lab. UIUC President Stanley Ikenberry arranged to have Governor James Thompson directly endow CSRD with $1 million per year to guarantee personnel continuity. CSRD management helped write proposals that led to a gift from Arnold Beckman of a $50 million building, the establishment of NCSA, and a new CSRD building.
The CSRD plan for success took a major departure from earlier Illiac machines by integrating four commercially built parallel machines using an innovative interconnection network and global shared memory. Cedar was based on designing and building a limited amount of innovative hardware, driven by SW that was built on top of emerging parallel applications and compiler technology. By breaking the tradition of building hardware first and then dealing with SW details later, this codesign approach led to the name Cedar instead of Illiac 5.
Earlier work by the CSRD founders had intensively studied a variety of new high-radix interconnection networks, built tools to measure the parallelism in sequential programs, designed and built a restructuring compiler to transform sequential programs into parallel forms, as well as inventing parallel numerical algorithms. During the Parafrase development of the 1970s, several papers were published proposing ideas for expressing and automatically optimizing parallelism.
These ideas influenced later compiler work at IBM, Rice U. and elsewhere. Parafrase had been donated to Fran Allen's IBM PTRAN group in the late 1970s, Ken Kennedy had gone there on sabbatical and obtained a Parafrase copy, and Ron Cytron joined the IBM group from UIUC. Also, KAI was founded in 1979, by three Parafrase veterans who wrote KAP, a new source-source restructurer,.
The key Cedar idea was to exploit feasible-scale parallelism, by linking together a number of shared memory nodes through an interconnection network and memory hierarchy. Alliant Computers, Inc. Alliant Computer Systems had obtained venture capital funding, based on an earlier architecture paper by the CSRD team and was then shipping systems. The Cedar team was thus immediately able to focus on designing hardware to link 4 Alliant systems and add a global shared memory to the Alliant 8-processor shared memory nodes. In distinction to this, other academic teams of the era pursued massively parallel systems, fetch-and-add combining networks, innovative caching, dataflow systems, etc.
In sharp contrast, two decades earlier, the Illiac 4 team required years of work with state of the art industry hardware technology leaders to get the system designed and built. The 1966 industrial hardware proposals for Illiac 4 hardware technology even included a GE Josephson Junction proposal which John Bardeen helped evaluate while he was developing the theory that led to his superconductivity Nobel prize. After contracting with Burroughs Corp to build and integrate an all-transistor hardware system, lengthy discussions ensued about the semiconductor memory design with subcontractor Texas Instruments' Jack Kilby, Morris Chang and others. Earlier Illiac teams had pushed contemporary technologies, with similar implementation problems and delays.
Many attempts at parallel computing startups arose in the decades following Illiac 4, but nothing achieved success until adequate languages and software were developed in the 1970s and 80s. Parafrase veteran Steve Chen joined Cray and led development of the parallel/vector Cray-XMP, released in 1982. The 1990s were a turning point with many 1980s startups failing, the end of bipolar technology cost-effectiveness, and the general end of academic computer building. By the 2000s, with Intel and others manufacturing massive numbers of systems, shared memory parallelism had become ubiquitous.
CSRD and the Cedar system played key roles in advancing shared memory system effectiveness. Many CSRD innovations of the late 80s are in common use today, including hierarchical shared memory hardware. Cedar also had parallel Fortran extensions, a vectorizing and parallelizing compiler, and custom Linux-based OS, that were used to develop advanced parallel algorithms and applications. These will be detailed below.
Cedar design and construction
One unusually productive aspect of the Cedar design effort was the ongoing cooperation among the R&D efforts of architects, compiler writers, and application developers. Another was the substantial legacy of ideas and people from the Parafrase project in the 1970s. These enabled the team to focus on several design topics quickly:- Interconnection network and shared memory hierarchy
- Compiler algorithms, OS, and SW tools
- Applications and performance analysis
. Designing, building and integrating the system was then a multi-year effort, including architecture, hardware, compiler, OS and algorithm work.
System Architecture & Hardware
The hardware design led to 3 different types of 24” printed circuit boards, with the network board using CSRD-designed crossbar gate array chips. The boards were assembled into three custom racks in a machine room in Talbot Lab using water-cooled heat exchangers. Cedar’s key architectural innovations and features included:- A hierarchical/cluster-based shared-memory multiprocessor design using Alliant FX as the building block. This approach is still being followed in today’s parallel machines, where Alliant 8-processor systems have been shrunken to single chip multi-core nodes containing 16, 32 or more cores, depending on power and thermal limitations.
- The first SMP use of a scalable high-radix multi-stage, shuffle-exchange interconnection network, i.e. a 2-stage Omega network using 8x8 crossbar switches as building blocks. In 2005, the Cray BlackWidow Cray X2 used a variant of such networks in the form of a high radix 3-stage Clos network, with 32x32 crossbar switches. Subsequently many other systems have adopted the idea.
- The first hardware data prefetcher, using a “next-array-element” prefetching scheme to load array data from the shared global memory. Data prefetching is a critical technology on today’s multicores.
- The first “processor-in-memory” in its shared global memory to perform long-latency synchronization operations. Today, using PIM to carry out various operations in shared global memory is still an active architectural research area.
- Software-combining techniques for scalable synchronization operations
Language and compiler
Cedar Fortran contained a two-level parallel loop hierarchy that reflected the Cedar architecture. Each iteration of outer parallel loops made use of one cluster and a second level parallel loop made use of one of the eight processors of a cluster for each of its iterations. Cedar Fortran also contained primitives for doacross synchronization and control of critical sections. Outer-level parallel loops were initiated, scheduled and synchronized using a runtime library while inner loops relied on Alliant hardware instructions to initiate the loops, schedule and synchronize their iterations.
Global variables and arrays were allocated in global memory while those declared local to iterations of outer parallel loops were allocated within clusters. There were no caches between clusters and main memory and therefore, programmers had to explicitly copy from global memory to local memory to attain faster memory accesses. These mechanisms worked well in all cases tested and gave programmers control over processor assignment and memory allocation. As discussed in the next section, numerous applications were implemented in Cedar Fortran.
Cedar compiler work started with the development of a Fortran parallelizer for Cedar built by extending KAP, a vectorizer, which was contributed by KAI to CSRD. Because it was built on a vectorizer the first modified version of KAP developed at CSRD lacked some important capabilities necessary for an effective translation for multiprocessors, such as array privatization and parallelization of outer loops. Unlike Parafrase, which ran only on IBM machines, KAP ran on many machines. To identify the missing capabilities and develop the necessary translation algorithms, a collection of Fortran programs from the Perfect Benchmarks was parallelized by hand. Only techniques that were considered implementable were used in the manual parallelization study. The techniques were later used for a second generation parallelizer that proved effective on collections of programs not used in the manual parallelization study