QPACE2
QPACE 2 is a massively parallel and scalable supercomputer. It was designed for applications in lattice quantum chromodynamics but is also suitable for a wider range of applications..
Overview
QPACE 2 is a follow-up to the QPACE supercomputer and the iDataCool hot-water cooling project.It is a combined effort of the particle physics group at the University of Regensburg and the Italian company Eurotech. The academic design team consisted of about 10 junior and senior physicists.
Details of the project are described in.
QPACE 2 uses Intel Xeon Phi processors, interconnected by a combination of PCI Express and FDR InfiniBand.
The main features of the QPACE 2 prototype installed at the University of Regensburg are
- scalability
- high packaging density
- warm-water cooling
- high energy efficiency
- cost-effective design
QPACE 2 was funded by the German Research Foundation in the framework of SFB/TRR-55 and by Eurotech.
Architecture
Many current supercomputers are hybrid architectures that use accelerator cards with a PCIe interface to boost the compute performance. In general, server processors support only a limited number of accelerators due to the limited number of PCIe lanes. The common approach to integrate multiple accelerators cards into the host system is to arrange multiple server processors, typically two or four, as distributed shared memory systems. This approach allows for a higher number of accelerators per compute node due to the higher number of PCIe lanes. However, it also comes with several disadvantages:- The server processors, their interconnects and memory chips significantly increase the foot-print of the host system.
- Expenses for the multiprocessor design are typically high.
- Server processors significantly contribute to the overall power signature of hybrid computer architectures and need appropriate cooling capacities.
- The server processor interconnect can hinder efficient intra-node communication and impose limitations on the performance of inter-node communication via the external network.
- The compute performance of server processors is typically an order of magnitude lower than that of accelerator cards, thus their contribution to the overall performance can be rather small.
- The instruction set architectures and hardware resources of server processors and accelerators differ significantly. Therefore, it is not always feasible for code to be developed for and to be executed on both architectures.
The QPACE 2 rack contains 64 compute nodes. 32 nodes each are on the front- and backside of the rack. The power subsystem consists of 48 power supplies that deliver an aggregate peak power of 96 kW. QPACE 2 relies on a warm-water cooling solution to achieve this packaging and power density.
Compute node
The QPACE 2 node consists of commodity hardware interconnected by PCIe. The midplane hosts a 96-lane PCIe switch, provides six 16-lane PCIe Gen3 slots, and delivers power to all slots. One slot is used for the CPU card, which is a PCIe form factor card containing one Intel Haswell E3-1230L v3 server processor with 16 GB DDR3 memory as well as a microcontroller to monitor and control the node. Four slots are used for Xeon Phi 7120X cards with 16 GB GDDR5 each, and one slot for a dual-port FDR InfiniBand network interface card.The midplane and the CPU card were designed for the QPACE 2 project but can be reused for other projects or products.
The low-power Intel E3-1230L v3 server CPU is energy-efficient, but weak in computational power compared to other server processors available around 2015. The CPU does not contribute significantly to the compute power of the node. It is merely running the operating system and system-relevant drivers. Technically, the CPU serves as a root complex for the PCIe fabric. The PCIe switch extends the host CPU's limited number of PCIe lanes to a total of 80 lanes, therefore enabling a multitude of components to be connected to the CPU as PCIe endpoints. This architecture also allows the Xeon Phis to do peer-to-peer communication via PCIe and to directly access the external network without having to go through the host CPU.
Each QPACE 2 node comprises 248 physical cores. Host processor and accelerators support multithreading. The number of logical cores per node is 984.
The design of the node is not limited to the components used in QPACE 2. In principle, any cards supporting PCIe, e.g., accelerators such as GPUs and other network technologies than InfiniBand, can be used as long as form factor and power specifications are met.
Networks
The intra-node communication proceeds via the PCIe switch without host CPU involvement. The inter-node communication is based on FDR InfiniBand. The topology of the InfiniBand network is a two-dimensional hyper-crossbar. This means that a two-dimensional mesh of InfiniBand switches is built, and the two InfiniBand ports of a node are connected to one switch in each of the dimensions. The hyper-crossbar topology was first introduced by the Japanese CP-PACS collaboration of particle physicists.The InfiniBand network is also used for I/O to a Lustre file system.
The CPU card provides two Gigabit Ethernet interfaces that are used to control the nodes and to boot the operating system.
Cooling
The nodes of the QPACE 2 supercomputer are cooled by water using an innovative concept based on roll-bond technology. Water flows through a roll-bond plate made of aluminum which is thermally coupled to the hot components via aluminum or copper interposers and thermal grease or thermal interface material. All components of the node are cooled in this way. The performance of the cooling concept allows for free cooling year-round.The power consumption of a node was measured to be up to 1400 Watt in synthetic benchmarks. Around 1000 Watt are needed for typical computations in lattice quantum chromodynamics.
System software
The diskless nodes are operated using a standard Linux distribution, which is booted over the Ethernet network. The Xeon Phis are running the freely available Intel Manycore Platform Software Stack.The InfiniBand communication is based on the OFED stack, which is freely available as well.