Cray XMT
Cray XMT is a scalable multithreaded shared-memory supercomputer architecture by Cray, based on the third generation of the Tera MTA architecture and targeted at large graph problems. Presented in 2005, it supersedes the earlier, unsuccessful Cray MTA-2. It uses Threadstorm3 CPUs inside Cray XT3 blades. Because it was designed to use commodity parts and subsystems already built for other commercial systems, it avoided the high cost of fully custom manufacture and support that had burdened the Cray MTA-2. It brought substantial improvements over the Cray MTA-2, most notably nearly tripling peak performance and vastly increasing the maximum CPU count to 8,192 and the maximum memory to 128 TB, with a data TLB covering at most 512 TB.
Cray XMT uses a scrambled content-addressable memory model on DDR1 ECC modules to implicitly load-balance memory access across the whole shared global address space of the system. Four additional Extended Memory Semantics bits per 64-bit memory word enable lightweight, fine-grained synchronization on all memory. There are no hardware interrupts, and hardware threads are allocated by an instruction, not by the OS.
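The word-level synchronization described above can be illustrated with a small software model. This is only a sketch under stated assumptions: it assumes that one of the Extended Memory Semantics bits behaves as a full/empty flag, as in the earlier Tera MTA design, and it emulates that flag with a Python condition variable. The class `TaggedWord` and its methods `write_when_empty` and `read_when_full` are hypothetical names chosen for illustration, not the XMT programming interface.

```python
import threading


class TaggedWord:
    """Software model of one memory word with a full/empty flag.

    Assumption: the real hardware keeps this flag in the extra bits
    attached to each 64-bit word; here it is an ordinary Python field.
    """

    def __init__(self):
        self._value = 0
        self._full = False
        self._cond = threading.Condition()

    def write_when_empty(self, value):
        """Block until the word is empty, store the value, mark it full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        """Block until the word is full, take the value, mark it empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def run_demo(n=5):
    """Pass n values from a producer to a consumer through a single word."""
    word = TaggedWord()
    received = []

    def producer():
        for i in range(n):
            word.write_when_empty(i * i)

    def consumer():
        for _ in range(n):
            received.append(word.read_when_full())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_demo())
```

In this model the word itself carries the synchronization state, so the producer and consumer need no separate lock or queue object. On the real machine the equivalent check would be done by the memory system for each access rather than by library code.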
The front end and back end communicate through the LUC interface, an RPC-style bidirectional client/server interface.
Threadstorm3
Threadstorm3 is a 64-bit single-core VLIW barrel processor with 128 hardware streams, onto each of which a software thread can be mapped. It runs at 500 MHz and uses the MTA instruction set or a superset of it. It has a 128 KB, 4-way associative data buffer. Each Threadstorm3 has 128 separate register sets and program counters, which are almost fully context-switched at each cycle. Its estimated peak performance is 1.5 GFLOPS. It has 3 functional units, which receive operations from the same MTA instruction and operate within the same cycle. Each stream has 32 general-purpose registers, 8 target registers and a status word that contains the program counter. High-level control of job allocation across threads is not possible. Because of the MTA's pipeline length of 21, a stream that has issued an instruction is not selected to execute again until at least 21 cycles later. The TDP of the processor package is 30 W.
Because they switch thread context at each cycle, Threadstorm CPUs are not constrained in performance by memory access time. In a simplified model, at each clock cycle an instruction from one of the threads is executed and another memory request is queued, on the understanding that the requested data will have arrived by the time that thread's next turn comes around. This contrasts with many conventional architectures, which stall on memory access. The architecture excels in data-walking schemes where subsequent memory accesses cannot be easily predicted and which are therefore poorly suited to a conventional cache model. Threadstorm's principal architect was Burton J. Smith.
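The simplified execution model above can be sketched as a toy simulation. It is a minimal model, not a description of the real issue logic: it assumes strict round-robin selection among ready streams, that every instruction occupies its stream for exactly the 21 cycles given in the text, and that memory never adds extra delay. The function names and the `cycles` parameter are hypothetical. The peak-rate helper assumes one operation per functional unit per cycle, which is one way to reconcile the 500 MHz clock, the 3 functional units and the 1.5 GFLOPS figure, but is not stated in the text.

```python
PIPELINE_DEPTH = 21      # cycles before a stream may issue again (from the text)
CLOCK_HZ = 500e6         # 500 MHz clock (from the text)
FUNCTIONAL_UNITS = 3     # functional units fed by one instruction (from the text)


def issue_utilization(active_streams, cycles=2100, depth=PIPELINE_DEPTH):
    """Fraction of cycles in which some stream issues an instruction.

    Toy model: each cycle, pick the next ready stream in round-robin
    order; after issuing, that stream is not ready for `depth` cycles.
    """
    next_ready = [0] * active_streams   # first cycle each stream may issue
    issued = 0
    cursor = 0                          # where the round-robin scan starts
    for cycle in range(cycles):
        for offset in range(active_streams):
            s = (cursor + offset) % active_streams
            if next_ready[s] <= cycle:
                next_ready[s] = cycle + depth
                issued += 1
                cursor = (s + 1) % active_streams
                break                   # at most one issue per cycle
    return issued / cycles


def peak_ops_per_second(units=FUNCTIONAL_UNITS, clock_hz=CLOCK_HZ):
    """Peak rate assuming one operation per functional unit per cycle."""
    return units * clock_hz


if __name__ == "__main__":
    for n in (1, 7, 21, 128):
        print(n, "streams:", round(issue_utilization(n), 3))
    print("peak ops/s:", peak_ops_per_second())
```

Under these assumptions a single stream keeps the processor busy only one cycle in 21, while 21 or more ready streams fill every issue slot. This is why the design depends on having many concurrent threads rather than on caches to hide latency.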