Bulldozer (microarchitecture)
The AMD Bulldozer Family 15h is a microprocessor microarchitecture for the FX and Opteron line of processors, developed by AMD for the desktop and server markets. Bulldozer is the codename for this family of microarchitectures. It was released on October 12, 2011, as the successor to the K10 microarchitecture.
Bulldozer was designed from scratch rather than as a development of earlier processors. The core is specifically aimed at computing products with TDPs of 10 to 125 watts. AMD claims dramatic performance-per-watt efficiency improvements in high-performance computing applications with Bulldozer cores.
The Bulldozer cores support most of the instruction sets implemented by Intel processors available at its introduction, as well as new instruction sets proposed by AMD: ABM, XOP, FMA4 and F16C. Only Bulldozer GEN4 supports the AVX2 instruction set.
Overview
According to AMD, Bulldozer-based CPUs are built on GlobalFoundries' 32 nm silicon on insulator (SOI) process technology and reuse the approach of DEC for multitasking computer performance, with the argument that it, according to press notes, "balances dedicated and shared computer resources to provide a highly compact, high units count design that is easily replicated on a chip for performance scaling." In other words, by eliminating some of the "redundant" elements that naturally creep into multicore designs, AMD hoped to take better advantage of its hardware capabilities while using less power.
Bulldozer-based implementations built on 32 nm SOI with HKMG arrived in October 2011 for both servers and desktops. The server segment included the dual-chip Opteron processor codenamed Interlagos and the single-chip Valencia, while Zambezi targeted desktops on Socket AM3+.
Bulldozer is the first major redesign of AMD’s processor architecture since 2003, when the firm launched its K8 processors. Its basic building block features two 128-bit FMA-capable FPUs, which can be combined into one 256-bit FPU, accompanied by two integer clusters, each with four pipelines, and a shared L2 cache. AMD calls this building block a "module". A 16-core processor design would feature eight of these modules, with the operating system recognizing each module as two logical cores.
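The module arithmetic in this paragraph can be tallied with a short sketch. The class and function names here (`Module`, `chip_summary`) are hypothetical and exist only for illustration; the per-module counts are the ones stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """Per-module resources as described in the text (illustrative only)."""
    integer_clusters: int = 2        # one per logical core
    pipelines_per_cluster: int = 4   # integer pipelines in each cluster
    fmac_units_128bit: int = 2       # can pair up as one 256-bit FPU


def chip_summary(modules: int, module: Module = Module()) -> dict:
    """Aggregate the per-module counts over a whole chip."""
    return {
        "modules": modules,
        "logical_cores": modules * module.integer_clusters,
        "integer_pipelines": modules
        * module.integer_clusters
        * module.pipelines_per_cluster,
        "fpus_128bit": modules * module.fmac_units_128bit,
        "fpus_256bit": modules * (module.fmac_units_128bit // 2),
    }


# An eight-module design, which the operating system sees as 16 logical cores.
print(chip_summary(8))
```

The same eight-module chip therefore exposes 16 logical cores but only eight FPUs when they operate in 256-bit mode.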
The modular architecture consists of a multithreaded shared L2 cache and the FlexFPU, which uses simultaneous multithreading. Each physical integer core, two per module, is single-threaded, in contrast with Intel's Hyper-Threading, where two virtual simultaneous threads share the resources of a single physical core.
In a retrospective review, Jeremy Laird of APC magazine commented on Bulldozer's issues, noting that it was slower than the outgoing Phenom II K10 design and that the PC software ecosystem had not yet "embraced" the multi-threaded model. By his observation, these issues caused a substantial loss for AMD, and the company lost over 1 billion USD in 2012. Some industry observers were predicting the bankruptcy of AMD by mid-2015. The company later managed to return to profitability. The reasons cited for the recovery were the earlier divestiture of in-house manufacturing into GlobalFoundries, the outsourcing of manufacturing to TSMC, and the creation of the new Ryzen CPU design.
Architecture
Bulldozer core
Bulldozer made use of "Clustered Multithreading" (CMT), a technique where some parts of the processor are shared between two threads and some parts are unique to each thread. Prior examples of such an unconventional approach to multithreading can be traced back to Sun Microsystems' UltraSPARC T1 CPU of 2005.
In terms of hardware complexity and functionality, a Bulldozer CMT module is equal to a dual-core processor in its integer calculation capabilities, and to either a single-core processor or a handicapped dual-core in terms of floating-point computational power, depending on whether the code is saturated with floating-point instructions in both threads running on the same CMT module, and whether the FPU is performing 128-bit or 256-bit floating-point operations. The reason for this is that for each two integer cores, that is, within the same module, there is a single floating-point unit consisting of a pair of 128-bit FMAC execution units.
CMT is in some ways a simpler design philosophy than SMT, but a similar one: both designs try to utilize execution units efficiently, and in either method, when two threads compete for some execution pipelines, there is a loss in performance in one or more of the threads. Due to dedicated integer cores, the Bulldozer family modules performed roughly like a dual-core, dual-threaded processor during sections of code that were either wholly integer or a mix of integer and floating-point calculations; yet, due to the SMT use of the shared floating-point pipelines, the module would perform similarly to a single-core, dual-threaded SMT processor for a pair of threads saturated with floating-point instructions.
Both CMT and SMT are at peak effectiveness when running integer and floating-point code on a pair of threads. CMT stays at peak effectiveness when both threads consist of integer code, whereas under SMT one or both threads will underperform due to competition for integer execution units. The disadvantage of CMT is a greater number of idle integer execution units in the single-threaded case: a single thread can use at most half of the integer execution units in its module, while SMT imposes no such limit. A large SMT core with integer circuitry as wide and fast as two CMT cores could in theory momentarily deliver up to twice the integer performance in the single-threaded case.
CMT processors and a typical SMT processor are similar in their efficient shared use of the L2 cache between a pair of threads.
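The integer-side trade-off between CMT and SMT described above can be illustrated with a deliberately crude model. This is a sketch under stated assumptions, not a performance model of any real processor: it counts only peak integer issue width, assumes an SMT core splits its width evenly between two active threads, and ignores caches, front-end sharing and scheduling. The function names are hypothetical.

```python
def cmt_width_per_thread(active_threads: int, width_per_core: int) -> float:
    """Integer issue width available to each thread in a two-core CMT module.

    Each thread owns one integer core, so its share does not depend on
    whether the sibling core is busy.
    """
    if active_threads not in (1, 2):
        raise ValueError("a module runs one or two threads")
    return float(width_per_core)


def smt_width_per_thread(active_threads: int, core_width: int) -> float:
    """Integer issue width per thread in a two-way SMT core.

    Assumption: the two threads split the shared width evenly.
    """
    if active_threads not in (1, 2):
        raise ValueError("a two-way SMT core runs one or two threads")
    return core_width / active_threads


WIDTH = 4  # integer pipelines per Bulldozer integer core, from the text

for threads in (1, 2):
    print(
        f"threads={threads}",
        f"CMT={cmt_width_per_thread(threads, WIDTH)}",
        f"SMT(same width)={smt_width_per_thread(threads, WIDTH)}",
        f"SMT(double width)={smt_width_per_thread(threads, 2 * WIDTH)}",
    )
```

With one thread, the hypothetical double-width SMT core offers twice the width of a CMT core; with two integer threads, the same-width SMT core offers each thread only half of what CMT provides.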
- A module consists of a coupling of two "conventional" x86 out-of-order processing cores. Each processing core shares the early pipeline stages, the FPUs, and the L2 cache with the rest of the module.
  - Each module has the following independent hardware resources:
  - 16 KB of 4-way L1 data cache (L1d) per core and 64 KB of 2-way L1 instruction cache (L1i) per module, one way for each of the two cores
  - 2 MB of L2 cache per module
    - The Write Coalescing Cache (WCC) is a special cache that is part of the L2 cache in the Bulldozer microarchitecture. Stores from both L1D caches in the module go through the WCC, where they are buffered and coalesced. The WCC's task is to reduce the number of writes to the L2 cache.
  - Two dedicated integer cores
    - Each one includes two ALUs and two AGUs, which together are capable of four independent arithmetic and memory operations per clock per core
    - Duplicating the integer schedulers and execution pipelines gives each of the two threads dedicated hardware, which doubles performance for multi-threaded integer loads
    - The second integer core in the module increases the Bulldozer module die area by around 12%, which at chip level adds about 5% of total die space
  - Two symmetrical 128-bit FMAC floating-point pipelines per module that can be unified into one large 256-bit-wide unit if one of the integer cores dispatches an AVX instruction, and two symmetrical x87/MMX/SSE-capable FPPs for backward compatibility with software not optimized for SSE2. Each FMAC unit is also capable of division and square root operations with variable latency.
- All modules present share the L3 cache as well as an Advanced Dual-Channel Memory Sub-System.
- A module has 213 million transistors in an area of 30.9 mm² on an Orochi die.
- The pipeline depth of Bulldozer is 20 cycles, compared to 12 cycles of the K10 core predecessor.
- The width of the Bulldozer integer core, four, is somewhat less than the width of the K10 core, six. Bobcat and Jaguar also used a four wide integer core, yet with lighter execution units: 1 ALU, 1 simple ALU, 1 load AGU, 1 store AGU.
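The effect of the Write Coalescing Cache mentioned in the list above, merging several stores before they reach L2, can be sketched with a toy buffer. This is not a description of AMD's hardware: the 64-byte line size, the four-line capacity, the oldest-first write-back policy and all names are assumptions chosen for illustration.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed for illustration)


class WriteCoalescingBuffer:
    """Toy write-coalescing buffer between the L1D caches and L2."""

    def __init__(self, capacity_lines: int = 4):
        self.capacity_lines = capacity_lines  # arbitrary illustrative size
        self.pending = OrderedDict()  # line number -> stores merged so far
        self.l2_writes = 0

    def store(self, address: int) -> None:
        """Buffer one store; stores to an already buffered line are merged."""
        line = address // LINE_SIZE
        if line not in self.pending and len(self.pending) == self.capacity_lines:
            self._write_back_oldest()
        self.pending[line] = self.pending.get(line, 0) + 1

    def _write_back_oldest(self) -> None:
        """Send the oldest buffered line to L2 as a single write."""
        self.pending.popitem(last=False)
        self.l2_writes += 1

    def flush(self) -> None:
        """Write back every line still held in the buffer."""
        while self.pending:
            self._write_back_oldest()


def count_l2_writes(addresses, capacity_lines: int = 4) -> int:
    """Run a store stream through the buffer and count the L2 writes."""
    buffer = WriteCoalescingBuffer(capacity_lines)
    for address in addresses:
        buffer.store(address)
    buffer.flush()
    return buffer.l2_writes


# Sixteen 8-byte stores covering two adjacent lines reach L2 as two writes.
print(count_l2_writes(range(0, 128, 8)))
```

When the buffer is too small for the access pattern, for example a one-line buffer with stores alternating between two lines, no merging takes place and every store becomes its own L2 write.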
Branch predictor
- Two-level Branch Target Buffer
- Hybrid predictor for conditionals
- Indirect predictor
Instruction set extensions
- Support for Intel's Advanced Vector Extensions (AVX) instruction set, which supports 256-bit floating-point operations, and SSE4.1, SSE4.2, AES, CLMUL, as well as future 128-bit instruction sets proposed by AMD, which have the same functionality as the SSE5 instruction set formerly proposed by AMD, but with compatibility with the AVX coding scheme.
- Bulldozer GEN4 supports the AVX2 instruction set.
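On Linux, the extensions named in this article appear as flags in `/proc/cpuinfo`, so their presence on a given machine can be checked with a short script. The sketch below assumes the usual Linux flag spellings (for example `pclmulqdq` for CLMUL). The helper names and the sample text are made up for illustration and are not output captured from a real Bulldozer system.

```python
# Linux /proc/cpuinfo flag names for extensions named in the article.
EXTENSION_FLAGS = {
    "AVX": "avx",
    "AVX2": "avx2",
    "SSE4.1": "sse4_1",
    "SSE4.2": "sse4_2",
    "AES": "aes",
    "CLMUL": "pclmulqdq",
    "ABM": "abm",
    "XOP": "xop",
    "FMA4": "fma4",
    "F16C": "f16c",
}


def parse_flags(cpuinfo_text: str) -> set:
    """Return the flag set from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "flags":
            return set(value.split())
    return set()


def supported_extensions(cpuinfo_text: str) -> dict:
    """Map each extension name to True if its flag is present."""
    flags = parse_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in EXTENSION_FLAGS.items()}


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the real cpuinfo file (Linux only; not called in this sketch)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Hypothetical excerpt, written by hand for the example.
SAMPLE = "model name\t: example CPU\nflags\t\t: fpu sse4_1 sse4_2 aes pclmulqdq avx abm xop fma4\n"

for name, present in supported_extensions(SAMPLE).items():
    print(f"{name:7s} {'yes' if present else 'no'}")
```

On a Linux system, `supported_extensions(read_cpuinfo())` reports the flags of the processor actually installed.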
Process technology and clock frequency
- 11-metal-layer 32 nm SOI process implementing GlobalFoundries' first-generation High-K Metal Gate (HKMG)
- Turbo Core 2 performance boost, which raises the clock frequency by up to 500 MHz with all threads active and by up to 1 GHz with half of the threads active, within the TDP limit
- The chip operates at 0.775 to 1.425 V, achieving clock frequencies of 3.6 GHz or more
- Min-Max TDP: 25 – 140 watts
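The Turbo Core 2 figures in the list above can be read as an upper bound on the boosted clock. The sketch below encodes only those two stated limits. Treating every load above half of the threads like the all-threads case is an assumption, the base clock in the example is arbitrary, and the real boost also depends on the TDP headroom available at the time. The function name is hypothetical.

```python
def turbo_ceiling_mhz(base_mhz: int, active_threads: int, total_threads: int) -> int:
    """Upper bound on the boosted clock, from the two limits quoted in the text.

    Up to +1000 MHz when at most half of the threads are active,
    up to +500 MHz otherwise (assumed to apply to anything above half).
    """
    if not 1 <= active_threads <= total_threads:
        raise ValueError("active_threads must be between 1 and total_threads")
    boost_mhz = 1000 if 2 * active_threads <= total_threads else 500
    return base_mhz + boost_mhz


BASE_MHZ = 3000  # arbitrary illustrative base clock, not a product specification

for active in (1, 4, 5, 8):
    print(active, turbo_ceiling_mhz(BASE_MHZ, active, total_threads=8))
```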
Cache and memory interface
- Up to 8 MB of L3 shared among all cores on the same silicon die, divided into four subcaches of 2 MB each, capable of operating at 2.2 GHz at 1.1125 V
- Native DDR3 memory support up to DDR3-1866
- Dual Channel DDR3 integrated memory controller for Desktop and Server/Workstation Opteron 42xx "Valencia"; Quad Channel DDR3 Integrated Memory Controller for Server/Workstation Opteron 62xx "Interlagos"
- AMD claims support for two DIMMs of DDR3-1600 per channel. Two DIMMs of DDR3-1866 on a single channel will be down-clocked to 1600.
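The memory figures above translate into theoretical peak bandwidth with simple arithmetic: transfers per second times 8 bytes per transfer (a DDR3 channel is 64 bits wide) times the number of channels. The sketch below also applies the down-clocking rule for two DIMMs per channel as stated in the list. The function names are hypothetical, and the results are theoretical peaks rather than measured throughput.

```python
BYTES_PER_TRANSFER = 8  # a DDR3 channel is 64 bits wide


def effective_speed(dimm_speed: int, dimms_per_channel: int) -> int:
    """DDR3 speed grade (MT/s) after the per-channel limit from the text."""
    if dimms_per_channel not in (1, 2):
        raise ValueError("the text covers one or two DIMMs per channel")
    limit = 1600 if dimms_per_channel == 2 else 1866
    return min(dimm_speed, limit)


def peak_bandwidth_mb_s(speed_mt_s: int, channels: int) -> int:
    """Theoretical peak bandwidth in MB/s across all channels."""
    return speed_mt_s * BYTES_PER_TRANSFER * channels


# Dual channel, one DDR3-1866 DIMM per channel.
print(peak_bandwidth_mb_s(effective_speed(1866, 1), channels=2))
# Dual channel, two DDR3-1866 DIMMs per channel (down-clocked to 1600).
print(peak_bandwidth_mb_s(effective_speed(1866, 2), channels=2))
# Quad channel at DDR3-1600.
print(peak_bandwidth_mb_s(effective_speed(1600, 1), channels=4))
```

Under these assumptions, populating both slots of each channel with DDR3-1866 lowers the dual-channel peak from 29,856 MB/s to 25,600 MB/s.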