Graphics Core Next
Graphics Core Next is the codename for a series of microarchitectures and an instruction set architecture that were developed by AMD for its GPUs as the successor to its TeraScale microarchitecture. The first product featuring GCN was launched on January 9, 2012.
GCN is a reduced instruction set SIMD microarchitecture, contrasting with the very long instruction word SIMD architecture of TeraScale. GCN requires considerably more transistors than TeraScale, but offers advantages for general-purpose GPU computation because it permits a simpler compiler.
GCN graphics chips were fabricated with CMOS at 28 nm, and with FinFET at 14 nm and 7 nm, available on selected models in AMD's Radeon HD 7000, HD 8000, 200, 300, 400, 500 and Vega series of graphics cards, including the separately released Radeon VII. GCN was also used in the graphics portion of Accelerated Processing Units, including those in the PlayStation 4 and Xbox One.
GCN was succeeded by the RDNA microarchitecture and instruction set architecture in 2019.
Instruction set
The GCN instruction set is owned by AMD and was developed specifically for GPUs. It has no micro-operation for division.

Documentation is available for each of the five GCN generations.
GNU Compiler Collection 9 has supported GCN 3 and GCN 5 since 2019 for single-threaded, stand-alone programs; GCC 10 also supports offloading via OpenMP and OpenACC.
MIAOW is an open-source RTL implementation of the AMD Southern Islands GPGPU microarchitecture.
In November 2015, AMD announced its Boltzmann Initiative, which aims to enable the porting of CUDA-based applications to a common C++ programming model.
At the Super Computing 15 event, AMD displayed a Heterogeneous Compute Compiler, a headless Linux driver and HSA runtime infrastructure for cluster-class high-performance computing, and a Heterogeneous-compute Interface for Portability tool for porting CUDA applications to the aforementioned common C++ model.
Microarchitectures
As of July 2017, the Graphics Core Next instruction set has seen five iterations. The differences between the first four generations are rather minimal, but the fifth-generation GCN architecture features heavily modified stream processors to improve performance and support the simultaneous processing of two lower-precision numbers in place of a single higher-precision number.

Command processing
Graphics Command Processor
The Graphics Command Processor is a functional unit of the GCN microarchitecture. Among other tasks, it is responsible for the handling of asynchronous shaders.

Asynchronous Compute Engine
The Asynchronous Compute Engine (ACE) is a distinct functional block serving compute tasks, similar in purpose to the Graphics Command Processor.

Schedulers
Since the third iteration of GCN, the hardware contains two schedulers: one to schedule wavefronts during shader execution and another to schedule the execution of draw and compute queues. The latter helps performance by executing compute operations when the compute units are underutilized, i.e. when graphics commands are limited by fixed-function pipeline speed or by bandwidth. This functionality is known as Async Compute.

For a given shader, the GPU drivers may also schedule instructions on the CPU to minimize latency.
Geometry processor
The geometry processor contains a Geometry Assembler, a Tessellator, and a Vertex Assembler. The Tessellator performs tessellation in hardware as defined by Direct3D 11 and OpenGL 4.6, and succeeded ATI TruForm and the hardware tessellation of TeraScale as AMD's then-latest semiconductor intellectual property core for this task.
Compute units
One compute unit combines 64 shader processors with 4 texture mapping units. The compute units are separate from, but feed into, the render output units. Each compute unit consists of the following:
- a CU scheduler
- a Branch & Message Unit
- 4 16-lane-wide SIMD Vector Units
- 4 64 KiB vector general-purpose register files
- 1 scalar unit
- an 8 KiB scalar GPR file
- a local data share of 64 KiB
- 4 Texture Filter Units
- 16 Texture Fetch Load/Store Units
- a 16 KiB level 1 cache
Every SIMD-VU has some private memory where it stores its registers. There are two types of registers: scalar registers, which each hold one 4-byte number, and vector registers, which each represent a set of 64 4-byte numbers. Operations on vector registers are applied in parallel to all 64 numbers, which correspond to 64 inputs; for example, a SIMD-VU may work on 64 different pixels at a time.
Every SIMD-VU has room for 512 scalar registers and 256 vector registers.
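These register counts are consistent with the per-CU figures listed above; a quick arithmetic check, using only the numbers quoted in this section:

```python
# Per-SIMD-VU register storage, from the figures quoted in this section.
VECTOR_REGS = 256   # vector registers per SIMD-VU
SCALAR_REGS = 512   # scalar registers per SIMD-VU
LANES = 64          # a vector register holds one 4-byte value per wavefront lane
BYTES = 4           # each value is 4 bytes

vgpr_bytes = VECTOR_REGS * LANES * BYTES   # per-SIMD vector GPR file
sgpr_bytes = SCALAR_REGS * BYTES           # per-SIMD scalar register storage

print(vgpr_bytes // 1024)      # 64 -> the 64 KiB vector GPR file listed above
print(4 * sgpr_bytes // 1024)  # 8  -> the 8 KiB scalar GPR file across 4 SIMD-VUs
```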
AMD has claimed that each GCN compute unit has 64 KiB Local Data Share.
CU scheduler
The CU scheduler is the hardware functional block that chooses which wavefronts the SIMD-VUs execute. It picks one SIMD-VU per cycle for scheduling. This is not to be confused with other hardware or software schedulers.

Wavefront
A shader is a small program written in GLSL that performs graphics processing, and a kernel is a small program written in OpenCL that performs GPGPU processing. These programs need comparatively few registers, but they must load data from system or graphics memory, an operation that comes with significant latency. AMD and Nvidia chose similar approaches to hide this unavoidable latency: each groups multiple threads together. AMD calls such a group a "wavefront", whereas Nvidia calls it a "warp". A group of threads is the most basic unit of scheduling on GPUs that take this approach to hiding latency: it is the minimum size of the data processed in SIMD fashion, the smallest executable unit of code, and the mechanism by which a single instruction is processed across all of the threads in the group at the same time.

In all GCN GPUs, a "wavefront" consists of 64 threads, and in all Nvidia GPUs, a "warp" consists of 32 threads.
AMD's solution is to attribute multiple wavefronts to each SIMD-VU. The hardware distributes the registers among the different wavefronts, and when one wavefront is waiting on a result from memory, the CU scheduler assigns the SIMD-VU another wavefront. Wavefronts are attributed per SIMD-VU; SIMD-VUs do not exchange wavefronts. At most 10 wavefronts can be attributed per SIMD-VU.
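This latency-hiding scheme can be illustrated with a toy model. The sketch below is illustrative only, not AMD's actual scheduling logic: it round-robins over the wavefronts resident on one SIMD-VU, skipping any wavefront that is (in this model) still waiting on memory.

```python
from collections import deque

def schedule(wavefronts, total_cycles):
    """Toy wavefront scheduler for one SIMD-VU.
    wavefronts: list of (name, ready_at_cycle) pairs; ready_at_cycle models
    the cycle at which a pending memory result arrives.
    Returns a trace of (cycle, issued_wavefront_or_None)."""
    trace = []
    queue = deque(wavefronts)
    for cycle in range(total_cycles):
        for _ in range(len(queue)):
            name, ready_at = queue[0]
            queue.rotate(-1)            # move inspected wavefront to the back
            if cycle >= ready_at:       # wavefront is ready: issue it this cycle
                trace.append((cycle, name))
                break
        else:
            trace.append((cycle, None))  # all wavefronts stalled: a bubble
    return trace

# "w1" is stalled until cycle 2, so "w0" keeps the SIMD-VU busy meanwhile.
print(schedule([("w0", 0), ("w1", 2)], 3))
# [(0, 'w0'), (1, 'w0'), (2, 'w1')]
```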
AMD CodeXL shows tables relating the number of SGPRs and VGPRs to the number of resident wavefronts: essentially, the SGPRs available per wavefront range between 104 and 512 depending on the number of wavefronts, and the 256 VGPRs are divided among the wavefronts.
Note that, as with the SSE instructions, this concept of the most basic level of parallelism is often called the "vector width". The vector width is characterized by the total number of bits in it.
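In those terms, a GCN wavefront can be compared with a CPU SIMD register (illustrative arithmetic only):

```python
# Effective "vector width" of a GCN wavefront vs. an SSE register.
WAVEFRONT_THREADS = 64
BITS_PER_VALUE = 32          # single-precision lane width

gcn_width = WAVEFRONT_THREADS * BITS_PER_VALUE  # bits per wavefront
sse_width = 128                                 # an SSE register is 128 bits

print(gcn_width)               # 2048
print(gcn_width // sse_width)  # 16: one wavefront spans 16 SSE-register widths
```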
SIMD Vector Unit
Each SIMD Vector Unit has:
- a 16-lane integer and floating-point vector Arithmetic Logic Unit
- 64 KiB Vector General Purpose Register file
- 10× 48-bit Program Counters
- Instruction buffer for 10 wavefronts
- A 64-thread wavefront issues to a 16-lane SIMD Unit over four cycles
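The last point follows directly from the lane counts (simple arithmetic from the figures above):

```python
# Why a 64-thread wavefront takes four cycles on a 16-lane SIMD Unit.
WAVEFRONT_THREADS = 64
SIMD_LANES = 16
SIMDS_PER_CU = 4

issue_cycles = WAVEFRONT_THREADS // SIMD_LANES  # cycles to issue one wavefront
lanes_per_cu = SIMDS_PER_CU * SIMD_LANES        # lanes kept busy per cycle per CU

print(issue_cycles)  # 4
print(lanes_per_cu)  # 64
```

With one wavefront in flight on each of the four SIMD Vector Units, a compute unit can thus keep all 64 of its shader processors busy every cycle.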
Audio and video acceleration blocks
Implementations of GCN are typically accompanied by several of AMD's other ASIC blocks, including but not limited to the Unified Video Decoder, the Video Coding Engine, and AMD TrueAudio.

Video Coding Engine
The Video Coding Engine (VCE) is a video encoding ASIC, first introduced with the Radeon HD 7000 series. The initial version of the VCE added support for encoding I- and P-frames of H.264 in the YUV420 pixel format, along with SVE temporal encode and Display Encode Mode; the second version added B-frame support for YUV420 and support for YUV444 I-frames.
VCE 3.0 formed a part of the third generation of GCN, adding high-quality video scaling and the HEVC codec.
VCE 4.0 was part of the Vega architecture, and was subsequently succeeded by Video Core Next.
TrueAudio
Unified virtual memory
In a preview in 2011, AnandTech wrote about the unified virtual memory supported by Graphics Core Next.

Heterogeneous System Architecture (HSA)
Some of the specific HSA features implemented in the hardware need support from the operating system's kernel and/or from specific device drivers. For example, in July 2014, AMD published a set of 83 patches to be merged into Linux kernel mainline 3.17 for supporting their Graphics Core Next-based Radeon graphics cards. The so-called HSA kernel driver resides in the directory, while the DRM graphics device drivers reside in and augment the already existing DRM drivers for Radeon cards. This very first implementation focuses on a single "Kaveri" APU and works alongside the existing Radeon kernel graphics driver.

Lossless Delta Color Compression
Hardware schedulers
Hardware schedulers are used to perform scheduling and to offload the assignment of compute queues to the ACEs from the driver to hardware. The HWS buffers these queues until at least one queue slot in at least one ACE is empty, then immediately assigns buffered queues to the ACEs until all ACE queues are full or no more queues can safely be assigned.

Part of the scheduling work includes prioritized queues, which allow critical tasks to run at a higher priority than other tasks without requiring the lower-priority tasks to be preempted. The high-priority tasks are scheduled to occupy the GPU as much as possible, while other tasks run concurrently using the resources the high-priority tasks leave free.

Hardware schedulers are essentially Asynchronous Compute Engines that lack dispatch controllers. They were first introduced in the fourth-generation GCN microarchitecture, but were already present in the third generation for internal testing purposes; a driver update later enabled the hardware schedulers in third-generation GCN parts for production use.
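The priority behaviour described above can be sketched as a toy allocation model. The names and slot counts below are invented for illustration; this is not AMD's hardware interface:

```python
def assign_slots(queues, total_slots):
    """Toy model of concurrent prioritized scheduling: higher-priority queues
    get first claim on execution slots, while lower-priority queues run
    concurrently on whatever remains instead of being preempted.
    queues: list of (name, priority, requested_slots)."""
    allocation = {}
    free = total_slots
    for name, _priority, wanted in sorted(queues, key=lambda q: -q[1]):
        granted = min(wanted, free)   # high priority is served first...
        allocation[name] = granted
        free -= granted               # ...and the rest share the leftovers
    return allocation

# The latency-critical queue gets all 3 slots it asked for; the graphics
# queue keeps running concurrently on the remaining 5 of 8 slots.
print(assign_slots([("graphics", 0, 6), ("latency_critical", 10, 3)], 8))
```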