General-purpose computing on graphics processing units
General-purpose computing on graphics processing units (GPGPU) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.
Essentially, a GPGPU pipeline is a kind of parallel processing between one or more GPUs and CPUs, with special accelerated instructions for processing images or other graphical data. While GPUs operate at lower frequencies, they typically have many times the number of processing elements. Thus, GPUs can process far more pictures and other graphical data per second than a traditional CPU. Migrating data into parallel form and then using the GPU to process it can create a large speedup.
GPGPU pipelines were developed at the beginning of the 21st century for graphics processing. The history of supercomputing shows that scientific computing drives the largest concentrations of computing power, as listed in the TOP500; the majority of these systems today use GPUs.
The best-known GPGPUs are Nvidia Tesla, which is used in Nvidia DGX systems, alongside AMD Instinct and Intel Gaudi.
History
In principle, any arbitrary Boolean function, including addition, multiplication, and other mathematical functions, can be built up from a functionally complete set of logic operators. In 1987, Conway's Game of Life became one of the first examples of general-purpose computing using an early stream processor called a blitter to invoke a special sequence of logical operations on bit vectors.
General-purpose computing on GPUs became more practical and popular after about 2001, with the advent of both programmable shaders and floating-point support on graphics processors. Notably, problems involving matrices and/or vectors, especially two-, three-, or four-dimensional vectors, were easy to translate to a GPU, which handles those types natively and at full speed. A significant milestone for GPGPU came in 2003, when two research groups independently discovered GPU-based approaches to general linear algebra problems that ran faster than on CPUs. These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors, OpenGL and Direct3D. This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as Sh/RapidMind, Brook and Accelerator.
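The point made at the start of this section, that arithmetic can be built from a functionally complete set of logic operators applied to bit vectors, can be illustrated with a small sketch. It is not a reconstruction of any historical blitter program. It assumes NAND as the only primitive and treats a Python integer as an 8-lane bit vector, so each call performs eight independent one-bit operations at once. All names (`MASK`, `nand`, `full_adder`, and so on) are hypothetical and chosen for illustration.

```python
MASK = 0xFF  # assumption: each integer is treated as a vector of 8 one-bit lanes


def nand(a, b):
    """Bitwise NAND applied to all 8 lanes at once (the only primitive used)."""
    return ~(a & b) & MASK


def xor(a, b):
    """XOR built from four NAND operations."""
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))


def and_(a, b):
    """AND built from two NAND operations."""
    t = nand(a, b)
    return nand(t, t)


def or_(a, b):
    """OR built from three NAND operations."""
    return nand(nand(a, a), nand(b, b))


def full_adder(a, b, carry_in):
    """One-bit full adder applied lane-wise; returns (sum_bits, carry_bits)."""
    partial = xor(a, b)
    sum_bits = xor(partial, carry_in)
    carry_bits = or_(and_(a, b), and_(partial, carry_in))
    return sum_bits, carry_bits


# The three operands below enumerate all eight (a, b, carry) combinations,
# one combination per lane, so a single call checks the whole truth table.
sum_bits, carry_bits = full_adder(0b11110000, 0b11001100, 0b10101010)
print(format(sum_bits, "08b"), format(carry_bits, "08b"))  # 10010110 11101000
```

Chaining such adders lane by lane would give multi-bit addition; the sketch stops at the single-bit case to keep the mapping from logic operators to arithmetic visible.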
These were followed by Nvidia's CUDA, which allowed programmers to ignore the underlying graphical concepts in favor of more common high-performance computing concepts. Newer, hardware-vendor-independent offerings include Microsoft's DirectCompute and Apple/Khronos Group's OpenCL. This means that modern GPGPU pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form.
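The shift from graphics primitives to general high-performance computing concepts can be sketched with an index-based kernel. The example below is a sequential CPU emulation in Python, not CUDA code and not any vendor's API: the names `launch`, `saxpy_kernel`, `grid_dim`, and `block_dim` are hypothetical, loosely modeled on the grid-of-blocks vocabulary common to such frameworks. On a real GPU the per-index calls would run in parallel rather than in a loop.

```python
def saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y, out):
    """Body executed once per logical thread: out[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < n:  # guard: the last block may extend past the end of the data
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Hypothetical launcher: runs the kernel for every (block, thread) pair.

    This loop is sequential; it only emulates the indexing scheme.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # round up so every element is covered

launch(saxpy_kernel, grid_dim, block_dim, n, 2.0, x, y, out)
print(out)  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0]
```

Nothing in the kernel refers to textures, vertices, or pixels: the data stays in ordinary arrays, which is the sense in which explicit conversion to a graphical form is no longer required.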
Mark Harris, the founder of GPGPU.org, claims he coined the term GPGPU.
Implementations
Software libraries and APIs
Any language that allows the code running on the CPU to poll a GPU shader for return values can create a GPGPU framework. Programming standards for parallel computing include OpenCL, OpenACC, OpenMP and OpenHMPP. OpenCL is the dominant open general-purpose GPU computing language, and is an open standard defined by the Khronos Group. OpenCL provides a cross-platform GPGPU platform that additionally supports data-parallel compute on CPUs. OpenCL is actively supported on Intel, AMD, Nvidia, and ARM platforms. The Khronos Group has also standardised and implemented SYCL, a higher-level programming model for OpenCL as a single-source domain-specific embedded language based on pure C++11.
The dominant proprietary framework is Nvidia CUDA. Nvidia launched CUDA in 2006 as a software development kit and application programming interface that allows algorithms to be coded in the C programming language for execution on GeForce 8 series and later GPUs.
ROCm, launched in 2016, is AMD's open-source response to CUDA. As of 2022, it is on par with CUDA with regard to features, but still lacking in consumer support.
OpenVIDIA was developed at the University of Toronto between 2003 and 2005, in collaboration with Nvidia.
Altimesh Hybridizer, created by Altimesh, compiles Common Intermediate Language to CUDA binaries. It supports generics and virtual functions. Debugging and profiling are integrated with Visual Studio and Nsight. It is available as a Visual Studio extension on Visual Studio Marketplace.
Microsoft introduced the DirectCompute GPU computing API, released with the Direct3D 11 API.
Alea GPU, created by QuantAlea, introduces native GPU computing capabilities for the Microsoft .NET languages F# and C#. Alea GPU also provides a simplified GPU programming model based on GPU parallel-for and parallel aggregate using delegates and automatic memory management.
MATLAB supports GPGPU acceleration using the Parallel Computing Toolbox and MATLAB Distributed Computing Server, and third-party packages like Jacket.
GPGPU processing is also used by physics engines to simulate Newtonian physics. Commercial implementations include Havok Physics FX and PhysX, both of which are typically used for computer and video games.
C++ Accelerated Massive Parallelism (C++ AMP) is a library that accelerates execution of C++ code by exploiting the data-parallel hardware on GPUs.
Mobile computers
Due to the increasing power of mobile GPUs, general-purpose programming also became available on mobile devices running major mobile operating systems. Google Android 4.2 enabled running RenderScript code on the mobile device GPU. RenderScript has since been deprecated in favour of first OpenGL compute shaders and later Vulkan Compute. OpenCL is available on many Android devices, but is not officially supported by Android. Apple introduced the proprietary Metal API for iOS applications, able to execute arbitrary code through Apple's GPU compute shaders.
GPU vs. CPU
Originally, data was simply passed one way from a central processing unit to a graphics processing unit, then to a display device. As time progressed, however, it became valuable for GPUs to store at first simple, then complex structures of data to be passed back to the CPU, which analyzed an image or a set of scientific data represented in a 2D or 3D format that a video card can understand. Because the GPU has access to every draw operation, it can analyze data in these forms quickly, whereas a CPU must poll every pixel or data element much more slowly: access between a CPU and its larger pool of random-access memory is slower than access between a GPU and its video memory, which is typically smaller, more expensive, and much faster. Transferring the portion of the data set to be actively analyzed to that GPU memory, in the form of textures or other easily readable GPU forms, results in a speed increase. The distinguishing feature of a GPGPU design is the ability to transfer information bidirectionally, including back from the GPU to the CPU; ideally the data throughput in both directions is high, resulting in a multiplier effect on the speed of a specific high-use algorithm.
GPGPU pipelines may improve efficiency on especially large data sets and/or data containing 2D or 3D imagery. They are used in complex graphics pipelines as well as in scientific computing, particularly in fields with large data sets such as genome mapping, or where two- or three-dimensional analysis is useful, currently including biomolecule analysis, protein study, and other complex organic chemistry. An example of such an application is Nvidia's software suite for genome analysis.
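The interplay described above between faster GPU-side processing and the cost of moving data to and from GPU memory can be made concrete with a rough model. This is a simplified sketch, not a measurement: it assumes that transfer time is simply bytes moved divided by a fixed link bandwidth, that transfer and computation do not overlap, and that the kernel itself runs a fixed factor faster on the GPU. The function name `offload_speedup` and all numbers are hypothetical.

```python
def offload_speedup(cpu_seconds, kernel_speedup, bytes_moved, link_bytes_per_second):
    """Estimate the end-to-end speedup of offloading one task to a GPU.

    Assumed model (illustrative only):
      gpu_total = cpu_seconds / kernel_speedup + bytes_moved / link_bytes_per_second
    """
    transfer_seconds = bytes_moved / link_bytes_per_second
    gpu_seconds = cpu_seconds / kernel_speedup + transfer_seconds
    return cpu_seconds / gpu_seconds


# Hypothetical task: 10 s on the CPU, kernel 50x faster on the GPU,
# 8 GB moved in total (both directions) over a 16 GB/s link.
with_transfer = offload_speedup(10.0, 50.0, 8e9, 16e9)
no_transfer = offload_speedup(10.0, 50.0, 0.0, 16e9)

print(round(with_transfer, 2))  # 14.29
print(round(no_transfer, 2))    # 50.0
```

Under these assumptions the half second spent on transfers reduces a 50-fold kernel speedup to roughly 14-fold overall, which is why high throughput in both directions matters for the multiplier effect.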
Such pipelines can also vastly improve efficiency in image processing and computer vision, among other fields, as well as in parallel processing generally. Some very heavily optimized pipelines have yielded speed increases of several hundred times over the original CPU-based pipeline on one high-use task.
A simple example would be a GPU program that collects data about average lighting values as it renders some view from either a camera or a computer graphics program, and returns that data to the main program on the CPU so that the CPU can then make adjustments to the overall screen view. A more advanced example might use edge detection to return both numerical information and a processed image representing outlines to a computer vision program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every pixel or other picture element in an image, it can analyze and average it, or apply a Sobel edge filter or other convolution filter, with much greater speed than a CPU, which typically must access slower random-access memory copies of the graphic in question.
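The two examples above, averaging lighting values and applying a Sobel edge filter, can be sketched as a CPU reference in plain Python. This shows only what is computed, not how a GPU would schedule it: on a GPU each output pixel would typically be handled by its own thread. The helper names, the choice of |gx| + |gy| as the gradient magnitude, and leaving border pixels at zero are assumptions made for brevity.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # responds to horizontal change
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # responds to vertical change


def mean_luminance(image):
    """Average of all pixel values (a reduction over the whole image)."""
    total = sum(sum(row) for row in image)
    return total / (len(image) * len(image[0]))


def sobel_at(image, r, c):
    """Gradient magnitude at one interior pixel, approximated as |gx| + |gy|."""
    gx = gy = 0
    for dr in range(3):
        for dc in range(3):
            p = image[r + dr - 1][c + dc - 1]
            gx += SOBEL_X[dr][dc] * p
            gy += SOBEL_Y[dr][dc] * p
    return abs(gx) + abs(gy)


def sobel(image):
    """Apply the filter to every interior pixel; border pixels are left at 0."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # Each output depends only on a 3x3 input window, so all
            # iterations are independent and could run in parallel.
            out[r][c] = sobel_at(image, r, c)
    return out


# A 4x4 test image with a vertical edge between the dark and bright halves.
image = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]

print(mean_luminance(image))  # 5.0
print(sobel(image))  # [[0, 0, 0, 0], [0, 40, 40, 0], [0, 40, 40, 0], [0, 0, 0, 0]]
```

The average is a single number returned to the host program, while the filter returns a processed image, matching the two kinds of result mentioned above.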
GPGPU as a software concept is a type of algorithm, not a piece of equipment. Specialized equipment designs may, however, further enhance the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on very large amounts of data. Massively parallelized, gigantic-data-level tasks may thus be parallelized even further via specialized setups such as rack computing, which adds a third layer: many computing units, each using many CPUs to correspond to many GPUs. Some Bitcoin "miners" used such setups for high-quantity processing. Information on the largest such systems in the world is maintained in the TOP500 supercomputer list.