NVLink
NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices can use mesh networking to communicate instead of a central hub/switch. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect.
For small numbers of GPUs, the NVLink lanes on a single device are sufficient for all-to-all mesh connectivity. To accommodate higher GPU counts, NVLink has used a packet-switched architecture since 2018, in which a central switch can serve up to 32 two-lane ports. The NVSwitch for NVLink 4.0 can also perform some simple computation of its own, reducing the need for communication, thanks to its "SHARP" accelerator.
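How a particular system is wired, whether as directly connected GPUs or through an NVSwitch fabric, can be inspected from software. The following is only a minimal sketch, assuming the pynvml Python bindings to NVML are available; it enumerates the NVLink links of each GPU and the PCI address of each remote endpoint (which may be a peer GPU, a CPU, or an NVSwitch port).
```python
# Minimal sketch (assumes the pynvml bindings to NVML are installed).
# Lists which NVLink links are active on each GPU and where they lead.
import pynvml

pynvml.nvmlInit()
try:
    for gpu in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu)
        print(f"GPU {gpu}: {pynvml.nvmlDeviceGetName(handle)}")
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
            except pynvml.NVMLError:
                continue  # this link index is not present on this GPU
            if state == pynvml.NVML_FEATURE_ENABLED:
                # The remote endpoint may be a peer GPU, a CPU, or an NVSwitch port.
                remote = pynvml.nvmlDeviceGetNvLinkRemotePciInfo(handle, link)
                print(f"  link {link}: active, remote PCI {remote.busId}")
finally:
    pynvml.nvmlShutdown()
```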
Principle
NVLink was developed by Nvidia for data and control-code transfers in processor systems, between CPUs and GPUs and between GPUs. NVLink specifies a point-to-point connection with data rates of 20, 25, and 50 Gbit/s per differential pair. For NVLink 1.0 and 2.0, eight differential pairs form a "sub-link", and two "sub-links", one for each direction, form a "link". Starting from NVLink 3.0, only four differential pairs form a "sub-link". For NVLink 2.0 and higher, the total data rate for a sub-link is 25 GB/s and the total data rate for a link is 50 GB/s. Each V100 GPU supports up to six links, so each GPU is capable of up to 300 GB/s of total bidirectional bandwidth. NVLink products introduced to date focus on the high-performance application space. Announced May 14, 2020, NVLink 3.0 increases the data rate per differential pair from 25 Gbit/s to 50 Gbit/s while decreasing the number of pairs per NVLink from 8 to 4. With 12 links for an Ampere-based A100 GPU, this brings the total bandwidth to 600 GB/s. The Hopper GPU microarchitecture, announced in March 2022, has 18 NVLink 4.0 links, enabling a total bandwidth of 900 GB/s. Thus, NVLink 2.0, 3.0, and 4.0 all have a 50 GB/s data rate per bidirectional link and have 6, 12, and 18 links, respectively.
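As a concrete illustration of the arithmetic above, the following Python sketch recomputes the per-link and per-GPU figures from the per-pair signaling rates quoted in this section; the helper function and its name are purely illustrative.
```python
# Illustrative arithmetic only: recomputes the NVLink bandwidth figures
# quoted in this section from the per-pair signaling rate.

def link_rate_gbytes_per_s(gbit_per_pair, pairs_per_sublink):
    """Data rate of one sub-link (i.e. one direction of a link), in GB/s."""
    return gbit_per_pair * pairs_per_sublink / 8  # 8 bits per byte

# NVLink 2.0 (V100): 25 Gbit/s per pair, 8 pairs per sub-link, 6 links
v100 = link_rate_gbytes_per_s(25, 8)
print(v100, "GB/s per direction;", 2 * v100 * 6, "GB/s total over 6 links")

# NVLink 3.0 (A100): 50 Gbit/s per pair, 4 pairs per sub-link, 12 links
a100 = link_rate_gbytes_per_s(50, 4)
print(a100, "GB/s per direction;", 2 * a100 * 12, "GB/s total over 12 links")

# NVLink 4.0 (Hopper): 18 links at the same 50 GB/s bidirectional rate per link
print(50 * 18, "GB/s total over 18 links")
```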
Performance
The following table shows a basic metrics comparison based on standard specifications:
| Interconnect | Transfer rate | Line code | Modulation | Effective payload rate per lane (PCIe) or per link (NVLink) | Total links | Total bandwidth | Realized in design |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PCIe 3.x | 8 GT/s | 128b/130b | NRZ | 0.99 GB/s | — | 31.51 GB/s | Pascal, Volta, Turing |
| PCIe 4.0 | 16 GT/s | 128b/130b | NRZ | 1.97 GB/s | — | 63.02 GB/s | Volta on Xavier, Ampere, POWER9 |
| PCIe 5.0 | 32 GT/s | 128b/130b | NRZ | 3.94 GB/s | — | 126.03 GB/s | Hopper |
| PCIe 6.0 | 64 GT/s | 236B/256B FLIT | PAM4, FEC | 7.56 GB/s | — | 242 GB/s | Blackwell |
| NVLink 1.0 | 20 GT/s | — | NRZ | 20 GB/s | 4 | 160 GB/s | Pascal, POWER8+ |
| NVLink 2.0 | 25 GT/s | — | NRZ | 25 GB/s | 6 | 300 GB/s | Volta, POWER9 |
| NVLink 3.0 | 50 GT/s | — | NRZ | 25 GB/s | 12 | 600 GB/s | Ampere |
| NVLink 4.0 | 50 GT/s | — | PAM4 | 25 GB/s | 18 | 900 GB/s | Hopper, Nvidia Grace |
| NVLink 5.0 | 100 GT/s | — | PAM4 | 50 GB/s | 18 | 1800 GB/s | Blackwell, Nvidia Grace |
The following table shows a comparison of relevant bus parameters for real-world semiconductors that offer NVLink as one of their options (the column arithmetic is illustrated in a short sketch after the table):
| Semiconductor | Board / bus delivery variant | Interconnect | Transfer rate (per lane) | Lanes per sub-link (out + in) | Sub-link data rate (per direction) | Sub-link or unit count | Total data rate (out + in) | Total lanes (out + in) | Total data rate (combined) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nvidia GP100 | P100 SXM, P100 PCI-E | PCIe 3.0 | 8 GT/s | 16 + 16 | 128 Gbit/s = 16 GB/s | 1 | 16 + 16 GB/s | 32 | 32 GB/s |
| Nvidia GV100 | V100 SXM2, V100 PCI-E | PCIe 3.0 | 8 GT/s | 16 + 16 | 128 Gbit/s = 16 GB/s | 1 | 16 + 16 GB/s | 32 | 32 GB/s |
| Nvidia TU104 | GeForce RTX 2080, Quadro RTX 5000 | PCIe 3.0 | 8 GT/s | 16 + 16 | 128 Gbit/s = 16 GB/s | 1 | 16 + 16 GB/s | 32 | 32 GB/s |
| Nvidia TU102 | GeForce RTX 2080 Ti, Quadro RTX 6000/8000 | PCIe 3.0 | 8 GT/s | 16 + 16 | 128 Gbit/s = 16 GB/s | 1 | 16 + 16 GB/s | 32 | 32 GB/s |
| Nvidia GA100, Nvidia GA102 | Ampere A100 | PCIe 4.0 | 16 GT/s | 16 + 16 | 256 Gbit/s = 32 GB/s | 1 | 32 + 32 GB/s | 32 | 64 GB/s |
| Nvidia GP100 | P100 SXM | NVLink 1.0 | 20 GT/s | 8 + 8 | 160 Gbit/s = 20 GB/s | 4 | 80 + 80 GB/s | 64 | 160 GB/s |
| Nvidia GV100 | V100 SXM2 | NVLink 2.0 | 25 GT/s | 8 + 8 | 200 Gbit/s = 25 GB/s | 6 | 150 + 150 GB/s | 96 | 300 GB/s |
| Nvidia TU104 | GeForce RTX 2080, Quadro RTX 5000 | NVLink 2.0 | 25 GT/s | 8 + 8 | 200 Gbit/s = 25 GB/s | 1 | 25 + 25 GB/s | 16 | 50 GB/s |
| Nvidia TU102 | GeForce RTX 2080 Ti, Quadro RTX 6000/8000 | NVLink 2.0 | 25 GT/s | 8 + 8 | 200 Gbit/s = 25 GB/s | 2 | 50 + 50 GB/s | 32 | 100 GB/s |
| Nvidia GA100 | Ampere A100 | NVLink 3.0 | 50 GT/s | 4 + 4 | 200 Gbit/s = 25 GB/s | 12 | 300 + 300 GB/s | 96 | 600 GB/s |
| Nvidia GA102 | GeForce RTX 3090, Quadro RTX A6000 | NVLink 3.0 | 28.125 GT/s | 4 + 4 | 112.5 Gbit/s = 14.0625 GB/s | 4 | 56.25 + 56.25 GB/s | 16 | 112.5 GB/s |
| NVSwitch for Hopper | — | NVLink 4.0 | 106.25 GT/s | 9 + 9 | 450 Gbit/s = 56.25 GB/s | 18 | 3600 + 3600 GB/s | 128 | 7200 GB/s |
| Nvidia Grace CPU | Nvidia GH200 Superchip | PCIe-5 @ 512 GB/s | - | - | - | - | - | - | - |
| Nvidia Grace CPU | Nvidia GH200 Superchip | NVLink-C2C @ 900 GB/s | - | - | - | - | - | - | - |
| Nvidia Hopper GPU | Nvidia GH200 Superchip | NVLink-C2C @ 900 GB/s | - | - | - | - | - | - | - |
| Nvidia Hopper GPU | Nvidia GH200 Superchip | NVLink 4 @ 900 GB/s | - | - | - | - | - | - | - |
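To make the relation between the table's columns explicit, the following illustrative sketch recomputes two rows (GV100 with NVLink 2.0 and GA100 with NVLink 3.0) from lanes per sub-link, per-lane rate, and sub-link count; it uses only values already given in the table.
```python
# Illustrative only: recomputes two table rows from their column values.

def row(lanes_per_direction, gbit_per_lane, sublinks):
    sublink_gbit = lanes_per_direction * gbit_per_lane   # per direction
    sublink_gbyte = sublink_gbit / 8                      # Gbit/s -> GB/s
    per_direction_total = sublink_gbyte * sublinks
    return sublink_gbit, sublink_gbyte, per_direction_total

# Nvidia GV100, NVLink 2.0: 8 + 8 lanes at 25 GT/s, 6 sub-links
gbit, gbyte, total = row(8, 25, 6)
print(f"GV100: {gbit} Gbit/s = {gbyte} GB/s per sub-link, "
      f"{total} + {total} GB/s, {2 * total} GB/s combined")

# Nvidia GA100, NVLink 3.0: 4 + 4 lanes at 50 GT/s, 12 sub-links
gbit, gbyte, total = row(4, 50, 12)
print(f"GA100: {gbit} Gbit/s = {gbyte} GB/s per sub-link, "
      f"{total} + {total} GB/s, {2 * total} GB/s combined")
```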
Real-world performance is lower than these raw figures because of various data-transmission overheads, and it also depends on how fully the link is used. These overheads come from several sources (a simplified calculation is sketched after the list):
- 128b/130b line code
- Link control characters
- Transaction header
- Buffering capabilities
- DMA usage on the computer side
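As a simplified illustration of the first item, the sketch below applies only the line-code overhead to the raw transfer rate and ignores the remaining factors; the helper function is illustrative, not an exact model.
```python
# Rough illustration only: applies line-code overhead to the raw transfer rate.
# The other overheads listed above (headers, flow control, buffering, DMA) are ignored.

def payload_rate_gbytes_per_s(gtransfers_per_s, payload_bits, coded_bits):
    """Per-lane payload rate in GB/s after line-code overhead."""
    return gtransfers_per_s * (payload_bits / coded_bits) / 8

# PCIe 3.x: 8 GT/s with 128b/130b encoding -> ~0.98 GB/s per lane
print(round(payload_rate_gbytes_per_s(8, 128, 130), 2))

# PCIe 4.0: 16 GT/s with 128b/130b encoding -> ~1.97 GB/s per lane
print(round(payload_rate_gbytes_per_s(16, 128, 130), 2))

# PCIe 5.0: 32 GT/s with 128b/130b encoding -> ~3.94 GB/s per lane
print(round(payload_rate_gbytes_per_s(32, 128, 130), 2))
```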
Usage with plug-in boards
For the various versions of plug-in boards that expose extra connectors for joining them into an NVLink group, there is a similar number of slightly varying, relatively compact, PCB-based interconnection plugs ("bridges"). Typically only boards of the same type will mate together, owing to their physical and logical design. For some setups, two identical plugs must be installed to achieve the full data rate. The typical plug is U-shaped, with a fine-pitch edge connector on each of the end strokes of the shape. The width of the plug determines how far apart the plug-in cards must be seated in the host system's mainboard, so the spacing of the cards is commonly dictated by the matching plug. The interconnect is often referred to as SLI (Scalable Link Interface, from 2004) because of its structural design and appearance, even though the modern NVLink-based design is of a quite different technical nature, with different features at its basic levels. Reported real-world devices (a software check of direct peer access between bridged boards is sketched after this list) are:
- Quadro GP100
- Quadro GV100
- GeForce RTX 2080 based on TU104
- GeForce RTX 2080 Ti based on TU102
- GeForce RTX 3090 based on GA102
- Quadro RTX 5000 based on TU104
- Quadro RTX 6000 based on TU102
- Quadro RTX 8000 based on TU102
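Whether two bridged boards can actually address each other's memory directly can be checked from software. The following is a minimal sketch assuming PyTorch with CUDA support and at least two GPUs; note that peer access being available does not by itself prove the path runs over NVLink rather than PCIe, so vendor tools such as nvidia-smi are still needed to confirm the link type.
```python
# Minimal sketch (assumes PyTorch with CUDA support and at least two GPUs).
# Checks whether each pair of devices can access each other's memory directly;
# this is a prerequisite for exploiting an NVLink bridge, but a positive result
# alone does not distinguish an NVLink path from a PCIe peer-to-peer path.
import torch

n = torch.cuda.device_count()
for a in range(n):
    for b in range(n):
        if a != b:
            ok = torch.cuda.can_device_access_peer(a, b)
            print(f"GPU {a} -> GPU {b}: peer access {'available' if ok else 'not available'}")
```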
Service software and programming