Tensor Processing Unit
Tensor Processing Unit (TPU) is a neural processing unit, an application-specific integrated circuit (ASIC) developed by Google for neural network machine learning. TPUs are supported by the TensorFlow, JAX, and PyTorch frameworks. Google began using TPUs internally in 2015, and in 2018 made them available for third-party use, both as part of its cloud infrastructure and by offering a smaller version of the chip for sale.
Comparison to CPUs and GPUs
Compared to a graphics processing unit, TPUs are designed for a high volume of low-precision computation with more input/output operations per joule, and lack hardware for rasterisation and texture mapping. The TPU ASICs are mounted in a heatsink assembly which, according to Norman Jouppi, can fit in a hard drive slot within a data center rack.

Different types of processors are suited for different types of machine learning models. TPUs are well suited for convolutional neural networks, while GPUs have benefits for some fully connected neural networks, and CPUs can have advantages for recurrent neural networks.
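To make the "low precision" point concrete, the sketch below quantizes float32 weights and activations to 8-bit integers and accumulates in 32-bit integers, the general style of arithmetic that 8-bit inference hardware favours. It is an illustrative NumPy example with a simple symmetric quantization scheme chosen for brevity, not a description of Google's actual quantization pipeline.

```python
import numpy as np

# Toy example of 8-bit inference arithmetic: quantize float32 tensors to int8,
# multiply-accumulate in int32, then rescale back to float. The symmetric
# per-tensor scheme here is a simplification chosen for clarity.

def quantize_int8(x):
    """Symmetric per-tensor quantization of a float array to int8 plus a scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)
activations = rng.standard_normal((1, 256)).astype(np.float32)

wq, w_scale = quantize_int8(weights)
aq, a_scale = quantize_int8(activations)

# 8-bit multiplies with 32-bit accumulation, then a single rescale to float.
acc = aq.astype(np.int32) @ wq.astype(np.int32)
approx = acc.astype(np.float32) * (a_scale * w_scale)

exact = activations @ weights
print("max abs quantization error:", float(np.max(np.abs(approx - exact))))
```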
History
In 2013, Google recruited Dr. Amir Salek to establish custom silicon development capabilities for the company's datacenters. As founder and head of Custom Silicon for Google Technical Infrastructure and Google Cloud, Salek led the development of the original TPU, TPUv2, TPUv3, TPUv4, Edge TPU, and additional silicon products including the VCU, IPU, and OpenTitan.

According to Jonathan Ross, one of the original TPU engineers and later the founder of Groq, three separate groups at Google were developing AI accelerators, with the TPU, a systolic array design, being the one ultimately selected.
Norman P. Jouppi served as the tech lead and principal architect for Google's Tensor Processing Unit development, leading the rapid design, verification, and deployment of the first TPU to production in just 15 months. As lead author of the seminal 2017 paper "In-Datacenter Performance Analysis of a Tensor Processing Unit," presented at the 44th International Symposium on Computer Architecture, Jouppi demonstrated that the TPU achieved 15–30× higher performance and 30–80× higher performance-per-watt than contemporary CPUs and GPUs, establishing the TPU as a foundational platform for neural network inference at scale across Google's production services.
The tensor processing unit was announced in May 2016 at the Google I/O conference, when the company said that the TPU had been used inside their data centers for over a year. Google's 2017 paper describing its creation cites previous systolic matrix multipliers of similar architecture built in the 1990s. The chip was specifically designed for Google's TensorFlow framework, a symbolic math library used for machine learning applications such as neural networks. However, as of 2017 Google still used CPUs and GPUs for other types of machine learning. Other vendors have also produced AI accelerator designs, aimed at embedded and robotics markets.
Google's TPUs are proprietary. Some models are commercially available, and on February 12, 2018, The New York Times reported that Google "would allow other companies to buy access to those chips through its cloud-computing service." Google has said that TPUs were used in the AlphaGo versus Lee Sedol series of human-versus-machine Go games, as well as in the AlphaZero system, which produced chess, shogi, and Go-playing programs from the game rules alone and went on to beat the leading programs in those games. Google has also used TPUs for Google Street View text processing and was able to find all the text in the Street View database in less than five days. In Google Photos, an individual TPU can process over 100 million photos a day. It is also used in RankBrain, which Google uses to provide search results.
Google provides third parties with access to TPUs through its Cloud TPU service, part of the Google Cloud Platform, and through its notebook-based services Kaggle and Colaboratory.
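As a rough illustration of how a notebook user reaches a TPU, the snippet below uses JAX to list the attached accelerators and run a matrix multiply on them. It assumes a TPU runtime has already been provisioned (for example, by selecting a TPU runtime in Colaboratory); on a machine without a TPU the same code simply runs on CPU.

```python
# Minimal check for an attached Cloud TPU from a Python notebook using JAX.
import jax
import jax.numpy as jnp

devices = jax.devices()            # accelerators visible to JAX
print(devices)                     # e.g. a list of TpuDevice objects on a TPU runtime

x = jnp.ones((1024, 1024))
y = jnp.dot(x, x)                  # dispatched to the TPU when one is present
print(y.shape, jax.devices()[0].platform)  # platform is 'tpu' on a TPU runtime
```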
Broadcom is a co-developer of TPUs, translating Google's architecture and specifications into manufacturable silicon. It provides proprietary technologies such as SerDes high-speed interfaces, oversees ASIC design, and manages chip fabrication and packaging through third-party foundries such as Taiwan Semiconductor Manufacturing Company. This collaboration has covered all TPU generations since the program's inception.
In September 2025, Google was reported to be in talks with several "neoclouds," including Crusoe and CoreWeave, about deploying TPUs in their datacenters. In November 2025, Meta was reported to be in talks with Google about deploying TPUs in its AI datacenters.
Products
| Generation | v1 | v2 | v3 | v4 | v5e | v5p | v6e | v7 |
| Date introduced | 2015 | 2017 | 2018 | 2021 | 2023 | 2023 | 2024 | 2025 |
| Process node | 28 nm | 16 nm | 16 nm | 7 nm | Not listed | Not listed | Not listed | Not listed |
| Die size (mm²) | 331 | < 625 | < 700 | < 400 | 300–350 | Not listed | Not listed | Not listed |
| On-chip memory (MiB) | 28 | 32 | 32 + 5 | 128 + 32 + 10 | Not listed | Not listed | Not listed | Not listed |
| Clock speed (MHz) | 700 | 700 | 940 | 1050 | Not listed | 1750 | Not listed | Not listed |
| Memory | 8 GiB DDR3 | 16 GiB HBM | 32 GiB HBM | 32 GiB HBM | 16 GB HBM | 95 GB HBM | 32 GB | 192 GB HBM |
| Memory bandwidth | 34 GB/s | 600 GB/s | 900 GB/s | 1200 GB/s | 819 GB/s | 2765 GB/s | 1640 GB/s | 7.37 TB/s |
| Thermal design power (W) | 75 | 280 | 220 | 170 | Not listed | Not listed | Not listed | Not listed |
| Computational performance (TOPS) | 23 | 45 | 123 | 275 | 197 (bf16) / 393 (int8) | 459 (bf16) / 918 (int8) | 918 (bf16) / 1836 (int8) | 4614 |
| Energy efficiency | Not listed | Not listed | Not listed | 4.7 | Not listed | Not listed | Not listed | Not listed |
First generation TPU
The first-generation TPU is an 8-bit matrix multiplication engine, driven with CISC instructions by the host processor across a PCIe 3.0 bus. It is manufactured on a 28 nm process with a die size ≤ 331 mm². The clock speed is 700 MHz and it draws 28–40 W in operation. It has 28 MiB of on-chip memory, and 4 MiB of 32-bit accumulators taking the results of a 256×256 systolic array of 8-bit multipliers. Within the TPU package is 8 GiB of dual-channel 2133 MHz DDR3 SDRAM offering 34 GB/s of bandwidth. Instructions transfer data to or from the host, perform matrix multiplications or convolutions, and apply activation functions.
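The following sketch illustrates the weight-stationary multiply-accumulate dataflow that each cell of a systolic array performs, scaled down from 256×256 to a 4×4 toy array in plain NumPy. It is a conceptual illustration of the arithmetic (8-bit multiplies feeding 32-bit accumulators), not a model of the first-generation TPU's actual pipeline or timing.

```python
import numpy as np

# Conceptual sketch of the multiply-accumulate work done by the cells of a
# systolic array: weights stay resident in the array, activations stream
# through, and 32-bit accumulators collect partial sums.

N = 4                                             # stand-in for the 256x256 array
rng = np.random.default_rng(0)
weights = rng.integers(-128, 128, size=(N, N), dtype=np.int8)      # stationary
activations = rng.integers(-128, 128, size=(8, N), dtype=np.int8)  # streamed rows

accumulators = np.zeros((8, N), dtype=np.int32)   # 32-bit accumulators

for row in range(activations.shape[0]):           # each activation row flows in
    for i in range(N):                            # cell (i, j) holds weights[i, j]
        for j in range(N):
            accumulators[row, j] += np.int32(activations[row, i]) * np.int32(weights[i, j])

# The accumulated result equals an ordinary matrix product.
assert np.array_equal(accumulators,
                      activations.astype(np.int32) @ weights.astype(np.int32))
print(accumulators)
```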
Second generation TPU
The second-generation TPU was announced in May 2017. Google stated that the first-generation TPU design was limited by memory bandwidth, and that using 16 GB of High Bandwidth Memory in the second-generation design increased bandwidth to 600 GB/s and performance to 45 teraFLOPS. The TPUs are arranged into four-chip modules with a performance of 180 teraFLOPS, and 64 of these modules are assembled into 256-chip pods with 11.5 petaFLOPS of performance. Notably, while the first-generation TPUs were limited to integer arithmetic, the second-generation TPUs can also calculate in floating point, introducing the bfloat16 format invented by Google Brain. This makes the second-generation TPUs useful for both training and inference of machine learning models. Google stated that these second-generation TPUs would be available on the Google Compute Engine for use in TensorFlow applications.
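A brief sketch of what bfloat16 looks like in practice: the format keeps float32's 8-bit exponent, so it spans roughly the same dynamic range, but truncates the mantissa to 7 bits, halving memory and bandwidth per value. The JAX example below simply casts to and from the type; it runs on CPU as well and does not require a TPU.

```python
import jax.numpy as jnp

# bfloat16: float32's exponent width with a 7-bit mantissa, so roughly the same
# dynamic range at about 2-3 significant decimal digits of precision.
x = jnp.arange(8, dtype=jnp.float32) / 3.0
x_bf16 = x.astype(jnp.bfloat16)

print(x_bf16.dtype)                 # bfloat16
print(x_bf16.astype(jnp.float32))   # values rounded to bfloat16 precision

# In JAX you simply operate on bfloat16 arrays; on TPU the matrix unit
# typically accumulates such products in float32 internally.
a = jnp.ones((4, 4), dtype=jnp.bfloat16)
print((a @ a).dtype)                # bfloat16
```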
Third generation TPU
The third-generation TPU was announced on May 8, 2018. Google announced that the processors themselves are twice as powerful as the second-generation TPUs, and would be deployed in pods with four times as many chips as the preceding generation. This results in an 8-fold increase in performance per pod compared to the second-generation TPU deployment.
Fourth generation TPU
On May 18, 2021, Google CEO Sundar Pichai spoke about TPU v4 Tensor Processing Units during his keynote at the Google I/O virtual conference. TPU v4 improved performance by more than 2x over TPU v3 chips. Pichai said, "A single v4 pod contains 4,096 v4 chips, and each pod has 10x the interconnect bandwidth per chip at scale, compared to any other networking technology." An April 2023 paper by Google claims TPU v4 is 5–87% faster than an Nvidia A100 at machine learning benchmarks.

There is also an "inference" version, called the v4i, which does not require liquid cooling.
Fifth generation TPU
In 2021, Google revealed that the physical layout of TPU v5 was being designed with the assistance of a novel application of deep reinforcement learning. Google claims TPU v5 is nearly twice as fast as TPU v4, and based on that and the relative performance of TPU v4 over the A100, some have speculated that TPU v5 is as fast as or faster than an H100.

Similar to the v4i being a lighter-weight version of the v4, the fifth generation has a "cost-efficient" version called v5e. In December 2023, Google announced TPU v5p, which is claimed to be competitive with the Nvidia H100.