Tensor Processing Unit


The Tensor Processing Unit (TPU) is a neural processing unit: an application-specific integrated circuit (ASIC) developed by Google for neural network machine learning. TPUs support the TensorFlow, JAX, and PyTorch frameworks. Google began using TPUs internally in 2015, and in 2018 made them available for third-party use, both as part of its cloud infrastructure and by offering a smaller version of the chip for sale.

Comparison to CPUs and GPUs

Compared to a graphics processing unit, TPUs are designed for a high volume of low-precision computation with more input/output operations per joule, and they lack hardware for rasterisation and texture mapping. The TPU ASICs are mounted in a heatsink assembly, which can fit in a hard drive slot within a data center rack, according to Norman Jouppi.
Different types of processors are suited for different types of machine learning models. TPUs are well suited for convolutional neural networks, while GPUs have benefits for some fully connected neural networks, and CPUs can have advantages for recurrent neural networks.

History

In 2013, Google recruited Dr. Amir Salek to establish custom silicon development capabilities for the company's datacenters. As founder and head of Custom Silicon for Google Technical Infrastructure and Google Cloud, Salek led the development of the original TPU, TPUv2, TPUv3, TPUv4, Edge-TPU, and additional silicon products including the VCU, IPU, and OpenTitan.
According to Jonathan Ross, one of the original TPU engineers, and later the founder of Groq, three separate groups at Google were developing AI accelerators, with the TPU, a systolic array, being the design that was ultimately selected.
Norman P. Jouppi served as the tech lead and principal architect for Google's Tensor Processing Unit development, leading the rapid design, verification, and deployment of the first TPU to production in just 15 months. As lead author of the seminal 2017 paper "In-Datacenter Performance Analysis of a Tensor Processing Unit," presented at the 44th International Symposium on Computer Architecture, Jouppi demonstrated that the TPU achieved 15–30× higher performance and 30–80× higher performance-per-watt than contemporary CPUs and GPUs, establishing the TPU as a foundational platform for neural network inference at scale across Google's production services.
The tensor processing unit was announced in May 2016 at the Google I/O conference, when the company said that the TPU had been used inside their data centers for over a year. Google's 2017 paper describing its creation cites previous systolic matrix multipliers of similar architecture built in the 1990s. The chip was specifically designed for Google's TensorFlow framework, a symbolic math library used for machine learning applications such as neural networks. However, as of 2017 Google still used CPUs and GPUs for other types of machine learning. AI accelerator designs from other vendors are also appearing, aimed at embedded and robotics markets.
Google's TPUs are proprietary. Some models are commercially available, and on February 12, 2018, The New York Times reported that Google "would allow other companies to buy access to those chips through its cloud-computing service." Google has said that they were used in the AlphaGo versus Lee Sedol series of human-versus-machine Go games, as well as in the AlphaZero system, which produced Chess, Shogi and Go playing programs from the game rules alone and went on to beat the leading programs in those games. Google has also used TPUs for Google Street View text processing and was able to find all the text in the Street View database in less than five days. In Google Photos, an individual TPU can process over 100 million photos a day. It is also used in RankBrain which Google uses to provide search results.
Google provides third parties access to TPUs through its Cloud TPU service as part of the Google Cloud Platform and through its notebook-based services Kaggle and Colaboratory.
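For illustration, the following is a minimal sketch of what such access looks like from a notebook, assuming a JAX runtime with a TPU attached (for example a Cloud TPU VM or a Kaggle/Colab TPU notebook); it simply lists the devices the runtime exposes and dispatches a small computation to them.

```python
# Minimal sketch, assuming a JAX runtime with a TPU attached (e.g. a Cloud TPU VM
# or a Kaggle/Colab TPU notebook): list the accelerator devices the runtime exposes
# and run a small jitted computation on the default backend.
import jax
import jax.numpy as jnp

print(jax.default_backend())  # "tpu" when a TPU backend is active, otherwise "cpu" or "gpu"
print(jax.devices())          # on a TPU runtime, one entry per TPU core

# jit-compiled computations go through XLA and are placed on the default backend.
x = jnp.ones((1024, 1024), dtype=jnp.float32)
y = jax.jit(lambda a: a @ a.T)(x)
print(y.shape)                # (1024, 1024)
```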
Broadcom is a co-developer of TPUs, translating Google's architecture and specifications into manufacturable silicon. It provides proprietary technologies such as high-speed SerDes interfaces, oversees the ASIC design, and manages chip fabrication and packaging through third-party foundries such as Taiwan Semiconductor Manufacturing Company; it has done so for all generations since the program's inception.
In September 2025, Google was reported to be in talks with several "neoclouds", including Crusoe and CoreWeave, about deploying TPUs in their data centers. In November 2025, Meta was reported to be in talks with Google about deploying TPUs in its AI data centers.

Products

Version | v1 | v2 | v3 | v4 | v5e | v5p | v6e | v7
Date introduced | 2015 | 2017 | 2018 | 2021 | 2023 | 2023 | 2024 | 2025
Process node | 28 nm | 16 nm | 16 nm | 7 nm | Not listed | Not listed | Not listed | Not listed
Die size (mm²) | 331 | < 625 | < 700 | < 400 | 300–350 | Not listed | Not listed | Not listed
On-chip memory (MiB) | 28 | 32 | 32 + 5 | 128 + 32 + 10 | Not listed | Not listed | Not listed | Not listed
Clock speed (MHz) | 700 | 700 | 940 | 1050 | Not listed | 1750 | Not listed | Not listed
Memory | 8 GiB DDR3 | 16 GiB HBM | 32 GiB HBM | 32 GiB HBM | 16 GB HBM | 95 GB HBM | 32 GB HBM | 192 GB HBM
Memory bandwidth | 34 GB/s | 600 GB/s | 900 GB/s | 1200 GB/s | 819 GB/s | 2765 GB/s | 1640 GB/s | 7.37 TB/s
Thermal design power (W) | 75 | 280 | 220 | 170 | Not listed | Not listed | Not listed | Not listed
Computational performance (TOPS) | 23 | 45 | 123 | 275 | 197 (bf16) / 393 (int8) | 459 (bf16) / 918 (int8) | 918 (bf16) / 1836 (int8) | 4614
Energy efficiency (TOPS/W) | Not listed | Not listed | Not listed | 4.7 | Not listed | Not listed | Not listed | Not listed

First generation TPU

The first-generation TPU is an 8-bit matrix multiplication engine, driven with CISC instructions by the host processor across a PCIe 3.0 bus. It is manufactured on a 28 nm process with a die size ≤ 331 mm². The clock speed is 700 MHz and it has a thermal design power of 28–40 W. It has 28 MiB of on-chip memory and 4 MiB of 32-bit accumulators that take the results of a 256×256 systolic array of 8-bit multipliers. Within the TPU package is 8 GiB of dual-channel 2133 MHz DDR3 SDRAM offering 34 GB/s of bandwidth. Instructions transfer data to or from the host, perform matrix multiplications or convolutions, and apply activation functions.
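The numeric behaviour of that datapath can be sketched in a few lines. The following is an illustrative model only (not Google's implementation), using NumPy to mimic 8-bit multiplies feeding 32-bit accumulators; the 256-wide shapes echo the systolic array's dimension rather than any real workload.

```python
# Illustrative model (not Google's implementation) of the first-generation TPU's
# arithmetic: 8-bit integer operands whose dot products are collected in 32-bit
# accumulators, the same numeric behaviour the 256x256 systolic array produces.
import numpy as np

def int8_matmul_int32_accum(weights: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """Multiply int8 operands, accumulating every dot product in int32."""
    assert weights.dtype == np.int8 and activations.dtype == np.int8
    # Widening before the multiply-accumulate mirrors the 32-bit accumulators,
    # so a 256-element int8 dot product cannot overflow.
    return weights.astype(np.int32) @ activations.astype(np.int32)

rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)  # quantised weight tile
x = rng.integers(-128, 128, size=(256, 64), dtype=np.int8)   # quantised activations
y = int8_matmul_int32_accum(w, x)
print(y.dtype, y.shape)  # int32 (256, 64)
```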

Second generation TPU

The second-generation TPU was announced in May 2017. Google stated that the first-generation TPU design was limited by memory bandwidth, and that using 16 GB of High Bandwidth Memory in the second-generation design increased bandwidth to 600 GB/s and performance to 45 teraFLOPS. The TPUs are then arranged into four-chip modules with a performance of 180 teraFLOPS, and 64 of these modules are assembled into 256-chip pods with 11.5 petaFLOPS of performance. Notably, while the first-generation TPUs were limited to integers, the second-generation TPUs can also calculate in floating point, introducing the bfloat16 format invented by Google Brain. This makes the second-generation TPUs useful for both training and inference of machine learning models. Google stated that these second-generation TPUs would be available on Google Compute Engine for use in TensorFlow applications.
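As a rough illustration of the format (a sketch, not Google's code): bfloat16 keeps float32's 8-bit exponent but stores only 7 mantissa bits, so dynamic range is preserved while precision drops to roughly three significant decimal digits. In JAX it is exposed as an ordinary dtype, and matrix products are typically accumulated in float32.

```python
# Sketch of the bfloat16 format in JAX: the same 8-bit exponent as float32 (hence
# the same dynamic range), but only 7 stored mantissa bits.
import jax.numpy as jnp
from jax import lax

x = jnp.array([3.14159265, 1e30, 1e-30], dtype=jnp.float32)
print(x.astype(jnp.bfloat16))  # values survive, rounded to bfloat16 precision

# Mixed-precision matrix multiply: bfloat16 inputs with float32 accumulation,
# the pattern commonly used for TPU training and inference.
a = jnp.ones((128, 128), dtype=jnp.bfloat16)
b = jnp.ones((128, 128), dtype=jnp.bfloat16)
c = lax.dot(a, b, preferred_element_type=jnp.float32)
print(c.dtype)  # float32
```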

Third generation TPU

The third-generation TPU was announced on May 8, 2018. Google announced that the processors themselves are twice as powerful as the second-generation TPUs, and would be deployed in pods with four times as many chips as the preceding generation. This results in an 8-fold increase in performance per pod compared to the second-generation TPU deployment.

Fourth generation TPU

On May 18, 2021, Google CEO Sundar Pichai spoke about TPU v4 Tensor Processing Units during his keynote at the Google I/O virtual conference. TPU v4 improved performance by more than 2x over TPU v3 chips. Pichai said, "A single v4 pod contains 4,096 v4 chips, and each pod has 10x the interconnect bandwidth per chip at scale, compared to any other networking technology." An April 2023 paper by Google claims TPU v4 is 5–87% faster than an Nvidia A100 at machine learning benchmarks.
There is also an "inference" version, called v4i, that does not require liquid cooling.

Fifth generation TPU

In 2021, Google revealed that the physical layout of TPU v5 was being designed with the assistance of a novel application of deep reinforcement learning. Google claims TPU v5 is nearly twice as fast as TPU v4; based on that and the relative performance of TPU v4 over the A100, some speculate TPU v5 is as fast as or faster than an H100.
Similar to the v4i being a lighter-weight version of the v4, the fifth generation has a "cost-efficient" version called v5e. In December 2023, Google announced TPU v5p which is claimed to be competitive with the Nvidia H100.

Sixth generation TPU

In May 2024, at the Google I/O conference, Google announced Trillium, which became available in preview in October 2024. Google claimed a 4.7-fold performance increase relative to TPU v5e, achieved via larger matrix multiplication units and an increased clock speed. High Bandwidth Memory capacity and bandwidth have also doubled. A pod can contain up to 256 Trillium units.

Seventh generation TPU

In April 2025, at the Google Cloud Next conference, Google unveiled TPU v7. This new chip, called Ironwood, will come in two configurations: a 256-chip cluster and a 9,216-chip cluster. Ironwood will have a peak computational performance of 4,614 TFLOP/s.