Attention (machine learning)
In machine learning, attention is a method that determines the importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.
Unlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Earlier designs implemented the attention mechanism in a serial recurrent neural network language translation system, but a more recent design, namely the transformer, removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme.
Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of using information from the hidden layers of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence tends to be attenuated. Attention allows a token equal access to any part of a sentence directly, rather than only through the previous state.
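As an illustration of how such soft weights give a token direct access to every position, the following sketch (the function name, dimensions, and random embeddings are purely illustrative, not drawn from any particular implementation) scores one token's embedding against all token embeddings and normalizes the scores with a softmax:

```python
import numpy as np

def soft_attention_weights(query, embeddings):
    """Score one query embedding against every token embedding with dot
    products, then normalize with a softmax into "soft" weights."""
    scores = embeddings @ query                      # one similarity score per token
    scores = scores - scores.max()                   # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # soft weights, summing to 1
    return weights

# Toy sentence: 5 tokens with 4-dimensional embeddings (values are illustrative).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 4))

# The last token can attend directly to the first token (or any other position),
# rather than reaching it only through a chain of recurrent hidden states.
weights = soft_attention_weights(embeddings[-1], embeddings)
context = weights @ embeddings   # weighted sum of all token embeddings
print(weights.round(3), context.shape)
```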
History
| Period | Development |
| 1950s–1960s | Psychology and biology of attention. Cocktail party effect — focusing on content by filtering out background noise. Filter model of attention, partial report paradigm, and saccade control. |
| 1980s | Sigma-pi units, higher-order neural networks. |
| 1990s | Fast weight controllers and dynamic links between neurons, anticipating key-value mechanisms in attention. |
| 1998 | The bilateral filter was introduced in image processing. It uses pairwise affinity matrices to propagate relevance across elements. |
| 2005 | Non-local means extended affinity-based filtering in image denoising, using Gaussian similarity kernels as fixed attention-like weights. |
| 2014 | seq2seq with RNN + Attention. Attention was introduced to enhance RNN encoder-decoder translation, particularly for long sentences. See Overview section. Attentional Neural Networks introduced a learned feature selection mechanism using top-down cognitive modulation, showing how attention weights can highlight relevant inputs. |
| 2015 | Attention was extended to vision for image captioning tasks. |
| 2016 | Self-attention was integrated into RNN-based models to capture intra-sequence dependencies. Self-attention was explored in decomposable attention models for natural language inference and structured self-attentive sentence embeddings. |
| 2017 | The Transformer architecture, introduced in the research paper Attention Is All You Need, formalized scaled dot-product self-attention. Relation networks and Set Transformers applied attention to unordered sets and relational reasoning, generalizing pairwise interaction models. |
| 2018 | Non-local neural networks extended attention to computer vision by capturing long-range dependencies in space and time. Graph attention networks applied attention mechanisms to graph-structured data. |
| 2019–2020 | Efficient Transformers, including Reformer, Linformer, and Performer, introduced scalable approximations of attention for long sequences. |
| 2019+ | Hopfield networks were reinterpreted as associative memory-based attention systems, and vision transformers achieved competitive results in image classification. Transformers were adopted across scientific domains, including AlphaFold for protein folding, CLIP for vision-language pretraining, and attention-based dense segmentation models like CCNet and DANet. |
Additional surveys of the attention mechanism in deep learning are provided by Niu et al. and Soydaner.
The major breakthrough came with self-attention, where each element in the input sequence attends to all others, enabling the model to capture global dependencies. This idea was central to the Transformer architecture, which replaced recurrence with attention mechanisms. As a result, Transformers became the foundation for models like BERT, T5 and generative pre-trained transformers.
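A minimal sketch of scaled dot-product self-attention, assuming random projection matrices and toy dimensions that do not come from any particular model: queries, keys, and values are all computed from the same sequence, so every position attends to every other position.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every row of X attends to all rows."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # (n, n) attention weights, rows sum to 1
    return A @ V, A

rng = np.random.default_rng(0)
n, d_model, d_head = 6, 8, 4
X = rng.normal(size=(n, d_model))                       # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
print(A.shape, out.shape)   # (6, 6) (6, 4)
```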
Overview
The modern era of machine attention was revitalized by grafting an attention mechanism onto an encoder-decoder architecture.
Interpreting attention weights
In translating between languages, alignment is the process of matching words from the source sentence to words of the translated sentence. Networks that perform verbatim translation without regard to word order would show the highest scores along the diagonal of the alignment matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. Consider an example of translating I love you to French. On the first pass through the decoder, 94% of the attention weight is on the first English word I, so the network offers the word je. On the second pass of the decoder, 88% of the attention weight is on the third English word you, so it offers t'. On the last pass, 95% of the attention weight is on the second English word love, so it offers aime. In this example, the second English word love is aligned with the third French word aime. Stacking the soft row vectors for je, t', and aime yields an alignment matrix, sketched below.
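A sketch of such an alignment matrix, with one row per produced French token and one column per English source word; the 0.94, 0.88, and 0.95 entries come from the example above, while the remaining entries are illustrative values chosen so that each row sums to one:

$$
\begin{array}{c|ccc}
 & \text{I} & \text{love} & \text{you} \\ \hline
\text{je} & 0.94 & 0.02 & 0.04 \\
\text{t'} & 0.11 & 0.01 & 0.88 \\
\text{aime} & 0.03 & 0.95 & 0.02
\end{array}
$$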
Sometimes, alignment can be multiple-to-multiple. For example, the English phrase look it up corresponds to cherchez-le. Thus, "soft" attention weights work better than "hard" attention weights (which set one weight to 1 and the others to 0), as we would like the model to form a context vector as a weighted sum of the hidden vectors, rather than picking "the best one", since there may not be a single best hidden vector.
Variants
Many variants of attention implement soft weights, such as Bahdanau-style (additive) attention and Luong-style (multiplicative) attention.
These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. Here, W denotes the matrix of context attention weights, analogous to the alignment weights described in the Overview section above.
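As a hedged sketch of this re-weighting (dot-product scoring is assumed here; additive variants score the pairs differently), decoder states are compared against encoder states, each row of W is normalized with a softmax, and W then mixes the encoder states into one context vector per target position:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states):
    """Correlation-style dot products between decoder and encoder states,
    normalized row-wise into the context attention weight matrix W."""
    scores = decoder_states @ encoder_states.T   # (targets, sources) dot products
    W = softmax(scores, axis=-1)                 # each row sums to 1
    contexts = W @ encoder_states                # weighted sums of encoder states
    return contexts, W

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(3, 8))   # e.g. hidden states for "I love you"
decoder_states = rng.normal(size=(3, 8))   # e.g. decoder states for "je t' aime"
contexts, W = cross_attention(decoder_states, encoder_states)
print(W.round(2))   # rows of W play the role of the alignment matrix
```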
Optimizations
Flash attention
The size of the attention matrix is proportional to the square of the number of input tokens. Therefore, when the input is long, calculating the attention matrix requires a lot of GPU memory. Flash attention is an implementation that reduces the memory needs and increases efficiency without sacrificing accuracy. It achieves this by partitioning the attention computation into smaller blocks that fit into the GPU's faster on-chip memory, reducing the need to store large intermediate matrices and thus lowering memory usage while increasing computational efficiency.
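The following simplified sketch illustrates only the blocking idea (a running maximum and normalizer per query, so the full attention matrix is never materialized); the actual Flash attention kernel also tiles the queries, fuses these steps in on-chip memory, and handles the backward pass with recomputation:

```python
import numpy as np

def blockwise_attention(Q, K, V, block_size=64):
    """Compute softmax(Q @ K.T / sqrt(d)) @ V one key/value block at a time,
    keeping a running maximum and normalizer per query so that the full
    (n, n) attention matrix is never stored."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, V.shape[1]))
    running_max = np.full(n, -np.inf)   # running max of scores for each query
    running_sum = np.zeros(n)           # running softmax normalizer for each query

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]            # one block of keys
        Vb = V[start:start + block_size]            # matching block of values
        scores = (Q @ Kb.T) * scale                 # (n, block) score tile

        new_max = np.maximum(running_max, scores.max(axis=1))
        correction = np.exp(running_max - new_max)  # rescale earlier blocks
        exp_scores = np.exp(scores - new_max[:, None])

        out = out * correction[:, None] + exp_scores @ Vb
        running_sum = running_sum * correction + exp_scores.sum(axis=1)
        running_max = new_max

    return out / running_sum[:, None]

# The blockwise result matches the direct (fully materialized) computation.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(200, 16)) for _ in range(3))
scores = (Q @ K.T) / np.sqrt(16)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
assert np.allclose(blockwise_attention(Q, K, V), weights @ V)
```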