Recurrent neural network
In artificial neural networks, recurrent neural networks (RNNs) are designed for processing sequential data, such as text, speech, and time series, where the order of elements is important. Unlike feedforward neural networks, which process inputs independently, RNNs utilize recurrent connections, where the output of a neuron at one time step is fed back as input to the network at the next time step. This enables RNNs to capture temporal dependencies and patterns within sequences.
The fundamental building block of an RNN is the recurrent unit, which maintains a hidden state, a form of memory that is updated at each time step based on the current input and the previous hidden state. This feedback mechanism allows the network to learn from past inputs and incorporate that knowledge into its current processing. RNNs have been successfully applied to tasks such as unsegmented, connected handwriting recognition, speech recognition, natural language processing, and neural machine translation.
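The following is a minimal sketch of this hidden-state update in NumPy; the tanh activation, the weight names, and the random initialization are illustrative assumptions rather than details fixed by any particular RNN variant.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8  # illustrative dimensions

W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

def step(x_t, h_prev):
    # New hidden state from the current input and the previous hidden state.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_size)                      # initial hidden state (the "memory")
for x_t in rng.normal(size=(5, input_size)):   # a toy sequence of 5 input vectors
    h = step(x_t, h)                           # carries information forward through time
```

Here the network's entire memory is the single vector `h` carried through the loop, so each new state depends on the whole preceding input history.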
However, traditional RNNs suffer from the vanishing gradient problem, which limits their ability to learn long-range dependencies. This issue was addressed by the long short-term memory (LSTM) architecture in 1997, which became the standard RNN variant for handling long-term dependencies. Later, gated recurrent units (GRUs) were introduced as a more computationally efficient alternative.
In recent years, transformers, which rely on self-attention mechanisms instead of recurrence, have become the dominant architecture for many sequence-processing tasks, particularly in natural language processing, due to their superior handling of long-range dependencies and greater parallelizability. Nevertheless, RNNs remain relevant for applications where computational efficiency, real-time processing, or the inherent sequential nature of data is crucial.
History
Before modern
One origin of RNNs was neuroscience. The word "recurrent" is used to describe loop-like structures in anatomy. In 1901, Cajal observed "recurrent semicircles" in the cerebellar cortex formed by parallel fibers, Purkinje cells, and granule cells. In 1933, Lorente de Nó discovered "recurrent, reciprocal connections" using Golgi's method, and proposed that excitatory loops explain certain aspects of the vestibulo-ocular reflex. During the 1940s, multiple people proposed the existence of feedback in the brain, in contrast to the previous understanding of the neural system as a purely feedforward structure. Hebb considered the "reverberating circuit" as an explanation for short-term memory. The McCulloch and Pitts paper, which proposed the McCulloch-Pitts neuron model, considered networks that contain cycles; the current activity of such networks can be affected by activity indefinitely far in the past. They were both interested in closed loops as possible explanations for conditions such as epilepsy and causalgia. Recurrent inhibition was proposed in 1946 as a negative feedback mechanism in motor control. Neural feedback loops were a common topic of discussion at the Macy conferences.
Frank Rosenblatt in 1960 published "closed-loop cross-coupled perceptrons", which are 3-layered perceptron networks whose middle layer contains recurrent connections that change by a Hebbian learning rule. Later, in Principles of Neurodynamics, he described "closed-loop cross-coupled" and "back-coupled" perceptron networks, carried out theoretical and experimental studies of Hebbian learning in these networks, and noted that a fully cross-coupled perceptron network is equivalent to an infinitely deep feedforward network.
Similar networks were published by Kaoru Nakano in 1971, Shun'ichi Amari in 1972, and William A. Little in 1974, who was acknowledged by Hopfield in his 1982 paper.
Another origin of RNNs was statistical mechanics. The Ising model was developed by Wilhelm Lenz and Ernst Ising in the 1920s as a simple statistical mechanical model of magnets at equilibrium. Glauber in 1963 added the component of time by studying how the Ising model evolves towards equilibrium (Glauber dynamics).
The Sherrington–Kirkpatrick model of spin glass, published in 1975, is the Hopfield network with random initialization. Sherrington and Kirkpatrick found that it is highly likely for the energy function of the SK model to have many local minima. In his 1982 paper, Hopfield applied this recently developed theory to study the Hopfield network with binary activation functions. In a 1984 paper he extended this to continuous activation functions. It became a standard model for the study of neural networks through statistical mechanics.
Modern
Modern RNNs are mainly based on two architectures: LSTM and BRNN (bidirectional RNN).
At the resurgence of neural networks in the 1980s, recurrent networks were studied again. They were sometimes called "iterated nets". Two early influential works were the Jordan network and the Elman network, which applied RNNs to the study of cognitive psychology. In 1993, a neural history compressor system solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time.
Long short-term memory networks were invented by Hochreiter and Schmidhuber in 1995 and set accuracy records in multiple application domains. LSTM became the default choice for RNN architecture.
Bidirectional recurrent neural networks use two RNNs that process the same input in opposite directions. These two are often combined, giving the bidirectional LSTM architecture.
Around 2006, bidirectional LSTM started to revolutionize speech recognition, outperforming traditional models in certain speech applications. It also improved large-vocabulary speech recognition and text-to-speech synthesis, and was used in Google voice search and dictation on Android devices. LSTM broke records in machine translation, language modeling, and multilingual language processing. LSTM combined with convolutional neural networks also improved automatic image captioning.
The idea of encoder-decoder sequence transduction had been developed in the early 2010s. The papers most commonly cited as the originators of seq2seq are two papers from 2014. A seq2seq architecture employs two RNNs, typically LSTMs, an "encoder" and a "decoder", for sequence transduction, such as machine translation. They became state of the art in machine translation, and were instrumental in the development of attention mechanisms and transformers.
Configurations
An RNN-based model can be factored into two parts: configuration and architecture. Multiple RNNs can be combined in a data flow, and the data flow itself is the configuration. Each RNN itself may have any architecture, including LSTM, GRU, etc.
Standard
RNNs come in many variants. Abstractly speaking, an RNN is a function $f_\theta$ of type $(x_t, h_t) \mapsto (y_t, h_{t+1})$, where
- $x_t$: input vector;
- $h_t$: hidden vector;
- $y_t$: output vector;
- $\theta$: neural network parameters.
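As a concrete illustration of this abstract interface, the sketch below (in NumPy) implements one possible $f_\theta$; the tanh cell, the linear output projection, and all dimensions are assumptions made for the example rather than part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y = 4, 8, 3  # sizes of x_t, h_t, y_t (illustrative)

theta = (rng.normal(scale=0.1, size=(d_h, d_x)),   # W_x: input weights
         rng.normal(scale=0.1, size=(d_h, d_h)),   # W_h: recurrent weights
         rng.normal(scale=0.1, size=(d_y, d_h)),   # W_y: output weights
         np.zeros(d_h), np.zeros(d_y))             # b_h, b_y: biases

def f_theta(x_t, h_t, theta):
    """Map (input vector, hidden vector) to (output vector, next hidden vector)."""
    W_x, W_h, W_y, b_h, b_y = theta
    h_next = np.tanh(W_x @ x_t + W_h @ h_t + b_h)
    y_t = W_y @ h_next + b_y
    return y_t, h_next

# Unfolding over a sequence: the same f_theta is applied at every time step.
h = np.zeros(d_h)
ys = []
for x_t in rng.normal(size=(6, d_x)):
    y_t, h = f_theta(x_t, h, theta)
    ys.append(y_t)
```

The final loop is the "unfolding" over time discussed next: the same function and parameters are reused at every step, with only the hidden state changing.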
The illustration to the right may be misleading to many because practical neural network topologies are frequently organized in "layers" and the drawing gives that appearance. However, what appears to be layers are, in fact, different steps in time, "unfolded" to produce the appearance of layers.
Stacked RNN
A stacked RNN, or deep RNN, is composed of multiple RNNs stacked one above the other. Abstractly, it is structured as follows:
- Layer 1 has hidden vector $h_{1,t}$, parameters $\theta_1$, and maps $f_{\theta_1}: (x_t, h_{1,t}) \mapsto (y_{1,t}, h_{1,t+1})$.
- Layer 2 has hidden vector $h_{2,t}$, parameters $\theta_2$, and maps $f_{\theta_2}: (y_{1,t}, h_{2,t}) \mapsto (y_{2,t}, h_{2,t+1})$.
- ...
- Layer $n$ has hidden vector $h_{n,t}$, parameters $\theta_n$, and maps $f_{\theta_n}: (y_{n-1,t}, h_{n,t}) \mapsto (y_{n,t}, h_{n,t+1})$.
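A minimal sketch of this stacking in NumPy, assuming simple tanh cells whose output equals their hidden state; each layer's output sequence becomes the input sequence of the layer above it. Cell structure and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cell(in_size, hidden_size):
    """Build a simple tanh recurrent cell whose output is its hidden state."""
    W_x = rng.normal(scale=0.1, size=(hidden_size, in_size))
    W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    def cell(x_t, h_t):
        h_next = np.tanh(W_x @ x_t + W_h @ h_t)
        return h_next, h_next      # (output, next hidden state)
    return cell

def run_layer(cell, inputs, hidden_size):
    """Run one RNN layer over an input sequence, returning its output sequence."""
    h, outputs = np.zeros(hidden_size), []
    for x_t in inputs:
        y_t, h = cell(x_t, h)
        outputs.append(y_t)
    return outputs

# Three layers stacked: each layer's outputs are the next layer's inputs.
sizes = [4, 8, 8, 8]                 # input size followed by three hidden sizes
layers = [make_cell(sizes[i], sizes[i + 1]) for i in range(3)]
seq = list(rng.normal(size=(5, 4)))  # a toy input sequence of length 5
for cell, hidden_size in zip(layers, sizes[1:]):
    seq = run_layer(cell, seq, hidden_size)
```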
Bidirectional
A bidirectional RNN is composed of two RNNs, one processing the input sequence in one direction, and another in the opposite direction. Abstractly, it is structured as follows:
- The forward RNN processes in one direction: $f_{\theta}(x_0, h_0) = (y_0, h_1),\ f_{\theta}(x_1, h_1) = (y_1, h_2),\ \dots$
- The backward RNN processes in the opposite direction: $f'_{\theta'}(x_N, h'_N) = (y'_N, h'_{N-1}),\ f'_{\theta'}(x_{N-1}, h'_{N-1}) = (y'_{N-1}, h'_{N-2}),\ \dots$
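A minimal sketch of this in NumPy, assuming two independent tanh cells: the forward cell scans the sequence left to right, the backward cell scans right to left, and their per-token outputs are concatenated. All names and dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
in_size, hidden_size, T = 4, 8, 5

def make_cell():
    """A simple tanh recurrent cell; each direction gets independent parameters."""
    W_x = rng.normal(scale=0.1, size=(hidden_size, in_size))
    W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    return lambda x_t, h_t: np.tanh(W_x @ x_t + W_h @ h_t)

forward, backward = make_cell(), make_cell()
xs = list(rng.normal(size=(T, in_size)))   # a toy input sequence of length T

h, fwd = np.zeros(hidden_size), []
for x_t in xs:                             # left-to-right pass
    h = forward(x_t, h)
    fwd.append(h)

h, bwd = np.zeros(hidden_size), []
for x_t in reversed(xs):                   # right-to-left pass
    h = backward(x_t, h)
    bwd.append(h)
bwd.reverse()                              # realign with the original time order

# Each token's representation combines left and right context.
outputs = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```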
Bidirectional RNN allows the model to process a token both in the context of what came before it and what came after it. By stacking multiple bidirectional RNNs together, the model can build increasingly contextual representations of each token. The ELMo model is a stacked bidirectional LSTM which takes character-level inputs and produces word-level embeddings.