Catastrophic interference
Catastrophic interference, also known as catastrophic forgetting, is the tendency of an artificial neural network to abruptly and drastically forget previously learned information upon learning new information.
Neural networks are an important part of the connectionist approach to cognitive science. The issue of catastrophic interference when modeling human memory with connectionist models was originally brought to the attention of the scientific community by research from McCloskey and Cohen, and Ratcliff. It is a radical manifestation of the 'sensitivity-stability' dilemma or the 'stability-plasticity' dilemma. Specifically, this dilemma refers to the challenge of making an artificial neural network that is sensitive to, but not disrupted by, new information.
Lookup tables and connectionist networks lie at opposite ends of the stability-plasticity spectrum. The former remain completely stable in the presence of new information but lack the ability to generalize, i.e. to infer general principles, from new inputs. On the other hand, connectionist networks like the standard backpropagation network can generalize to unseen inputs, but they are sensitive to new information. Backpropagation models can be analogized to human memory insofar as they have a similar ability to generalize, but these networks often exhibit less stability than human memory. Notably, these backpropagation networks are susceptible to catastrophic interference. This is an issue when modelling human memory, because unlike these networks, humans typically do not show catastrophic forgetting.
Discovery
The term catastrophic interference was originally coined by McCloskey and Cohen but was also brought to the attention of the scientific community by research from Ratcliff.
''The Sequential Learning Problem'': McCloskey and Cohen (1989)
McCloskey and Cohen noted the problem of catastrophic interference during two different experiments with backpropagation neural network modelling.
- Experiment 1: Learning the ones and twos addition facts
In their first experiment they trained a standard backpropagation neural network on a single training set consisting of 17 single-digit ones problems until the network could represent and respond properly to all of them. The error between the actual output and the desired output steadily declined across training sessions, which reflected that the network learned to represent the target outputs better across trials. Next, they trained the network on a single training set consisting of 17 single-digit twos problems until the network could represent and respond properly to all of them. They noted that their procedure was similar to how a child would learn their addition facts. Following each learning trial on the twos facts, the network was tested for its knowledge on both the ones and twos addition facts. Like the ones facts, the twos facts were readily learned by the network. However, McCloskey and Cohen noted that the network was no longer able to properly answer the ones addition problems even after one learning trial of the twos addition problems. The output pattern produced in response to the ones facts often resembled an output pattern for an incorrect number more closely than the output pattern for the correct number. This is considered to be a drastic amount of error. Furthermore, the problems 2+1 and 1+2, which were included in both training sets, showed dramatic disruption even during the first learning trials of the twos facts.
- Experiment 2: Replication of Barnes and Underwood study
In their second connectionist model, McCloskey and Cohen attempted to replicate the study on retroactive interference in humans by Barnes and Underwood. They trained the model on A-B and A-C lists and used a context pattern in the input vector to differentiate between the lists. Specifically, the network was trained to respond with the correct B response when shown the A stimulus and the A-B context pattern, and to respond with the correct C response when shown the A stimulus and the A-C context pattern. When the model was trained concurrently on the A-B and A-C items, the network readily learned all of the associations correctly. In sequential training the A-B list was trained first, followed by the A-C list. After each presentation of the A-C list, performance was measured for both the A-B and A-C lists. They found that the amount of training on the A-C list that led to 50% correct responses in the Barnes and Underwood study led to nearly 0% correct responses by the backpropagation network. Furthermore, they found that the network tended to produce responses that looked like the C response pattern when it was prompted to give the B response pattern, indicating that the A-C list had apparently overwritten the A-B list. This could be likened to learning the word dog, followed by learning the word stool, and then finding that you think of the word stool when presented with the word dog.
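A minimal sketch of this kind of input construction, not McCloskey and Cohen's actual model: the shared A stimulus is concatenated with a list-specific context pattern, and the target is the B or C response depending on that context. All dimensions, encodings, and random patterns here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pair_inputs(n_pairs=8, stim_dim=10):
    """Build illustrative A-B and A-C training items.

    Each input is the A stimulus concatenated with a context pattern
    that marks which list (A-B or A-C) the item belongs to.
    """
    A = rng.choice([0.0, 1.0], size=(n_pairs, stim_dim))   # shared A stimuli
    B = rng.choice([0.0, 1.0], size=(n_pairs, stim_dim))   # B responses
    C = rng.choice([0.0, 1.0], size=(n_pairs, stim_dim))   # C responses

    ctx_ab = np.tile([1.0, 0.0, 1.0, 0.0], (n_pairs, 1))   # A-B context pattern
    ctx_ac = np.tile([0.0, 1.0, 0.0, 1.0], (n_pairs, 1))   # A-C context pattern

    x_ab = np.hstack([A, ctx_ab])   # inputs for the A-B list
    x_ac = np.hstack([A, ctx_ac])   # same A stimuli, different context
    return (x_ab, B), (x_ac, C)

(x_ab, y_ab), (x_ac, y_ac) = make_pair_inputs()
print(x_ab.shape, y_ab.shape)   # (8, 14) (8, 10)
```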
McCloskey and Cohen tried to reduce interference through a number of manipulations, including changing the number of hidden units, changing the value of the learning rate parameter, overtraining on the A-B list, freezing certain connection weights, and changing the target values from 0 and 1 to 0.1 and 0.9. However, none of these manipulations satisfactorily reduced the catastrophic interference exhibited by the networks.
Overall, McCloskey and Cohen concluded that:
- at least some interference will occur whenever new learning alters the weights involved in representing old learning
- the greater the amount of new learning, the greater the disruption in old knowledge
- interference was catastrophic in the backpropagation networks when learning was sequential but not concurrent
''Constraints Imposed by Learning and Forgetting Functions'': Ratcliff (1990)
Ratcliff used multiple sets of backpropagation models applied to standard recognition memory procedures, in which the items were sequentially learned. After inspecting the recognition performance of the models, he found two major problems:
- Well-learned information was catastrophically forgotten as new information was learned in both small and large backpropagation networks.
Even one learning trial with new information resulted in a significant loss of the old information, paralleling the findings of McCloskey and Cohen. Ratcliff also found that the resulting outputs were often a blend of the previous input and the new input. In larger networks, items learned in groups were more resistant to forgetting than were items learned singly. However, the forgetting for items learned in groups was still large. Adding new hidden units to the network did not reduce interference.
- Discrimination between the studied items and previously unseen items decreased as the network learned more.
This finding contradicts studies on human memory, which indicated that discrimination increases with learning. Ratcliff attempted to alleviate this problem by adding 'response nodes' that would selectively respond to old and new inputs. However, this method did not work as these response nodes would become active for all inputs. A model which used a context pattern also failed to increase discrimination between new and old items.
Proposed solutions
The main cause of catastrophic interference seems to be overlap in the representations at the hidden layer of distributed neural networks. In a distributed representation, each input tends to create changes in the weights of many of the nodes. Catastrophic forgetting occurs because when many of the weights where "knowledge is stored" are changed, it is unlikely for prior knowledge to be kept intact. During sequential learning, the inputs become mixed, with the new inputs being superimposed on top of the old ones. Another way to conceptualize this is by visualizing learning as a movement through a weight space. This weight space can be likened to a spatial representation of all of the possible combinations of weights that the network could possess. When a network first learns to represent a set of patterns, it finds a point in the weight space that allows it to recognize all of those patterns. However, when the network then learns a new set of patterns, it will move to a place in the weight space for which the only concern is the recognition of the new patterns. To recognize both sets of patterns, the network must find a place in the weight space suitable for recognizing both the new and the old patterns.
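The effect described above is straightforward to reproduce. The following sketch assumes an arbitrary two-layer backpropagation network and random binary patterns rather than any model from the literature: it trains the network on an 'old' pattern set, then trains it only on a 'new' set, and prints how the error on the old set rises after the new learning.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Two disjoint random pattern sets standing in for "old" and "new" knowledge.
x_old, y_old = rng.choice([0., 1.], (20, 16)), rng.choice([0., 1.], (20, 8))
x_new, y_new = rng.choice([0., 1.], (20, 16)), rng.choice([0., 1.], (20, 8))

# One hidden layer, trained with plain backpropagation (no biases, for brevity).
W1 = rng.normal(0, 0.3, (16, 32))
W2 = rng.normal(0, 0.3, (32, 8))

def train(x, y, epochs=2000, lr=0.5):
    global W1, W2
    for _ in range(epochs):
        h = sigmoid(x @ W1)                    # hidden activations
        out = sigmoid(h @ W2)                  # network output
        d_out = (out - y) * out * (1 - out)    # output-layer delta
        d_h = (d_out @ W2.T) * h * (1 - h)     # hidden-layer delta
        W2 -= lr * h.T @ d_out / len(x)
        W1 -= lr * x.T @ d_h / len(x)

def error(x, y):
    return float(np.mean((sigmoid(sigmoid(x @ W1) @ W2) - y) ** 2))

train(x_old, y_old)
print("old-set error after learning old set:", round(error(x_old, y_old), 4))

train(x_new, y_new)                            # sequential training on the new set only
print("old-set error after learning new set:", round(error(x_old, y_old), 4))
print("new-set error:", round(error(x_new, y_new), 4))
```

Because both sets of patterns share the same weights, the second phase of training pulls the network to a region of weight space chosen only to fit the new patterns, and the error on the old patterns climbs accordingly.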
Below are a number of techniques which have empirical support in successfully reducing catastrophic interference in backpropagation neural networks:
Orthogonality
Many of the early techniques for reducing representational overlap involved making either the input vectors or the hidden unit activation patterns orthogonal to one another. Lewandowsky and Li noted that the interference between sequentially learned patterns is minimized if the input vectors are orthogonal to each other. Input vectors are said to be orthogonal to each other if the pairwise products of their elements across the two vectors sum to zero. For example, the patterns [0, 0, 1, 0] and [0, 1, 0, 0] are said to be orthogonal because (0 × 0) + (0 × 1) + (1 × 0) + (0 × 0) = 0. One of the techniques which can create orthogonal representations at the hidden layers involves bipolar feature coding (i.e., coding using -1 and 1 rather than 0 and 1). Orthogonal patterns tend to produce less interference with each other. However, not all learning problems can be represented using these types of vectors and some studies report that the degree of interference is still problematic with orthogonal vectors.
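As a small illustration of the orthogonality condition (the vectors and the bipolar recoding below are illustrative examples, not drawn from a particular study), the sum of pairwise products is simply the dot product of the two vectors:

```python
import numpy as np

def orthogonal(u, v):
    """Two input vectors are orthogonal if their element-wise products sum to zero."""
    return np.dot(u, v) == 0

a = np.array([0, 0, 1, 0])
b = np.array([0, 1, 0, 0])
print(orthogonal(a, b))      # True: (0*0) + (0*1) + (1*0) + (0*0) = 0

# Bipolar feature coding maps binary 0/1 features onto -1/+1.
binary = np.array([1, 0, 1, 1, 0])
bipolar = 2 * binary - 1
print(bipolar)               # [ 1 -1  1  1 -1]
```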
Node sharpening technique
According to French, catastrophic interference arises in feedforward backpropagation networks due to the interaction of node activations, or activation overlap, that occurs in distributed representations at the hidden layer. Neural networks that employ very localized representations do not show catastrophic interference because of the lack of overlap at the hidden layer. French therefore suggested that reducing the value of activation overlap at the hidden layer would reduce catastrophic interference in distributed networks. Specifically, he proposed that this could be done through changing the distributed representations at the hidden layer to 'semi-distributed' representations. A 'semi-distributed' representation has fewer hidden nodes that are active, and/or a lower activation value for these nodes, for each representation, which will make the representations of the different inputs overlap less at the hidden layer. French recommended that this could be done through 'activation sharpening', a technique which slightly increases the activation of a certain number of the most active nodes in the hidden layer, slightly reduces the activation of all the other units, and then changes the input-to-hidden layer weights to reflect these activation changes.
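A rough sketch of this activation-sharpening idea, assuming sigmoid hidden units and a simple one-step delta-rule adjustment of the input-to-hidden weights; the sharpening factor, the number of boosted nodes, and the update rule are assumptions for illustration rather than French's exact procedure.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sharpen_hidden(x, W1, k=2, alpha=0.1, lr=0.1):
    """One activation-sharpening step on the input-to-hidden weights.

    The k most active hidden units are nudged toward 1, all others toward 0,
    and W1 is adjusted so the hidden layer tends to produce this sharper,
    less overlapping pattern for input x.
    """
    h = sigmoid(x @ W1)                           # current hidden activations
    target = h - alpha * h                        # slightly reduce every unit...
    top = np.argsort(h)[-k:]
    target[top] = h[top] + alpha * (1 - h[top])   # ...but boost the k most active
    # Delta-rule step through the sigmoid toward the sharpened activations.
    delta = (target - h) * h * (1 - h)
    return W1 + lr * np.outer(x, delta)

rng = np.random.default_rng(2)
x = rng.choice([0.0, 1.0], 8)
W1 = rng.normal(0, 0.5, (8, 6))
W1 = sharpen_hidden(x, W1)
```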