that contribute the most to previous tasks. We use EWC to
evaluate the regularization mechanism.
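As a point of reference, the penalty that EWC adds to the loss on a new task can be sketched as follows. This is a minimal PyTorch illustration of the published quadratic penalty, not the original implementation; the regularization strength `lam` and the dictionaries holding diagonal Fisher estimates and previous-task parameters are placeholder names.

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=0.4):
    """Quadratic EWC penalty: (lam/2) * sum_i F_i * (theta_i - theta_i*)^2.

    `fisher` and `old_params` map parameter names to diagonal Fisher
    estimates and to the parameter values saved after the previous task.
    """
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            # Weights deemed important for earlier tasks (large F_i) are
            # penalized more strongly for drifting from their old values.
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty
```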
Ensemble Methods
Ensemble methods attempt to mitigate catastrophic forgetting by explicitly or implicitly training multiple classifiers together and then combining them to generate the final prediction. Explicit methods, such as Learn++ and TradaBoost, prevent forgetting because an entirely new sub-network is trained for each new batch (Polikar et al. 2001; Dai et al. 2007). However, memory usage scales with the number of batches, which is highly undesirable. Moreover, this prevents portions of the network from being re-used for the new batch. Two methods that try to alleviate the memory usage problem are Accuracy Weighted Ensembles and Life-long Machine Learning (Wang et al. 2003; Ren et al. 2017). These methods automatically decide whether a sub-network should be added to or removed from the ensemble.
PathNet can be considered an implicit ensemble method (Fernando et al. 2017). It uses a genetic algorithm to find an optimal path through a fixed-size neural network for each batch that it learns. The weights along this path are then frozen so that the knowledge is not lost when new batches are learned. In contrast to the explicit ensembles, the base network's size is fixed and learned representations can be re-used, which allows for smaller, more deployable models. The authors showed that PathNet learned subsequent tasks more quickly, but not how well earlier tasks were retained. We selected PathNet to evaluate the ensembling mechanism, and we show how well it retains pre-trained information.
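A minimal sketch of the PathNet idea follows (our illustration, not the authors' implementation). A randomly sampled path stands in for the genetic search, and the module count and layer width are arbitrary placeholders.

```python
import random
import torch
import torch.nn as nn

class ModularLayer(nn.Module):
    """A layer holding several candidate modules; a path activates a few."""
    def __init__(self, n_modules=10, dim=32):
        super().__init__()
        self.candidates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_modules))

    def forward(self, x, active):
        # Sum the outputs of the modules selected by the current path.
        return torch.relu(sum(self.candidates[i](x) for i in active))

def sample_path(n_layers, n_modules, per_layer=3):
    # Stand-in for the genetic search: one random candidate path.
    return [random.sample(range(n_modules), per_layer) for _ in range(n_layers)]

def freeze_path(layers, path):
    # Freeze the winning path so later batches cannot overwrite it.
    for layer, active in zip(layers, path):
        for i in active:
            for p in layer.candidates[i].parameters():
                p.requires_grad = False
```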
Rehearsal Methods
Rehearsal methods try to mitigate catastrophic forgetting by mixing data from earlier batches with the current batch to be learned (Robins 1995). The cost is that this requires storing past data, which is not resource efficient. Pseudorehearsal methods use the network to generate pseudopatterns (Robins 1995) that are combined with the new batch to be learned. Pseudopatterns allow the network to stabilize older memories without the requirement for storing all previously observed training data points. Draelos et al. (2016) used this approach to incrementally train an autoencoder, where each batch contained images from a specific category. After the autoencoder learned a particular batch, they passed the batch through the encoder and stored the output statistics. During replay, they used these statistics and the decoder network to generate the appropriate pseudopatterns for each class.
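The pseudorehearsal scheme described above can be sketched as follows. This is our reading of the approach, not the authors' code; `encoder` and `decoder` are hypothetical callables for the two halves of the autoencoder.

```python
import torch

@torch.no_grad()
def store_code_stats(encoder, images):
    # After learning a batch, record per-class statistics of the encoder codes.
    codes = encoder(images)
    return codes.mean(dim=0), codes.std(dim=0)

@torch.no_grad()
def generate_pseudopatterns(decoder, mean, std, n_samples=64):
    # During replay, sample codes from the stored statistics and decode them
    # into pseudopatterns for the corresponding class.
    codes = mean + std * torch.randn(n_samples, mean.shape[0])
    return decoder(codes)
```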
The SOM model proposed by Gepperth and Karaoguz (2016) reserves its training data for replay after each new class is trained. This model uses a self-organizing map (SOM) as a hidden layer to topologically reorganize the data from the input layer (i.e., clustering the input onto a 2-D lattice). We use this model to explore the value of rehearsal.
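The rehearsal step itself, mixing reserved data with the current batch, can be sketched as below (our illustration, using a plain tensor buffer in place of the SOM-organized hidden layer; the replay size is a placeholder).

```python
import torch

def rehearsal_batch(new_x, new_y, memory_x, memory_y, n_replay=32):
    # Mix stored examples from earlier batches into the current training batch.
    idx = torch.randperm(memory_x.shape[0])[:n_replay]
    x = torch.cat([new_x, memory_x[idx]], dim=0)
    y = torch.cat([new_y, memory_y[idx]], dim=0)
    return x, y
```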
Dual-Memory Models
Dual-memory models are inspired by memory consolidation in the mammalian brain, which is thought to store memories in two distinct neural networks. Newly formed memories are stored in a brain region known as the hippocampus. These memories are then slowly transferred/consolidated to the pre-frontal cortex during sleep. Several algorithms based on these ideas have been created. Early work used fast (hippocampal) and slow (cortical) training networks to separate pattern-processing areas, and passed pseudopatterns back and forth to consolidate recent and remote memories (French 1997). In general, dual-memory models incorporate rehearsal, but not all rehearsal-based models are dual-memory models.
Another model proposed by Gepperth and Karaoguz (2016), which we denote STM, stores new inputs that yield a highly uncertain prediction in a short-term memory buffer. The model then seeks to consolidate the new memories into the entire network during a separate sleep phase. They showed that STM could incrementally learn MNIST classes without forgetting previously trained ones. We use SOM and STM to evaluate the dual-memory approach.
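The short-term-memory mechanism can be sketched as follows. This is our illustration of the described behavior, not the original implementation; the uncertainty threshold and the list-based buffer are placeholders.

```python
import torch

def maybe_buffer(model, x, y, buffer, threshold=0.5):
    # Store samples with highly uncertain predictions in short-term memory.
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=-1)
    if probs.max().item() < threshold:
        buffer.append((x, y))

def sleep_phase(model, optimizer, loss_fn, buffer):
    # "Sleep": consolidate the buffered memories into the long-term network.
    for x, y in buffer:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    buffer.clear()
```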
Sparse-Coding Methods
Catastrophic forgetting occurs when new internal representations interfere with previously learned ones (French 1999). Sparse representations can reduce the chance of this interference; however, sparsity can impair generalization and the ability to learn new tasks (Sharkey and Sharkey 1995).
Two models that implicitly use sparsity are CALM and ALCOVE. To learn new data, CALM searches among competing nodes to see which nodes have not been committed to another representation (Murre 2014). ALCOVE is a shallow neural network that uses a sparse distance-based representation, which allows the weights assigned to older tasks to be largely unchanged when the network is presented with new data (Kruschke 1992). The Sparse Distributed Memory (SDM) is a convolution-correlation model that uses sparsity to reduce the overlap between internal representations (Kanerva 1988). CHARM and TODAM are also convolution-correlation models that use internal codings to ensure that new input representations remain orthogonal to one another (Murdock 1983; Eich 1982).
The Fixed Expansion Layer (FEL) model creates sparse representations by fixing the network's weights and specifying neuron triggering conditions (Coop, Mishtal, and Arel 2013). FEL uses excitatory and inhibitory fixed weights to sparsify the input, which gates the weight updates throughout the network. This enables the network to retain prior learned mappings and reduce representational overlap. We use FEL to evaluate the sparsity mechanism.
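A sparsifier in this spirit can be sketched as below. This is our illustration, not the FEL implementation; the fixed weight matrix `w_fixed` and the number of active units `k` are placeholders.

```python
import torch

def fixed_expansion(x, w_fixed, k=16):
    # w_fixed holds frozen excitatory/inhibitory weights and is never trained.
    h = x @ w_fixed
    topk = torch.topk(h, k, dim=-1)
    mask = torch.zeros_like(h).scatter_(-1, topk.indices, 1.0)
    # Only the k most strongly driven units stay active; the resulting sparse
    # code limits which downstream weights receive updates.
    return h * mask
```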
Experimental Setup
We explore how well methods for mitigating catastrophic forgetting scale to hard datasets involving fine-grained image and audio classification. These datasets were chosen because they contain 1) different data modalities (image and audio), 2) a large number of classes, and 3) a small number of samples per class. These datasets represent more meaningful, real-world problems and are more practical than MNIST. We also use MNIST to showcase the value of these real-world datasets. See Table 1 for dataset statistics.