1990). An MDL-based, Bayesian argument suggests that flat minima correspond to “simple” NNs and low
expected overfitting. Compare Sec. 5.6.4 and more recent developments mentioned in Sec. 5.24.
5.6.4 Potential Benefits of UL for SL
The notation of Sec. 2 introduced teacher-given labels d_t. Many papers of the previous millennium, how-
ever, were about unsupervised learning (UL) without a teacher (e.g., Hebb, 1949; von der Malsburg, 1973;
Kohonen, 1972, 1982, 1988; Willshaw and von der Malsburg, 1976; Grossberg, 1976a,b; Watanabe, 1985;
Pearlmutter and Hinton, 1986; Barrow, 1987; Field, 1987; Oja, 1989; Barlow et al., 1989; Baldi and Hornik,
1989; Rubner and Tavan, 1989; Sanger, 1989; Ritter and Kohonen, 1989; Rubner and Schulten, 1990;
Földiák, 1990; Martinetz et al., 1990; Kosko, 1990; Mozer, 1991; Palm, 1992; Atick et al., 1992; Miller,
1994; Saund, 1994; Földiák and Young, 1995; Deco and Parra, 1997); see also post-2000 work (e.g.,
Carreira-Perpinan, 2001; Wiskott and Sejnowski, 2002; Franzius et al., 2007; Waydo and Koch, 2008).
Many UL methods are designed to maximize entropy-related, information-theoretic (Boltzmann, 1909;
Shannon, 1948; Kullback and Leibler, 1951) objectives (e.g., Linsker, 1988; Barlow et al., 1989; MacKay
and Miller, 1990; Plumbley, 1991; Schmidhuber, 1992b,c; Schraudolph and Sejnowski, 1993; Redlich,
1993; Zemel, 1993; Zemel and Hinton, 1994; Field, 1994; Hinton et al., 1995; Dayan and Zemel, 1995;
Amari et al., 1996; Deco and Parra, 1997). Many do this to uncover and disentangle hidden underlying
sources of signals (e.g., Jutten and Herault, 1991; Schuster, 1992; Andrade et al., 1993; Molgedey and
Schuster, 1994; Comon, 1994; Cardoso, 1994; Bell and Sejnowski, 1995; Karhunen and Joutsensalo, 1995;
Belouchrani et al., 1997; Hyvärinen et al., 2001; Szabó et al., 2006). UL can also serve to extract invariant
features from different data items (e.g., Becker, 1991; Schmidhuber and Prelinger, 1993; Taylor et al.,
2011) through coupled NNs (also called Siamese NNs, e.g., Bromley et al., 1993; Hadsell et al., 2006).
Many UL methods automatically and robustly generate distributed, sparse representations of input pat-
terns (Földiák, 1990; Hinton and Ghahramani, 1997; Lewicki and Olshausen, 1998; Hyvärinen et al., 1999;
Hochreiter and Schmidhuber, 1999; Falconbridge et al., 2006) through well-known feature detectors (e.g.,
Olshausen and Field, 1996; Schmidhuber et al., 1996), such as off-center-on-surround-like structures, as
well as orientation sensitive edge detectors and Gabor filters (Gabor, 1946). They extract simple features
related to those observed in early visual pre-processing stages of biological systems (e.g., De Valois et al.,
1982; Jones and Palmer, 1987).
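To make the last point concrete, the following minimal numpy sketch builds an oriented Gabor kernel in the standard textbook form (a Gaussian envelope multiplied by a sinusoidal carrier); the function name and default parameters are illustrative choices, not taken from any of the cited works.

```python
# Minimal sketch of a 2-D Gabor filter kernel (standard textbook form),
# shown only to illustrate the kind of oriented feature detector mentioned above.
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lam=6.0, gamma=0.5, psi=0.0):
    """Oriented Gabor kernel: Gaussian envelope times sinusoidal carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / lam + psi)
    return envelope * carrier

edge_detector = gabor_kernel(theta=np.pi / 4)     # 45-degree oriented filter
print(edge_detector.shape)                        # (15, 15)
```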
UL can help to encode input data in a form advantageous for further processing. In the context of
DL, one important goal of UL is redundancy reduction. Ideally, given an ensemble of input patterns,
redundancy reduction through a deep NN will create a factorial code (a code with statistically independent
components) of the ensemble (Barlow et al., 1989; Barlow, 1989), to disentangle the unknown factors of
variation (compare Bengio et al., 2013). Such codes may be sparse and can be advantageous for (1) data
compression, (2) speeding up subsequent BP (Becker, 1991), (3) trivialising the task of subsequent naive
yet optimal Bayes classifiers (Schmidhuber et al., 1996).
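The defining property of a factorial code, statistical independence of its components, can be checked directly. The following minimal numpy sketch (a hypothetical toy example, not taken from the cited works) estimates pairwise mutual information between binary code components; for an ideal factorial code these estimates approach zero, while redundant components reveal themselves by large values.

```python
# Toy check of the factorial-code property: estimate pairwise mutual
# information (in nats) between binary code components from samples.
import numpy as np

def pairwise_mutual_information(codes):
    """codes: (n_patterns, n_units) array of binary (0/1) code activations."""
    n, m = codes.shape
    mi = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            joint = np.zeros((2, 2))
            for a in (0, 1):
                for b in (0, 1):
                    joint[a, b] = np.mean((codes[:, i] == a) & (codes[:, j] == b))
            pi, pj = joint.sum(axis=1), joint.sum(axis=0)
            for a in (0, 1):
                for b in (0, 1):
                    if joint[a, b] > 0:
                        mi[i, j] += joint[a, b] * np.log(joint[a, b] / (pi[a] * pj[b]))
    return mi

rng = np.random.default_rng(0)
factorial = rng.integers(0, 2, size=(10000, 4))              # independent components
redundant = np.hstack([factorial[:, :2], factorial[:, :2]])  # duplicated components
print(pairwise_mutual_information(factorial).max())   # close to 0: factorial code
print(pairwise_mutual_information(redundant).max())   # close to log 2: redundant code
```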
Most early UL FNNs had a single layer. Methods for deeper UL FNNs include hierarchical (Sec. 4.3)
self-organizing Kohonen maps (e.g., Koikkalainen and Oja, 1990; Lampinen and Oja, 1992; Versino and
Gambardella, 1996; Dittenbach et al., 2000; Rauber et al., 2002), hierarchical Gaussian potential function
networks (Lee and Kil, 1991), the Self-Organising Tree Algorithm (SOTA) (Herrero et al., 2001), and
nonlinear Autoencoders (AEs) with more than 3 (e.g., 5) layers (Kramer, 1991; Oja, 1991; DeMers and
Cottrell, 1993). Such AE NNs (Rumelhart et al., 1986) can be trained to map input patterns to themselves,
for example, by compactly encoding them through activations of units of a narrow bottleneck hidden layer.
See (Baldi, 2012) for limitations of certain nonlinear AEs.
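To illustrate the bottleneck principle, here is a minimal numpy sketch (a simplified toy example under assumed settings, not a reimplementation of any cited architecture): a single narrow tanh hidden layer is trained by plain gradient descent to reconstruct its own inputs.

```python
# Minimal bottleneck autoencoder sketch: 8-D inputs are mapped to themselves
# through a 2-unit hidden layer, trained with squared reconstruction error.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))   # 8-D data on a 2-D subspace
n_in, n_hid = X.shape[1], 2                               # bottleneck width 2

W1 = rng.normal(scale=0.1, size=(n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.1, size=(n_hid, n_in)); b2 = np.zeros(n_in)
lr = 0.01

for step in range(5000):
    H = np.tanh(X @ W1 + b1)            # bottleneck code
    Y = H @ W2 + b2                     # reconstruction of the input
    err = Y - X                         # d(loss)/dY for squared error
    # backpropagate the reconstruction error through decoder and encoder
    dW2 = H.T @ err / len(X); db2 = err.mean(axis=0)
    dH = err @ W2.T * (1 - H ** 2)      # tanh derivative
    dW1 = X.T @ dH / len(X); db1 = dH.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("reconstruction MSE:", np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - X) ** 2))
```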
Other nonlinear UL methods include Predictability Minimization (PM) (Schmidhuber, 1992c), where
nonlinear feature detectors fight nonlinear predictors, trying to become both informative and as unpre-
dictable as possible, and LOCOCODE (Hochreiter and Schmidhuber, 1999), where FMS (Sec. 5.6.3) finds
low-complexity AEs with low-precision weights describable by few bits of information, often yielding
sparse or factorial codes. PM-based UL was applied not only to FNNs but also to RNNs (e.g., Schmidhu-
ber, 1993b; Lindstädt, 1993a,b). Compare Sec. 5.10 on UL-based RNN stacks (1991), as well as later UL
RNNs (e.g., Klapper-Rybicka et al., 2001; Steil, 2007).
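In simplified form (a sketch that omits the additional terms used in practice to keep code units informative and within bounds), the PM game can be written as follows. Given code units $y_1, \dots, y_m$ produced by the feature detectors, each predictor $P_i$ is trained to minimize its squared prediction error
\[
\min_{P_i} \; E\big[(P_i(y_1,\dots,y_{i-1},y_{i+1},\dots,y_m) - y_i)^2\big],
\]
while the feature detectors are trained to maximize the total prediction error
\[
\max \; \sum_{i=1}^{m} E\big[(P_i(y_1,\dots,y_{i-1},y_{i+1},\dots,y_m) - y_i)^2\big].
\]
At equilibrium each $P_i$ approximates the conditional expectation of $y_i$ given the other code units, so the feature detectors are pushed toward codes whose components carry information that cannot be predicted from the remaining components.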