Online gradient method with smoothing ℓ0 regularization for feedforward neural networks
Huisheng Zhang⁎, Yanli Tang
Department of Mathematics, Dalian Maritime University, Dalian 116026, China
ARTICLE INFO
Communicated by Marcello Sanguineti
Keywords:
Online learning
Gradient training algorithm
Smoothing ℓ0 regularization
Feedforward neural networks
Convergence
Sparsity
ABSTRACT
ℓp regularization has been a popular pruning method for neural networks. The parameter p has usually been set as 0 < p ≤ 2 in the literature, and practical training algorithms with ℓ0 regularization are lacking due to the NP-hard nature of the ℓ0 regularization problem; however, ℓ0 regularization tends to produce the sparsest solution, corresponding to the most parsimonious network structure, which is desirable in view of the generalization ability. To this end, this paper considers an online gradient training algorithm with smoothing ℓ0 regularization (OGTSL0) for feedforward neural networks, where the ℓ0 regularizer is approximated by a series of smoothing functions. The underlying principle for the sparsity of OGTSL0 is provided, and the convergence of the algorithm is also theoretically analyzed. Simulation examples support the theoretical analysis and illustrate the superiority of the proposed algorithm.
1. Introduction
Multilayer feedforward neural networks (FNNs) have been widely used in various fields [1,2]. The training of FNNs can be reduced to solving nonlinear least squares problems, to which numerous traditional numerical methods, such as the gradient descent method, Newton's method [3], the conjugate gradient method [4], extended Kalman filtering [5], the Levenberg-Marquardt method [6], etc., can be applied. Among these training methods, the backpropagation algorithm, which is derived from the gradient descent rule, has become one of the most popular training strategies for its simplicity and ease of implementation [7]. Gradient-based learning can be implemented in two practical ways: batch learning and online learning [8]. The batch learning approach accumulates the weight corrections over all training samples before actually performing the update, whereas the online learning approach updates the network weights immediately after each training sample is fed. Thus, the batch gradient training method corresponds to the standard gradient descent algorithm, while the online gradient training method directly makes use of the instantaneous approximate gradient information and is particularly effective when dealing with big or redundant data [9]. Besides gradient-based learning, extreme learning machine has also been proposed and investigated as another effective learning strategy, in both batch mode [10] and online mode [11].
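To make the distinction concrete, the following minimal sketch (not taken from the paper) contrasts one batch update, which accumulates the gradient over all samples before updating, with one online pass, which updates after every sample; the linear least-squares model, the learning rate eta, and the data X, y are illustrative assumptions only.

```python
# Minimal sketch (illustrative only): batch vs. online gradient updates
# on a simple linear least-squares problem.
import numpy as np

def batch_gradient_step(w, X, y, eta):
    """Accumulate the gradient over ALL samples, then update once."""
    grad = np.zeros_like(w)
    for x_i, y_i in zip(X, y):
        grad += (x_i @ w - y_i) * x_i   # gradient of 0.5*(x_i.w - y_i)^2
    return w - eta * grad               # single update per epoch

def online_gradient_epoch(w, X, y, eta):
    """Update immediately after each training sample is fed."""
    for x_i, y_i in zip(X, y):
        grad_i = (x_i @ w - y_i) * x_i  # instantaneous (approximate) gradient
        w = w - eta * grad_i            # one update per sample
    return w

# Usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.01 * rng.normal(size=100)
w = np.zeros(5)
for _ in range(50):
    w = online_gradient_epoch(w, X, y, eta=0.01)
```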
An appropriate network size is crucial to learning effectiveness in real applications. A network that is too small cannot learn the data sufficiently, whereas one that is too large easily leads to the well-known overfitting problem and poor generalization. Although there have been many related works in the literature, it is still hard to give an accurate formula for the optimal network size [12]. Two practical approaches are used instead: one is the constructive method, which starts with a minimal network and adds new nodes until the training results are acceptable [13]; the other is the pruning method, which starts with an oversized network and then removes the unimportant nodes or weights [14].
ℓp regularization learning is such a popular pruning method, aiming at optimizing the network structure and weights simultaneously [15–20]. By adding an ℓp regularization term to the common error function Ẽ(w), the modified error function takes the form

$$E(\mathbf{w}) = \tilde{E}(\mathbf{w}) + \lambda \|\mathbf{w}\|_p^p, \qquad (1)$$

where λ is the regularization coefficient to balance the tradeoff between the training accuracy and the network complexity, and ‖·‖p is the usual ℓp
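For illustration only (this code does not appear in the paper), a short NumPy sketch of the penalized error in Eq. (1) and its gradient is given below; base_error, base_grad, lam, and p are hypothetical placeholders for the common error function Ẽ, its gradient, the regularization coefficient λ, and the exponent p.

```python
# Illustrative sketch of Eq. (1): E(w) = E_tilde(w) + lam * ||w||_p^p.
import numpy as np

def regularized_error(w, base_error, lam, p):
    """Common error plus the l_p penalty lam * sum_i |w_i|^p."""
    return base_error(w) + lam * np.sum(np.abs(w) ** p)

def regularized_gradient(w, base_grad, lam, p):
    """Gradient of Eq. (1) where the penalty is differentiable (w_i != 0)."""
    penalty_grad = lam * p * np.sign(w) * np.abs(w) ** (p - 1)
    return base_grad(w) + penalty_grad

# Example: p = 2 recovers weight decay, whose penalty gradient is 2*lam*w.
w = np.array([0.5, -1.0, 2.0])
print(regularized_error(w, lambda v: 0.5 * np.sum(v**2), lam=0.1, p=2))
print(regularized_gradient(w, lambda v: v, lam=0.1, p=2))
```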
norm. The well-known "weight decay" technique corresponds exactly to ℓ2 regularization, and it has been shown to be effective in controlling the magnitude of the network weights and improving the generalization performance of the trained networks [15,21–25]. However, "weight decay" does not actually prune the network, in the sense that ℓ2 regularization produces almost no sparse solutions. According to regularization theory, an ℓp regularization method produces sparse solutions only when 0 ≤ p ≤ 1, and the smaller the p, the sparser the solution [26–30]. Peter M. Williams and Ishikawa proposed to use ℓ1
http://dx.doi.org/10.1016/j.neucom.2016.10.057
Received 15 October 2015; Received in revised form 17 October 2016; Accepted 30 October 2016; Available online 3 November 2016
⁎ Corresponding author. E-mail address: zhhuisheng@dlmu.edu.cn (H. Zhang).
Neurocomputing 224 (2017) 1–8
0925-2312/© 2016 Elsevier B.V. All rights reserved.