Convergence of batch gradient learning algorithm with smoothing $L_{1/2}$ regularization for Sigma–Pi–Sigma neural networks ☆
Yan Liu a,d, Zhengxue Li b,*, Dakun Yang c, Kh.Sh. Mohamed b, Jing Wang d, Wei Wu b
a School of Information Science and Engineering, Dalian Polytechnic University, Dalian 116034, China
b School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China
c School of Information Science and Technology, Sun Yat-sen University, Guangzhou 510006, China
d School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116024, China
Article info
Article history:
Received 9 May 2014
Received in revised form 29 July 2014
Accepted 15 September 2014
Communicated by M.-J. Er
Available online 30 September 2014
Keywords:
Sigma–Pi–Sigma neural networks
Batch gradient learning algorithm
Convergence
Smoothing $L_{1/2}$ regularization
Abstract
Sigma–Pi–Sigma neural networks are known to provide more powerful mapping capability than traditional feed-forward neural networks. The $L_{1/2}$ regularizer is very useful and efficient, and can be taken as a representative of the $L_q$ $(0 < q < 1)$ regularizers. However, the nonsmoothness of $L_{1/2}$ regularization may lead to an oscillation phenomenon. The aim of this paper is to develop a novel batch gradient method with smoothing $L_{1/2}$ regularization for Sigma–Pi–Sigma neural networks. Compared with the conventional gradient learning algorithm, this method produces sparser weights and a simpler structure, and it improves the learning efficiency. A comprehensive study of the weak and strong convergence results for this algorithm is also presented, indicating that the gradient of the error function goes to zero and the weight sequence goes to a fixed value, respectively.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
Sigma–Pi–Sigma neural networks (SPSNNs) are considered efficient high-order neural networks that can learn to implement the static mappings that multilayer neural networks and radial basis function networks usually do [1], since the output of an SPSNN takes the sum-of-product-of-sums form. A self-organizing
map of Sigma–Pi units was provided in [2]. The applicability of
networks built on Sigma–Pi units with Elman topology was
explored in [3]. A recurrent Sigma–Pi neural network was selected
as the network architecture providing strong dynamical properties
for the modelling of some non-linear time series [4]. The function
approximation capacity, convergence behavior and generalization
ability of sparselized Sigma–Pi networks were analyzed and
compared with those of first-order networks [5,6]. The ridge
polynomial neural network is a special type of higher-order neural
networks using a number of product units as its basic building
blocks, which not only provides a more efficient and regular
architecture, but also maintains the fast learning property and
powerful nonlinear mapping capability while avoiding the combi-
natorial increase in the number of required weights [7]. A binary
product-unit neural network was proposed in [8] in order to realize Boolean functions more efficiently.
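To make the sum-of-product-of-sums structure mentioned above concrete, the following is a minimal NumPy sketch of an SPSNN forward pass. It is only illustrative: the weight arrangement, the logistic output activation, the per-unit bias and the name `spsnn_output` are assumptions for this example, not the exact architecture defined later in the paper.

```python
import numpy as np

def spsnn_output(x, W, sigma=lambda t: 1.0 / (1.0 + np.exp(-t))):
    """Illustrative Sigma-Pi-Sigma forward pass (hypothetical indexing).

    x : input vector of length n.
    W : weights of shape (K, J, n + 1); each of the K Pi units multiplies
        the outputs of J first-layer Sigma units, each Sigma unit having
        its own weight vector plus a bias.
    Returns sigma( sum_k prod_j w_{kj} . [x, 1] ), i.e. a sum of
    products of weighted sums fed through an output unit.
    """
    x1 = np.append(x, 1.0)            # augment the input with a bias entry
    sums = W @ x1                     # first Sigma layer: shape (K, J)
    prods = np.prod(sums, axis=1)     # Pi layer: product within each group
    return sigma(np.sum(prods))       # final Sigma (output) unit
```

For instance, `spsnn_output(np.zeros(3), np.random.randn(4, 2, 4))` returns a single scalar network output for a 3-dimensional input.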
Much attention in neural computing has been paid recently to improving the structure of networks. The number of neurons is a crucial factor in the dynamic capability of feedforward networks. There are two common approaches to determining the appropriate
size for a network. The first is to start from a minimal network and
to increase the number of units, and the other is to start from a
maximum network and to prune it (e.g. [9–11]).
A penalty term is widely used for weight elimination when pruning feedforward neural networks; its purpose is to discourage the use of unnecessary connections. One of the simplest penalties added to the standard cost function is a term proportional to the $L_2$ norm of the weight vectors [12–15], which is used to discourage the weights from taking large values:
$$E = \tilde{E} + \lambda \| w \|_2^2, \qquad (1)$$
where $\tilde{E}$ is a standard cost function, $\lambda \| w \|_2^2$ is a penalty term, and $\lambda > 0$ is a scalar that determines the influence of the penalty term; $\| \cdot \|_2$ stands for the $L_2$ norm. This strategy is called $L_2$ regularization.
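As a rough illustration (not the algorithm analyzed in this paper), the sketch below shows how the penalty in Eq. (1) can be attached to a standard batch error term and how it enters a plain batch gradient step. The squared-error stand-in for $\tilde{E}$, the learning rate `eta`, and the function names are assumptions made for the example.

```python
import numpy as np

def l2_regularized_cost(residuals, w, lam):
    """Eq. (1)-style cost: a standard error term plus lam * ||w||_2^2.

    residuals : batch of (network output - target) values, standing in
                for the standard cost E~ via 0.5 * sum of squares.
    """
    E_tilde = 0.5 * np.sum(np.asarray(residuals) ** 2)
    return E_tilde + lam * np.dot(w, w)

def l2_batch_gradient_step(w, grad_E_tilde, lam, eta):
    """One batch gradient step on Eq. (1); the penalty contributes 2 * lam * w."""
    return w - eta * (grad_E_tilde + 2.0 * lam * w)
```

The smoothing $L_{1/2}$ method studied in this paper keeps this batch-update structure but replaces the $\| w \|_2^2$ penalty with a smoothed $L_{1/2}$ term, as developed in the following sections.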
Moreover, $L_q$ regularization [16] is widely used in parameter estimation and has recently been used as a feasible approach
☆ This work is supported by the National Natural Science Foundation of China (Nos. 61473059 and 61403056), the Fundamental Research Funds for the Central Universities of China, the Foundation of Liaoning Educational Committee (No. L2014218) and the Youth Foundation of Dalian Polytechnic University (QNJJ201308).
* Corresponding author.
E-mail address: lizx@dlut.edu.cn (Z. Li).