Hardware Implementation of KLMS Algorithm using FPGA
Xiaowei Ren, Pengju Ren, Badong Chen, Tai Min and Nanning Zheng
Abstract— Fast and accurate machine learning algorithms are needed in many physical applications, but their learning efficiency is severely limited by intensive computation. Since hardware implementation can speed up computation effectively, we use an FPGA platform to implement an on-line kernel learning algorithm, namely the kernel least mean square (KLMS), which adopts the simple survival kernel as its Mercer kernel. By using an on-line quantization method and pipelining, the hardware resource requirements and computational burden are reduced significantly, and the data processing speed is accelerated markedly without loss of accuracy. Finally, a 128-way parallel FPGA platform running at 200 MHz is implemented. It achieves an average speedup of 6553 over Matlab running on a 3 GHz Intel(R) Core(TM) i5-2320 CPU.
Xiaowei Ren, Pengju Ren, Badong Chen and Nanning Zheng are with the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an 710049, China (email: {renxiaowei66, pengjuren}@gmail.com, {chenbd, nnzheng}@mail.xjtu.edu.cn). Tai Min is with IMEC, Kapeldreef 75, B-3001 Leuven, Belgium (email: tmdfz@hotmail.com).
This research was supported by NSFC grants No. 61372152 and No. 610303036, China Postdoctoral Science Foundation No. 2012M521777, the Specialized Research Fund for the Doctoral Program of Higher Education of China No. 20130201120024, the Natural Science Basic Research Plan in Shaanxi Province of China No. 2013JQ8029, and the Fundamental Research Funds for the Central Universities.
I. INTRODUCTION
KERNEL adaptive filters (KAFs) [1] are a family of nonlinear adaptive filtering algorithms that have been applied successfully to machine learning [2] and signal processing [3] over the past few years; they include KLMS [4] [5], kernel recursive least squares (KRLS) [6] and the kernel affine projection algorithms (KAPAs) [7]. Among these algorithms, KLMS is the simplest, and it is easy to implement without losing effectiveness.
However, when using kernel adaptive filters, two critical issues must be considered carefully. The first is choosing a suitable kernel, such as the Gaussian kernel [8] or a multiple-kernel scheme [9], to ensure good performance of the algorithm. The second is that all kernel adaptive filters suffer from a constantly growing network size, which imposes a serious memory and computation burden. The approximate linear dependency (ALD) criterion [6], the surprise criterion (SC) [10], the prediction variance criterion [11] and quantization methods [12] are the main techniques that have been put forward to constrain the network size.
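To make the quantization idea concrete, here is a minimal sketch in Python (our own illustration rather than any reference implementation; the Euclidean distance, the step size eta and the quantization size eps_q are assumptions): a new input is merged into its nearest existing center whenever it falls within the quantization region, so the network grows only when genuinely novel data arrive.

```python
import numpy as np

def quantized_update(centers, coeffs, x, e, eta, eps_q):
    """One quantized update: reuse the nearest center when x falls within eps_q."""
    x = np.asarray(x, dtype=float)
    if centers:
        dists = [np.linalg.norm(x - c) for c in centers]
        j = int(np.argmin(dists))
        if dists[j] <= eps_q:
            # Quantize: absorb x into its nearest center by updating that
            # coefficient, so the network size stays fixed for this sample.
            coeffs[j] += eta * e
            return centers, coeffs
    # Input is novel enough: grow the network by one unit, as in plain KLMS.
    centers.append(x)
    coeffs.append(eta * e)
    return centers, coeffs
```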
While various techniques have been put forward to reduce the complexity of machine learning methods, intensive computation remains the critical restriction on on-line (real-time) learning. Noting that hardware devices can accelerate mathematical operations by orders of magnitude [13] [14] [15], we consider implementing such
algorithms on a hardware platform instead of with conventional software methods. Meanwhile, specific work has been done to improve the performance of software algorithms through hardware implementation. For example, to cope with intensive computation, a VLSI dynamic codebook generator and encoder for image compression applications is described in [16]. Furthermore, a VLSI design is presented in [17] that makes hierarchical vector quantization (HVQ) cost-effective and computationally efficient.
In this paper, we implement an FPGA processing element (PE) for KLMS. The kernel we choose is a new Mercer kernel, namely the survival kernel [18], which is well suited to on-line KLMS because it is parameter-free and computationally simple. Meanwhile, we adopt a quantization approach [12] (the resulting algorithm is called QKLMS) to relax the memory and computation burden while preserving the accuracy of the algorithm. Moreover, pipelining [19] [20] is applied to exploit the concurrency within and between operations and to increase the resource reuse rate. Finally, at very low hardware cost, we implement a parallel FPGA platform that processes 128-way data training simultaneously. Running at 200 MHz, it is 6553 times faster than Matlab running on a 3 GHz Intel(R) Core(TM) i5-2320 CPU.
The remainder of this paper is organized as follows. Section II gives a brief description of the KLMS algorithm, the quantization method and the survival kernel. The architecture of the processing element is elaborated in Section III. In Section IV, the performance evaluation and implementation results are presented. Finally, this work is concluded in Section V.
II. KLMS, QKLMS AND SURVIVAL KERNEL
In this section, we present some basic background information related to our work. The first topic is KLMS, the kernel learning method that we implement on an FPGA in this paper. Then the quantization approach used to reduce the network size, which yields QKLMS, is briefly described. We also introduce the survival kernel that we choose.
A. KLMS
In fact, KLMS is a stochastic gradient algorithm for solving the least-squares (LS) problem in reproducing kernel Hilbert spaces (RKHS). A Mercer kernel is a continuous, symmetric, positive-definite function defined on X × X, i.e. κ : X × X → R, written κ(x_m, x_n). By Mercer's theorem, any Mercer kernel induces a mapping Ψ between the input space X and a feature space F (which is an inner product space) such that:

κ(x_m, x_n) = Ψ(x_m)^T Ψ(x_n)    (1)
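With the kernel in hand, the standard KLMS recursion at step n forms the prediction f_{n-1}(x_n) = Σ_i a_i κ(c_i, x_n) over the stored centers c_i, computes the error e_n = d_n − f_{n-1}(x_n), and allocates a new center x_n with coefficient a_n = η e_n. A minimal Python sketch follows (our own illustration, not the hardware design; the Gaussian kernel and the step size η = 0.5 are placeholder assumptions — this paper adopts the survival kernel instead):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Placeholder Mercer kernel; this paper uses the survival kernel instead."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

class KLMS:
    def __init__(self, eta=0.5, kernel=gaussian_kernel):
        self.eta = eta        # step size
        self.kernel = kernel
        self.centers = []     # stored inputs c_i (the growing network)
        self.coeffs = []      # coefficients a_i = eta * e_i

    def predict(self, x):
        # f(x) = sum_i a_i * kappa(c_i, x)
        return sum(a * self.kernel(c, x)
                   for a, c in zip(self.coeffs, self.centers))

    def update(self, x, d):
        e = d - self.predict(x)   # prediction error e_n
        self.centers.append(x)    # without quantization: one new center per sample
        self.coeffs.append(self.eta * e)
        return e
```

Note that both the memory footprint and the per-sample cost of this recursion grow linearly with the number of processed samples, which is exactly the burden the quantization method described below relieves.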