Hardware Implementation of KLMS Algorithm using FPGA
Xiaowei Ren, Pengju Ren, Badong Chen, Tai Min and Nanning Zheng
Abstract— Fast and accurate machine learning algorithms are needed in many physical applications, but their learning efficiency is severely limited by intensive computation. Since hardware implementation can speed up computation effectively, we use an FPGA platform to implement an on-line kernel learning algorithm, namely the kernel least mean square (KLMS), which adopts the simple survival kernel as its Mercer kernel. By using an on-line quantization method and pipelining, the hardware resource requirements and computational burden are reduced significantly, and the data processing speed is accelerated markedly without loss of accuracy. Finally, a 128-way parallel FPGA platform running at 200 MHz is implemented. It achieves an average speedup of 6553 over Matlab running on a 3 GHz Intel(R) Core(TM) i5-2320 CPU.
Xiaowei Ren, Pengju Ren, Badong Chen and Nanning Zheng are with the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an 710049, China (email: {renxiaowei66, pengjuren}@gmail.com, {chenbd, nnzheng}@mail.xjtu.edu.cn). Tai Min is with IMEC, Kapeldreef 75, B-3001 Leuven, Belgium (email: tmdfz@hotmail.com).
This research was supported by NSFC grants No. 61372152 and No. 610303036, China Postdoctoral Science Foundation No. 2012M521777, the Specialized Research Fund for the Doctoral Program of Higher Education of China No. 20130201120024, the Natural Science Basic Research Plan in Shaanxi Province of China No. 2013JQ8029, and the Fundamental Research Funds for the Central Universities.
I. INTRODUCTION
KERNEL adaptive filters (KAFs) [1] are a family of nonlinear adaptive filtering algorithms that have been applied successfully to machine learning [2] and signal processing [3] over the past few years; they include KLMS [4] [5], kernel recursive least squares (KRLS) [6] and the kernel affine projection algorithms (KAPAs) [7]. Among these algorithms, KLMS is the simplest, and it is easy to implement without losing effectiveness.
However, when using kernel adaptive filters, two critical issues must be considered carefully. The first is choosing a suitable kernel, such as the Gaussian kernel [8] or a multiple-kernel scheme [9], to ensure good performance of the algorithm. The second is that all kernel adaptive filters suffer from a constantly growing network size, which imposes a serious memory and computation burden. The approximate linear dependency (ALD) criterion [6], the surprise criterion (SC) [10], the prediction variance criterion [11] and quantization methods [12] are the main techniques that have been put forward to constrain the network size.
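To make the quantization idea concrete, here is a minimal sketch in Python (our own illustration rather than any reference implementation; the Euclidean distance, the step size eta and the quantization size eps_q are assumptions): a new input is merged into its nearest existing center whenever it falls within the quantization region, so the network grows only when genuinely novel data arrive.

```python
import numpy as np

def quantized_update(centers, coeffs, x, e, eta, eps_q):
    """One quantized update: reuse the nearest center when x falls within eps_q."""
    x = np.asarray(x, dtype=float)
    if centers:
        dists = [np.linalg.norm(x - c) for c in centers]
        j = int(np.argmin(dists))
        if dists[j] <= eps_q:
            # Quantize: absorb x into its nearest center by updating that
            # coefficient, so the network size stays fixed for this sample.
            coeffs[j] += eta * e
            return centers, coeffs
    # Input is novel enough: grow the network by one unit, as in plain KLMS.
    centers.append(x)
    coeffs.append(eta * e)
    return centers, coeffs
```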
While various techniques have been put forward to reduce the complexity of machine learning methods, intensive computation remains the critical restriction on on-line (real-time) learning. Noting that hardware devices can accelerate mathematical operations by orders of magnitude [13] [14] [15], we consider implementing such
algorithms on a hardware platform instead of with conventional software methods. Meanwhile, specific work has been done to improve the performance of software algorithms through hardware implementation. For example, to cope with intensive computation, a VLSI dynamic codebook generator and encoder for image compression applications is described in [16]. Furthermore, a VLSI design is presented in [17] that makes hierarchical vector quantization (HVQ) cost-effective and computationally efficient.
In this paper, we implement an FPGA processing element (PE) for KLMS. The kernel we choose is a new Mercer kernel, namely the survival kernel [18], which is well suited to on-line KLMS because it is parameter-free and computationally simple. Meanwhile, we adopt a quantization approach [12] (the resulting algorithm is called QKLMS) to relax the memory and computation burden while preserving the accuracy of the algorithm. Moreover, pipelining [19] [20] is applied to exploit the concurrency within and between operations and to increase the resource reuse rate. Finally, at very low hardware cost, we implement a parallel FPGA platform that processes 128-way data training simultaneously. Running at 200 MHz, it is 6553 times faster than Matlab running on a 3 GHz Intel(R) Core(TM) i5-2320 CPU.
The remainder of this paper is organized as follows. Section II gives a brief description of the KLMS algorithm, the quantization method and the survival kernel. The architecture of the processing element is elaborated in Section III. In Section IV, the performance evaluation and implementation results are presented. Finally, this work is concluded in Section V.
II. KLMS, QKLMS AND SURVIVAL KERNEL
In this section, we present some basic background information related to our work. The first topic is KLMS, the kernel learning method that we implement on an FPGA in this paper. Then the quantization approach used to reduce the network size, which yields QKLMS, is briefly described. We also introduce the survival kernel that we choose.
A. KLMS
In fact, KLMS is a stochastic gradient algorithm for solving the least-squares (LS) problem in reproducing kernel Hilbert spaces (RKHS). A Mercer kernel is a continuous, symmetric, positive-definite function defined on X × X, i.e. κ : X × X → R, written κ(x_m, x_n). By Mercer's theorem, any Mercer kernel induces a mapping Ψ between the input space X and a feature space F (which is an inner product space) such that:

κ(x_m, x_n) = Ψ(x_m)^T Ψ(x_n)    (1)
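With the kernel in hand, the standard KLMS recursion at step n forms the prediction f_{n-1}(x_n) = Σ_i a_i κ(c_i, x_n) over the stored centers c_i, computes the error e_n = d_n − f_{n-1}(x_n), and allocates a new center x_n with coefficient a_n = η e_n. A minimal Python sketch follows (our own illustration, not the hardware design; the Gaussian kernel and the step size η = 0.5 are placeholder assumptions — this paper adopts the survival kernel instead):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Placeholder Mercer kernel; this paper uses the survival kernel instead."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

class KLMS:
    def __init__(self, eta=0.5, kernel=gaussian_kernel):
        self.eta = eta        # step size
        self.kernel = kernel
        self.centers = []     # stored inputs c_i (the growing network)
        self.coeffs = []      # coefficients a_i = eta * e_i

    def predict(self, x):
        # f(x) = sum_i a_i * kappa(c_i, x)
        return sum(a * self.kernel(c, x)
                   for a, c in zip(self.coeffs, self.centers))

    def update(self, x, d):
        e = d - self.predict(x)   # prediction error e_n
        self.centers.append(x)    # without quantization: one new center per sample
        self.coeffs.append(self.eta * e)
        return e
```

Note that both the memory footprint and the per-sample cost of this recursion grow linearly with the number of processed samples, which is exactly the burden the quantization method described below relieves.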