Parallel Non-Local Means Denoising with OpenMP and OpenCL on the Intel Xeon Phi

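This paper describes how the Non-Local Means (NLM) denoising algorithm can be parallelized with OpenMP and OpenCL on the Intel Xeon Phi Coprocessor. NLM is widely used in image processing: it computes similarity weights between pixels and averages over regions with similar features, which reduces noise and improves image quality. The weight computation is expensive, and the cost of the serial algorithm grows quickly with image resolution.

To exploit the many cores and wide vector units of the Xeon Phi, the authors use two parallel programming models. OpenMP is a directive-based API for C, C++, and Fortran that lets developers add multithreading to shared-memory programs with a few annotations. OpenCL is a cross-platform parallel computing framework that targets CPUs, GPUs, FPGAs, and other accelerators. On the Many Integrated Core (MIC) architecture of the Xeon Phi, the paper implements NLM with each model separately and compares the two.

The authors first analyze how the NLM algorithm decomposes into independent units of work, such as the similarity weight computation for each pixel. The OpenMP version distributes this work across the coprocessor's cores as threads, while the OpenCL version expresses it as kernels that run on the Xeon Phi and, for comparison, on a GPU. The paper covers the implementation steps of both strategies, including data organization, communication optimization, and synchronization control.

The evaluation compares processing time against the serial version and across CPU, GPU, and MIC platforms for images of different sizes. The work shows how current parallel programming models can speed up an image processing algorithm in a high-performance computing environment, and it gives later researchers a performance baseline and a worked case for parallel NLM denoising on the Intel Xeon Phi.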

Journal of Computational Science (2016) 591–598
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/jocs

A parallel Non-Local means denoising algorithm implementation with OpenMP and OpenCL on Intel Xeon Phi Coprocessor

Huming Zhu∗, Yanfei Wu, Pei Li, Duo Wang, Wei Shi, Peng Zhang, Licheng Jiao

Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, International Research Center for Intelligent Perception and Computation, Xidian University, Xi’an, Shaanxi Province 710071, China

Article history:
Received October 2015
Received in revised form April 2016
Accepted July 2016
Available online July 2016

Keywords: Parallel algorithm; Non-Local means denoising; OpenMP; OpenCL; MIC

The Non-Local means (NLM) denoising algorithm calculates the similarity weight between the pixels being denoised and the pixels of a search area by means of a similarity function. For texture and edge regions, NLM performs better than many other existing denoising algorithms because it uses the redundant information in images. However, the NLM algorithm is slow because of its large computational cost. Recently, the Intel Xeon Phi Coprocessor (based on the Intel Many Integrated Core architecture, MIC) has shown a large advantage in speeding up computation. We therefore design parallel strategies in OpenMP and OpenCL, based on the serial NLM algorithm, for the MIC architecture, and conduct experiments on CPU, GPU, and MIC with images of different sizes. The experiments suggest that the OpenMP-based NLM algorithm performs better on Xeon Phi 7120 than on Xeon 2692 when the image size is greater than or equal to 1024 × 1024, that the OpenCL-based NLM algorithm performs better on Xeon Phi 7120 than on the NVIDIA Kepler K20M GPU, and that the OpenCL-based NLM algorithm performs a little better than the OpenMP-based NLM algorithm when both are implemented on Intel Xeon Phi 7120.

© 2016 Elsevier B.V. All rights reserved.
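The similarity weight mentioned in the abstract is commonly written as follows. This is the standard formulation associated with Buades et al. [2], not notation taken from this excerpt, so the symbols are assumptions: $v$ is the noisy image, $N_i$ the patch centred at pixel $i$, $S_i$ the search window around $i$, $h$ the filtering parameter, and $a$ the standard deviation of the Gaussian kernel that weights the patch norm.

```latex
\begin{align}
\mathrm{NL}[v](i) &= \sum_{j \in S_i} w(i,j)\, v(j), \\
w(i,j) &= \frac{1}{Z(i)} \exp\!\left(-\frac{\lVert v(N_i) - v(N_j) \rVert_{2,a}^{2}}{h^{2}}\right), \\
Z(i) &= \sum_{j \in S_i} \exp\!\left(-\frac{\lVert v(N_i) - v(N_j) \rVert_{2,a}^{2}}{h^{2}}\right).
\end{align}
```

Because $Z(i)$ normalizes the weights to sum to one, each output pixel is a convex combination of the pixels in its search window, and the computation for one pixel does not depend on the output of any other pixel.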

∗ Corresponding author.
E-mail addresses: zhuhum@mail.xidian.edu.cn (H. Zhu), wuyanfei@stu.xidian.edu.cn (Y. Wu), 1570558611@qq.com (P. Li), 770453932@qq.com (D. Wang), 1063929728@qq.com (W. Shi), 1943379881@qq.com (P. Zhang), jlcxidian@163.com (L. Jiao).

Introduction

Image denoising is the process of recovering the original image from a noisy image. In recent years, a variety of image denoising algorithms have been proposed by the academic community; they are generally classified as spatial-domain or frequency-domain methods. Classical spatial-domain algorithms include isotropic linear filtering, median filtering, etc. In contrast, some mature algorithms belong to the frequency-domain class, such as Wiener filtering and the wavelet thresholding method [1]. In 2005, Buades et al. proposed a representative filtering method called the Non-Local Means (NLM) denoising algorithm [2], which is an improvement on bilateral filtering. However, NLM is computationally demanding, as each noisy pixel is replaced by a weighted average of all the pixels in a large window or in the whole image. Therefore, many variants have arisen to decrease the computational time, for example by exploiting the algorithm's highly parallelizable nature or by reducing the computational time for a single pixel location. In 2006, Wang et al. proposed a fast Non-Local Means denoising algorithm [3], in which the Fast Fourier Transform (FFT) is used to accelerate the weight calculation. In 2009, Karnati et al. proposed a modified multi-resolution pyramid architecture to accelerate the computation of window similarity [4]. In 2014, Bhujle et al. proposed a novel speed-up strategy that builds a dictionary of similar patches very quickly and reduces the computational cost [5]. Given the complexity of the computation, none of the algorithms mentioned above can meet real-time requirements, so accelerating the image denoising algorithm is of practical significance. Fortunately, the development of computing platforms makes it possible to solve this problem. In 2013, Palma et al. proposed a fully [...] NLM denoising on a multi-GPU architecture that meets the requirement of acceptable performance for real-time scenarios [6].
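The "highly parallelizable nature" mentioned above follows from the fact that each output pixel depends only on the noisy input. A minimal sketch of that row-wise decomposition is given below in pure Python. It is not the authors' implementation: the function names (`pixel`, `nlm_pixel`, `nlm_row`, `nlm_parallel`) and the parameters `f` (patch radius), `s` (search radius), `h` (filtering strength), and `workers` are hypothetical, the patch distance is a plain mean squared difference rather than a Gaussian-weighted norm, and borders are handled by clamping coordinates. `ThreadPoolExecutor` stands in for the kind of loop-level parallelism that an OpenMP `parallel for` directive provides in C; under CPython it illustrates the independence of rows rather than an actual speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def pixel(img, r, c):
    """Read img[r][c] with coordinates clamped to the image border."""
    r = min(max(r, 0), len(img) - 1)
    c = min(max(c, 0), len(img[0]) - 1)
    return img[r][c]


def nlm_pixel(img, i, j, f=1, s=2, h=10.0):
    """Denoise pixel (i, j) as a weighted average over its search window.

    f, s and h are illustrative parameters, not values from the paper.
    """
    weight_sum = 0.0
    value_sum = 0.0
    for k in range(i - s, i + s + 1):
        for l in range(j - s, j + s + 1):
            # Mean squared difference between the patches around (i, j) and (k, l).
            dist = 0.0
            for di in range(-f, f + 1):
                for dj in range(-f, f + 1):
                    d = pixel(img, i + di, j + dj) - pixel(img, k + di, l + dj)
                    dist += d * d
            dist /= (2 * f + 1) ** 2
            w = math.exp(-dist / (h * h))
            weight_sum += w
            value_sum += w * pixel(img, k, l)
    # weight_sum >= 1 because the centre patch matches itself with weight 1.
    return value_sum / weight_sum


def nlm_row(img, i, f, s, h):
    """Denoise one row; rows read only the noisy input, never each other's output."""
    return [nlm_pixel(img, i, j, f, s, h) for j in range(len(img[0]))]


def nlm_parallel(img, f=1, s=2, h=10.0, workers=4):
    """Hand each row to a worker; pool.map returns rows in their original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: nlm_row(img, i, f, s, h), range(len(img))))
```

For an image given as a list of rows of floats, `nlm_parallel(img)` returns a denoised image of the same shape. Since no row writes to shared state, no synchronization is needed beyond collecting the results, which is the property that makes the outer loop a natural target for thread-level parallelism.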

The Intel MIC architecture is specially designed for high performance computing (HPC), which is regarded as the generation platform. The advantages of the MIC architecture are the simplicity of programming and the convenience of using existing tools [7]. The source code of an application can be modified more easily on MIC than on GPU owing to its similar framework to the CPU. The Intel tools, including the compiler, profiling tool, and debugging tool, can be used

http://dx.doi.org/10.1016/j.jocs.2016.07.001
1877-7503/© 2016 Elsevier B.V. All rights reserved.
