3.1.1 Single thread optimization
We start by optimizing CONV within a single thread. CONV
is computation-intensive and traverses its operands multiple
times, so it is critical to manage the layout of the data fed to
CONV to reduce the memory access overhead. We first revisit
the computation of CONV to illustrate our memory management
scheme. A 2D CONV in a CNN takes a 3D feature map (height ×
width × channels) and a number of 3D convolution kernels
(normally of smaller height and width but the same number of
channels) and convolves them to output another 3D tensor. The
calculation is illustrated in Figure 1 and implies loops over 6
dimensions: in_channel, kernel_height, kernel_width, out_channel,
out_height and out_width. Each kernel slides over the input
feature map along the height and width dimensions, performs an
element-wise product and accumulates the values to produce the
corresponding element of the output feature map, which can
naturally leverage FMA. The number of kernels forms out_channel.
Note that three of the dimensions (in_channel, kernel_height and
kernel_width) are reduction axes that cannot be embarrassingly
parallelized.
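For concreteness, the following is a minimal reference sketch of this direct convolution in plain Python/NumPy (unit stride, no padding; the function name and argument layout are ours for illustration, not the implementation used in this work):

import numpy as np

def conv2d_naive(data, kernels):
    """Direct 2D convolution (stride 1, no padding) over the six loops:
    out_channel, out_height, out_width, in_channel, kernel_height, kernel_width."""
    in_c, in_h, in_w = data.shape                # data: (in_c, in_h, in_w)
    out_c, _, k_h, k_w = kernels.shape           # kernels: (out_c, in_c, k_h, k_w)
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    out = np.zeros((out_c, out_h, out_w), dtype=data.dtype)
    for oc in range(out_c):
        for oh in range(out_h):
            for ow in range(out_w):
                for ic in range(in_c):           # the three reduction axes
                    for kh in range(k_h):
                        for kw in range(k_w):
                            out[oc, oh, ow] += (
                                data[ic, oh + kh, ow + kw] * kernels[oc, ic, kh, kw]
                            )
    return out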
Figure 1: The illustration of CONV and the efficient implementation in AVX-512 instructions as an example. There are three kernels depicted in dark blue, green and light pink. To do efficient FMA, multiple kernel values are packed into one ZMM register and reused to multiply with different input values and accumulate to output values in different ZMM registers.
We use the conventional notation NCHW to describe the
default data layout, which means the input and output are 4-D
tensors with batch size N, number of channels C, feature map
height H, feature map width W, where N is the outermost and
W is the innermost dimension of the data. The corresponding
kernel layout is KCRS, in which K, C, R and S stand for the
output channel, input channel, kernel height and kernel width,
respectively.
Following the common practice [25, 42], we organize the
feature map layout as NCHW[x]c for better memory access
patterns, in which c is a sub-dimension split from the channel
super-dimension C, and the number x indicates the split size of
the sub-dimension (i.e. #channels = sizeof(C) × sizeof(c),
where sizeof(c) = x). The output has the same layout NCHW[y]c
as the input, while the split factor can be different.
Correspondingly, the convolution kernel is organized as
KCRS[x]c[y]k, in which c with split size x and k with split
size y are the sub-dimensions of the input channel C and the
output channel K, respectively. It is worth noting that a
significant amount of data transformation overhead needs to be
paid to get the desired layout.
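As an illustration of this packing (a sketch in NumPy, not our actual implementation), the transformation from NCHW and KCRS to NCHW[x]c and KCRS[x]c[y]k amounts to reshapes and transposes, assuming the channel counts are divisible by the split factors; the function names are ours:

import numpy as np

def pack_nchw(data, x):
    """NCHW -> NCHW[x]c: split C into C//x outer blocks of x channels each."""
    n, c, h, w = data.shape
    assert c % x == 0
    # (N, C//x, x, H, W) -> (N, C//x, H, W, x)
    return data.reshape(n, c // x, x, h, w).transpose(0, 1, 3, 4, 2)

def pack_kcrs(kernels, x, y):
    """KCRS -> KCRS[x]c[y]k: split input channel C by x and output channel K by y."""
    k, c, r, s = kernels.shape
    assert k % y == 0 and c % x == 0
    # (K//y, y, C//x, x, R, S) -> (K//y, C//x, R, S, x, y)
    return kernels.reshape(k // y, y, c // x, x, r, s).transpose(0, 2, 4, 5, 3, 1)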
In addition to the dimension reordering, to better utilize the
latest vectorization instructions (e.g. AVX-512, AVX2, NEON,
etc.), we split out_width into ow_outer and ow_inner using a
factor reg_n and move the ow_inner loop inside for register
blocking. For example, on a CPU featuring AVX-512, we can
utilize its 32 512-bit registers ZMM0–ZMM31 [26] as follows.
We maintain the loop hierarchy so that one ZMM register stores
the kernel data while the others hold the feature map and output
values. The kernel values stored in one ZMM register (up to 512
bits, i.e. 16 output channels in float32) are used to multiply
with a number of input feature map values stored contiguously
in DRAM via AVX-512F instructions [26], and the results are
then accumulated into other ZMM registers storing the output
values. Figure 1 illustrates this idea. For other vectorized
instructions, the same idea applies but the split factor of
out_width (i.e. reg_n) may change.
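The register-blocked inner kernel can be sketched as follows, again in plain NumPy as a stand-in for the vectorized instructions: the y-wide kernel vector plays the role of the single ZMM register holding kernel values, and the reg_n accumulators of width y correspond to the ZMM registers holding outputs. The function and variable names are ours, and the batch dimension is dropped for brevity.

import numpy as np

def conv_block(data_packed, kern_packed, out_packed,
               oc_outer, oh, ow_outer, reg_n, x, y):
    """Compute one register-blocked tile of reg_n output pixels * y output channels.
    data_packed: (C//x, H, W, x)          -- NCHW[x]c, batch dim dropped
    kern_packed: (K//y, C//x, R, S, x, y) -- KCRS[x]c[y]k
    out_packed:  (K//y, OH, OW, y)        -- NCHW[y]c, batch dim dropped
    """
    _, c_outer, r, s, _, _ = kern_packed.shape
    # reg_n accumulators of width y, analogous to ZMM registers holding outputs
    acc = np.zeros((reg_n, y), dtype=out_packed.dtype)
    for co in range(c_outer):
        for kh in range(r):
            for kw in range(s):
                for ci in range(x):
                    # one vector of y kernel values, reused for all reg_n pixels
                    kvec = kern_packed[oc_outer, co, kh, kw, ci]
                    for ow_inner in range(reg_n):
                        ow = ow_outer * reg_n + ow_inner
                        # scalar input value times the kernel vector, accumulated:
                        # roughly one vectorized FMA per update
                        acc[ow_inner] += data_packed[co, oh + kh, ow + kw, ci] * kvec
    out_packed[oc_outer, oh, ow_outer * reg_n: ow_outer * reg_n + reg_n] += acc

In the real implementation, each y-wide multiply-accumulate presumably maps to AVX-512F FMA instructions on ZMM registers rather than NumPy arithmetic; the sketch only shows the loop structure and data reuse.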
Algorithm 1 summarizes our optimization of CONV in a single
thread, which essentially consists of 1) dimension ordering for
friendly memory locality and 2) register blocking for good
utilization of vectorization instructions, as in previous works.
However, unlike others, we made it a template in a high-level
language (see supplementary material), in which the block sizes
(x, y), the number of utilized registers (reg_n), and the
loop-unroll strategy (unroll_ker) are easily configurable.
Consequently, the computing logic can be adjusted according to
different CPU architectures (cache size, vector register width,
etc.) as well as different workloads (feature map size,
convolution kernel size, etc.). This flexibility enables the
graph-level optimization we will discuss later.
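As a hypothetical illustration of this configurability (the parameter names and values below are ours, not taken from the actual template), the tunable knobs could be exposed as a small per-target configuration:

# Hypothetical per-target configurations for the CONV template; the values are
# illustrative only and would normally be chosen per architecture and workload.
conv_configs = {
    "avx512": {"x": 16, "y": 16, "reg_n": 8, "unroll_ker": True},
    "neon":   {"x": 4,  "y": 4,  "reg_n": 4, "unroll_ker": False},
}

def instantiate_conv(template, target):
    """Instantiate a CONV template with the blocking and unrolling knobs."""
    cfg = conv_configs[target]
    return template(x=cfg["x"], y=cfg["y"],
                    reg_n=cfg["reg_n"], unroll_ker=cfg["unroll_ker"])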
3.1.2 Thread-level parallelization
It is a common practice to partition CONV into disjoint pieces
and parallelize them among multiple cores of a modern CPU.
Kernel libraries like Intel MKL-DNN usually use off-the-shelf
multi-threading solutions such as OpenMP. However, we observe
that the scalability of such off-the-shelf parallelization
solutions is not desirable (Section 4.2.4).
Therefore, we implemented a customized thread pool to
efficiently process this kind of embarrassingly parallel workload.