2021 Global Reliability and Prognostics and Health Management
(PHM-Nanjing)
A One-Dimensional Vision Transformer with Multi-
scale Convolution Fusion for Bearing Fault Diagnosis
Chaoyang Weng
School of Mechanical Engineering
Nanjing University of
Science and Technology
Nanjing, China
wengcy@njust.edu.cn
Baochun Lu
School of Mechanical Engineering
Nanjing University of
Science and Technology
Nanjing, China
lbcnust@sina.com
Jiachen Yao
School of Mechanical Engineering
Nanjing University of
Science and Technology
Nanjing, China
791344334@qq.com
Abstract—Aiming at the problem that traditional convolutional
neural network (CNN) based fault diagnosis methods cannot
capture the temporal information in rolling bearing signals, a
one-dimensional Vision Transformer with Multiscale Convolution
Fusion (MCF-1DViT) is proposed in this paper. To automatically
and effectively enrich the multiscale features extracted from the
collected vibration signals, a multiscale convolution fusion (MCF)
layer is designed to capture fault features at multiple time scales.
Then, an improved Vision Transformer architecture is introduced to
learn long-term temporal information, which can significantly
improve diagnosis accuracy and anti-noise ability. Finally,
experiments on a popular rolling bearing dataset are conducted to
validate the proposed method. The results show that the proposed
method achieves superior diagnosis performance compared with
existing methods.
Keywords- bearing fault diagnosis; one-dimensional; Vision
Transformers; multiscale; self-attention
I. INTRODUCTION
Rotating machinery has been widely used in modern industry.
In most cases, rotating machinery needs to work in harsh
environments and complex working conditions, which will lead
to various faults [1]. As a key component of rotating machinery,
rolling bearings account for 30% of all failures of rotating
components [2]. Failure of a rolling bearing can cause huge
economic losses and, in severe cases, even endanger the safety of
operators [3]. Therefore, it is necessary to develop an effective
intelligent bearing fault diagnosis method.
Recently, deep learning technologies, as an effective method
of automatic feature extraction and classification, have been
widely applied in many fields such as machine vision and speech
recognition [4, 5]. Because they can automatically learn
high-level representations of the inputs without manual feature
extraction, deep learning models have also been applied to fault
diagnosis, including deep belief networks (DBNs) [6],
convolutional neural networks (CNN) [7] and residual
convolutional networks (ResNet) [8]. Among these methods, the
CNN is a typical deep learning architecture based on a special
form of the multilayer perceptron, which uses convolution and
pooling operations to process shift-invariant data [9]. Consequently,
many scholars have used CNN for bearing fault diagnosis. For
example, Chen et al. [7] used map representations of Cyclic
Spectral Coherence as the input of a CNN and greatly improved
the recognition performance of bearing faults. Wang et al. [10]
combined symmetrized dot pattern with CNN for intelligent
bearing fault diagnosis. Wen et al. [11] eliminated the effect of
manual features by converting the signal directly into
two-dimensional (2D) images and feeding them into a novel
CNN-based method. However, vibration signals are usually
one-dimensional (1D) time-domain signals. Therefore, Zhang et
al. [12] used raw vibration signals as the input of a deep CNN with
wide first-layer kernels and obtained better robustness in complex
environments.
Huang et al. [13] added kernels of different scales to the first layer
of a CNN, which adaptively obtains distinguishable information at
multiple time scales. Liu et al. [14] proposed a multi-scale
kernel-based residual CNN architecture to capture fault
features. In summary, CNN-based methods can extract highly
localized features via their kernels and achieve a certain degree of
fault recognition accuracy. However, these methods do not
exploit information about the relative or absolute position of
samples within the entire raw vibration signal sequence.
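The multi-scale kernel idea discussed above can be illustrated with a minimal sketch. It is not the architecture of [13], [14] or of the proposed MCF layer: the function names, the kernel sizes (3, 5, 7) and the fixed moving-average weights are assumptions chosen for illustration, whereas a real network would learn its kernel weights. The sketch only shows that kernels of different widths respond to the same impulse over different time spans, and that each output depends on a local window only.

```python
def conv1d_same(signal, kernel):
    """Cross-correlate a 1D signal with an odd-length kernel, zero-padded to keep the length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def multiscale_features(signal, kernel_sizes=(3, 5, 7)):
    """Return one feature channel per kernel size.

    Moving-average kernels stand in for learned weights (an assumption).
    """
    channels = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k
        channels.append(conv1d_same(signal, kernel))
    return channels


# Toy "vibration" signal: a single impulse at index 4.
signal = [0.0] * 9
signal[4] = 1.0

features = multiscale_features(signal)
for k, channel in zip((3, 5, 7), features):
    print(k, [round(v, 3) for v in channel])
```

In this toy case the impulse influences 3, 5 and 7 output samples respectively, so each channel summarizes the signal at a different time scale, but no output sees beyond its own window.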
Unlike CNN-based methods, which typically use filters with a local
receptive field, a new type of deep learning model called the
Transformer [15] has been proposed to relate spatially distant
concepts through self-attention in token space. Self-attention
can capture long-range relationships between the elements of a
sequence and judiciously allocate computation by attending to
important regions instead of treating all points equally [16].
Thus, Transformers are currently considered state-of-the-art
models for sequential data, especially in natural language
processing (NLP). Inspired by the major success of
Transformer architectures in NLP, researchers have
introduced Transformers to computer vision tasks. In particular,
the Vision Transformer (ViT) [17] was proposed to perform
classification by mapping a sequence of image patches to a
semantic label. The Transformer employed by the ViT can
process different regions of the image and integrate information
across the entire image.
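As a rough illustration of how self-attention relates distant elements, the sketch below cuts a toy 1D signal into patches (tokens) and applies scaled dot-product self-attention. It is a simplified sketch under stated assumptions, not the Transformer of [15] or the ViT of [17]: the helper names and the patch length are hypothetical, and the learned query, key and value projections, multiple heads and position embeddings are all omitted (Q = K = V = tokens).

```python
import math


def split_into_patches(signal, patch_len):
    """Cut a 1D signal into non-overlapping patches; a trailing remainder is dropped."""
    n = len(signal) // patch_len
    return [signal[i * patch_len:(i + 1) * patch_len] for i in range(n)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    d = len(tokens[0])
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs


# Two similar bursts at opposite ends of the signal.
signal = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
tokens = split_into_patches(signal, patch_len=2)
weights, outputs = self_attention(tokens)

# Attention of the first token over all four tokens.
print([round(w, 3) for w in weights[0]])
```

Here the first token attends to the last token as strongly as to itself, although they lie at opposite ends of the sequence, while the two silent tokens in between receive less weight. A convolution kernel of width 2 could not relate the two bursts in a single layer.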
These observations suggest that the ViT has great potential for
intelligent fault diagnosis. However, Transformers lack some
inductive biases inherent to CNNs, such as translation
equivariance and locality, and therefore do not generalize well
when trained on an insufficient amount of data. Moreover,
vibration signals are 1D time-domain signals, and 2D images
reshaped from the raw vibration signals cannot reflect the inherent
vibration information [18], which makes it difficult to learn
meaningful fault features directly.
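One possible way to see the mismatch between a 1D signal and a reshaped 2D image is sketched below. This is an illustrative reading rather than the analysis given in [18]; the function name and the 4 x 4 image size are assumptions. The sketch folds a sequence of time indices row by row and compares the time gap between pixels that are neighbours in the image with the time gap between pixels that are far apart in it.

```python
def reshape_to_image(signal, width):
    """Fold a 1D sequence row by row into a 2D list with the given row width."""
    return [signal[i:i + width] for i in range(0, len(signal), width)]


# Use the sample indices themselves as values so that time order stays visible.
time_index = list(range(16))
image = reshape_to_image(time_index, width=4)

# Vertically adjacent pixels are a full row apart in time.
vertical_gap = image[1][0] - image[0][0]   # 4 samples

# The end of one row and the start of the next are consecutive in time,
# yet they sit on opposite sides of the image.
wrap_gap = image[1][0] - image[0][3]       # 1 sample

for row in image:
    print(row)
print(vertical_gap, wrap_gap)
```

In this toy layout, spatial adjacency in the image does not correspond to temporal adjacency in the signal, which is consistent with processing the vibration signal directly in 1D.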
2021 Global Reliability and Prognostics and Health Management (PHM-Nanjing) | 978-1-6654-0131-9/21/$31.00 ©2021 IEEE | DOI: 10.1109/PHM-Nanjing52125.2021.9612919