input feature space.
The final type of deep learning architecture utilized in the RMDL model is the Recurrent Neural Network (RNN), in which outputs from the neurons are fed back into the network as inputs for the next step. Recent extensions of this architecture use Gated Recurrent Units (GRUs) [5] or Long Short-Term Memory (LSTM) units [24]; these units help mitigate the instability problems of the original network architecture. RNNs have been used successfully for natural language processing [25]. Recently, Z. Yang et al. in 2016 [26] developed hierarchical attention networks for document classification. These networks have two important characteristics: a hierarchical structure and an attention mechanism applied at both the word and sentence levels.
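To make the recurrent building block concrete, the following is a minimal sketch of an LSTM-based text classifier in Keras; the vocabulary size, sequence length, and layer widths are illustrative assumptions, not the configuration used in RMDL. Replacing the LSTM layer with a GRU layer yields the GRU variant.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 500        # assumed maximum document length in tokens
NUM_CLASSES = 10     # assumed number of target classes

model = Sequential([
    Input(shape=(MAX_LEN,)),                  # sequences of word indices
    Embedding(VOCAB_SIZE, 100),               # map word indices to dense vectors
    LSTM(128),                                # hidden state feeds back into the network each step
    Dense(NUM_CLASSES, activation="softmax"), # probability distribution over classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```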
Recent work has combined these three basic deep learning architectures into novel techniques with enhanced accuracy and robustness. M. Turan et al. in 2017 [7] and M. Liang et al. in 2015 [27] implemented innovative combinations of CNNs and RNNs called Recurrent Convolutional Neural Networks (RCNNs). K. Kowsari et al. in 2017 [1] introduced Hierarchical Deep Learning for Text classification (HDLTex), which combines these deep learning techniques in a hierarchical structure for document classification and improves accuracy over traditional methods. The work in this paper builds on these ideas, specifically on [1], to provide a more general approach to supervised learning for classification.
III. BASELINES
In this paper, we use both contemporary and traditional document and image classification techniques as our baselines. The baselines for image and text classification differ in their feature extraction and model structure; thus, the text and image classification baselines are described separately as follows:
A. Text Classification Baselines
The text classification techniques used as baselines to evaluate our model are as follows: standard deep models such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Deep Neural Networks (DNNs); two different techniques based on Support Vector Machines (SVMs); Naive Bayes Classification (NBC); and finally Hierarchical Deep Learning for Text Classification (HDLTex) [1].
1) Deep learning
The deep learning baseline used in this paper is deep learning without a hierarchical level; one of our baselines for text classification is the hierarchical attention network of [26]. In Section V (Methods), we describe the basic deep learning models (DNN, CNN, and RNN) that are used as components of the RMDL model.
2) Support Vector Machine (SVM)
The original version of the SVM was introduced by V. N. Vapnik and A. Ya. Chervonenkis [28] in 1963. In the early 1990s, a nonlinear version was introduced in [29].
Multi-class SVM:
The basic SVM performs binary classification, so for the multi-class case we need to build a multi-class SVM (MSVM). One-vs-One is one such technique; it requires building $N(N-1)/2$ binary classifiers, one for each pair of classes. The resulting optimization problem is:
\[
\min_{w,\,\zeta}\ \frac{1}{2}\sum_{k=1}^{K} w_k^{T} w_k \;+\; C \sum_{(x_i,\, y_i)\in D}\ \sum_{k \neq y_i} \zeta_i^{k} \tag{5}
\]
such that:
\[
w_{y_i}^{T} x_i - w_k^{T} x_i \;\geq\; 2 - \zeta_i^{k}, \qquad \zeta_i^{k} \geq 0, \qquad \forall\, k \neq y_i
\]
where $(x_i, y_i)$ is a training data point such that $(x_i, y_i) \in D$, $C$ is the penalty parameter, $\zeta$ is the slack parameter, $k$ stands for the classes, and $w$ denotes the learned parameters.
Another technique for multi-class classification using SVMs is One-against-All. Many feature extraction methods for SVMs have been addressed [32], but we use two techniques: word-sequence feature extraction [33] and term frequency-inverse document frequency (TF-IDF).
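As an illustration of this baseline, the sketch below builds a TF-IDF plus multi-class SVM pipeline with scikit-learn; the toy corpus and hyperparameters are assumptions for demonstration only. Note that scikit-learn's SVC implements the One-vs-One scheme internally.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder corpus and labels for demonstration.
train_docs = ["first training document", "second training document",
              "third training document"]
train_labels = [0, 1, 2]

# TF-IDF features feed a multi-class SVM; SVC trains the
# N(N-1)/2 pairwise binary classifiers of One-vs-One internally.
clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1.0))
clf.fit(train_docs, train_labels)
print(clf.predict(["an unseen test document"]))
```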
Stacking Support Vector Machines (SVMs): We use stacking SVMs as another baseline for comparison with RMDL on datasets that have hierarchical labels. The stacking SVM provides an ensemble of individual SVM classifiers and generally produces more accurate results than single-SVM models [34], [35].
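A minimal sketch of a stacking-SVM ensemble using scikit-learn's StackingClassifier follows; the base learners, meta-learner, and synthetic data are illustrative assumptions, not the exact configuration of [34], [35].

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC

# Synthetic multi-class data standing in for document features.
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Individual SVM base learners with different kernels.
base_learners = [
    ("svm_linear", SVC(kernel="linear", probability=True)),
    ("svm_rbf", SVC(kernel="rbf", probability=True)),
]
# A higher-level SVM combines the base learners' outputs.
stacked_svm = StackingClassifier(estimators=base_learners,
                                 final_estimator=SVC(kernel="linear"))
stacked_svm.fit(X, y)
print(stacked_svm.score(X, y))
```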
3) Naive Bayes Classification (NBC)
This technique has been used in industry and academia for a long time; it is the most traditional method of text categorization and is widely used in Information Retrieval [36]. Given $n$ documents to fit into $k$ categories, the predicted class as output is $c \in C$. Naive Bayes is a simple algorithm that uses Bayes' rule, described as follows:
\[
P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)} \tag{6}
\]
where $d$ is the document and $c$ indicates a class. The most probable class is then:
\[
c_{MAP} = \operatorname*{arg\,max}_{c \in C}\ P(d \mid c)\, P(c) \tag{7}
\]
The baseline of this paper is the word-level NBC [37]. Let $\hat{\theta}_j = P(w_j \mid c)$ be the parameter for word $j$; then
\[
P(c \mid d) \;\propto\; P(c) \prod_{j=1}^{|V|} \hat{\theta}_j^{\, x_j} \tag{8}
\]
where $x_j$ is the count of word $j$ in document $d$ and $|V|$ is the vocabulary size.
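For illustration, a word-level NBC baseline of this kind can be reproduced with scikit-learn's MultinomialNB, which estimates the class priors $P(c)$ and the per-word parameters of Eq. (8) from word counts; the toy corpus below is an assumption for demonstration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder corpus and labels for demonstration.
train_docs = ["good movie", "bad movie", "great film", "terrible film"]
train_labels = [1, 0, 1, 0]

# CountVectorizer produces the per-word counts x_j; MultinomialNB
# estimates P(c) and the per-word parameters theta_j of Eq. (8).
nbc = make_pipeline(CountVectorizer(), MultinomialNB())
nbc.fit(train_docs, train_labels)
print(nbc.predict(["good film"]))
```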