Large-Scale Hierarchical Text Classification with
Recursively Regularized Deep Graph-CNN
Hao Peng*, Jianxin Li*, Yu He*, Yaopeng Liu*, Mengjiao Bao*, Lihong Wang†, Yangqiu Song‡, Qiang Yang‡
*School of Computer Science & Engineering, Beihang University, Beijing, China
†National Computer Network Emergency Response Technical Team/Coordination Center of China
‡Department of Computer Science & Engineering, Hong Kong University of Science and Technology, Hong Kong
{penghao,lijx,heyu,liuyp,baomj}@act.buaa.edu.cn, wlh@isc.org.cn, {yqsong,qyang}@cse.ust.hk
ABSTRACT
Text classification to a hierarchical taxonomy of topics is a common and practical problem. Traditional approaches simply use bag-of-words and have achieved good results. However, when there are many labels with different topical granularities, the bag-of-words representation may not be enough. Deep learning models have proven effective at automatically learning different levels of representation for image data, so it is interesting to study what the best way to represent texts is. In this paper, we propose a graph-CNN based deep learning model that first converts texts to graphs-of-words and then uses graph convolution operations to convolve the word graph. The graph-of-words representation of texts has the advantage of capturing non-consecutive and long-distance semantics, while CNN models have the advantage of learning different levels of semantics. To further leverage the hierarchy of labels, we regularize the deep architecture with the dependencies among labels. Our results on both the RCV1 and NYTimes datasets show that we significantly improve large-scale hierarchical text classification over traditional hierarchical text classification methods and existing deep models.
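For a concrete picture of the graph-of-words step summarized above, the following minimal sketch (our own illustration, not the exact preprocessing used in this paper) turns a token sequence into a word co-occurrence graph with a sliding window; the window size and the toy sentence are assumptions made purely for illustration.

from collections import defaultdict

def graph_of_words(tokens, window=4):
    """Build an undirected co-occurrence graph: {word: {neighbor: count}}.

    Two words are linked whenever they appear within the same sliding window.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:  # words within the window after w
            if v != w:
                graph[w][v] += 1
                graph[v][w] += 1
    return graph

doc = "a great sandwich on the menu and the restaurant menu changes daily".split()
g = graph_of_words(doc)
print(sorted(g["menu"]))  # neighbors of "menu" include both "sandwich" and "restaurant"

Each unique word becomes a node, so non-consecutive occurrences of the same word (e.g., "menu") are merged, which is what allows distant words such as "restaurant" and "sandwich" to be connected through shared neighbors.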
1 INTRODUCTION
Topical text classification is a fundamental text mining problem for many applications, such as news classification [18], question answering [28], search result organization [11], online advertising [2], etc. When there are many labels, hierarchical categorization of texts has been recognized as a natural and effective way to organize texts, and it has been well studied in the past two decades [5, 6, 13, 30, 39, 43, 44]. Most of the above traditional approaches represent text as sparse lexical features such as bag-of-words (BOW) and/or n-grams due to their simplicity and effectiveness [1]. Different kinds of feature engineering, such as noun-phrases or key-phrases, have been shown to bring no significant improvement on their own, while majority voting over the results of different features is significantly better [36].
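For concreteness, such sparse lexical features amount to little more than counting words and consecutive word tuples; the short generic sketch below (our own simplification, not any specific system from the cited work) illustrates bag-of-words and n-gram features for a single document.

from collections import Counter

def bow_and_ngrams(text, n=2):
    """Return unigram (bag-of-words) counts and n-gram counts for one document."""
    tokens = text.lower().split()
    bow = Counter(tokens)
    ngrams = Counter(zip(*[tokens[i:] for i in range(n)]))  # consecutive n-word tuples
    return bow, ngrams

bow, bigrams = bow_and_ngrams("The menu lists a great sandwich and a great soup")
print(bow["great"], bigrams[("a", "great")])  # prints: 2 2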
Recently, deep learning has proven effective for end-to-end learning of hierarchical feature representations, and has made groundbreaking progress on object recognition in computer vision and on speech recognition [24]. Two popular deep learning architectures have attracted particular attention for text data, i.e., recurrent neural networks (RNNs) [3, 17]¹ and convolutional neural networks (CNNs) [8, 25]. RNNs are more powerful for short messages and for word-level syntax and semantics [3]. When they are applied to long documents, hierarchical RNNs can be developed [40]. However, hierarchical RNNs assume that documents and sentences provide natural boundaries defining the hierarchy, a constraint that only regular texts and formal languages satisfy. Different from RNNs, CNNs use convolutional masks to sequentially convolve over the data. For texts, a simple mechanism is to recursively convolve the nearby lower-level vectors in the sequence to compose higher-level vectors [8]. This way of using CNNs simply evaluates the semantic compositionality of consecutive words, which corresponds to the n-grams used in traditional text modeling [1]. As with images, such convolution can naturally represent the different levels of semantics exhibited by text data: higher levels represent semantics captured by larger n-grams.
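As a rough sketch of this mechanism (our illustration with toy random embeddings and arbitrary filter sizes, not the architecture proposed in this paper), stacking 1-D convolutions over a sequence of word vectors composes features over progressively larger effective n-grams.

import numpy as np

def conv1d(seq, filters, k):
    """Convolve a (length, dim) sequence with (k*dim, out_dim) filters.

    Each output position mixes k consecutive word vectors, i.e., a k-gram.
    """
    length, dim = seq.shape
    out = []
    for i in range(length - k + 1):
        window = seq[i:i + k].reshape(-1)            # concatenate k word vectors
        out.append(np.maximum(window @ filters, 0))  # linear map + ReLU
    return np.stack(out)

rng = np.random.default_rng(0)
words = ["it", "has", "a", "great", "sandwich"]
embeddings = rng.normal(size=(len(words), 8))                    # toy word vectors
layer1 = conv1d(embeddings, rng.normal(size=(3 * 8, 16)), k=3)   # 3-gram features
layer2 = conv1d(layer1, rng.normal(size=(2 * 16, 16)), k=2)      # each feature spans 4 words
print(layer1.shape, layer2.shape)                                # (3, 16) (2, 16)

With five input words, the first layer has three positions and the second has two, and each second-layer feature covers four consecutive words, i.e., a larger effective "n".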
For document-level topical classification of texts, the sequential information of words might not be as important as it is for language models [3] or sentiment analysis [45]. For example, when we write "I love this restaurant! I think it is good. It has a great sandwich. But the service may not be very efficient since there are always a lot of people...", we can easily identify its topic as "food," but sentiment analysis has to be conducted more carefully because of the word "but." For topic classification, the key words, phrases, and their composition are more important. In this case, rather than sequential information, non-consecutive phrases and long-distance word dependencies are more important for computing the composition of semantics. For example, in a document, the words "restaurant" and "sandwich" may not co-occur in a small window. However, "menu" may co-occur with both of them somewhere else in the document, and the composition of all three words is a very strong signal for classifying the document into "food"-related topics. Therefore, a more
¹ Here we ignore the discussion of recursive neural networks [38] since they require knowing the tree structure of the text, which is not as efficient as the other models when dealing with large-scale data.