TaxoGen: Unsupervised Topic Taxonomy Construction by
Adaptive Term Embedding and Clustering
Chao Zhang¹, Fangbo Tao², Xiusi Chen³, Jiaming Shen¹, Meng Jiang⁴,
Brian Sadler⁵, Michelle Vanni⁵, and Jiawei Han¹
¹ Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
² Facebook Inc., Menlo Park, CA, USA
³ Dept. of Computer Science and Technology, Peking University, Beijing, China
⁴ Dept. of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA
⁵ U.S. Army Research Laboratory, Adelphi, MD, USA
¹ {czhang82, js2, hanj}@illinois.edu
² fangbo.tao@gmail.com
³ xiusi0721@gmail.com
⁴ mjiang2@nd.edu
⁵ {brian.m.sadler6.civ, michelle.t.vanni.civ}@mail.mil
ABSTRACT
Taxonomy construction is not only a fundamental task for semantic
analysis of text corpora, but also an important step for applications
such as information filtering, recommendation, and Web search.
Existing pattern-based methods extract hypernym-hyponym term
pairs and then organize these pairs into a taxonomy. However, by
considering each term as an independent concept node, they
overlook the topical proximity and the semantic correlations among
terms. In this paper, we propose a method for constructing topic
taxonomies, wherein every node represents a conceptual topic and
is defined as a cluster of semantically coherent concept terms. Our
method, TaxoGen, uses term embeddings and hierarchical clustering
to construct a topic taxonomy in a recursive fashion. To ensure
the quality of the recursive process, it consists of: (1) an adaptive
spherical clustering module for allocating terms to proper levels
when splitting a coarse topic into fine-grained ones; (2) a local
embedding module for learning term embeddings that maintain
strong discriminative power at different levels of the taxonomy. Our
experiments on two real datasets demonstrate the effectiveness of
TaxoGen compared with baseline methods.
ACM Reference Format:
Chao Zhang, Fangbo Tao, Xiusi Chen, Jiaming Shen, Meng Jiang, Brian
Sadler, Michelle Vanni, and Jiawei Han. 2018. TaxoGen: Unsupervised
Topic Taxonomy Construction by Adaptive Term Embedding and Clustering.
In KDD 2018: 24th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, August 19–23, 2018, London, United Kingdom.
ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3219819.3220064
1 INTRODUCTION
Automatic taxonomy construction from a text corpus is a fundamental
task for semantic analysis of text data and plays an important
role in many applications. For example, organizing a massive news
[Figure 1: An example topic taxonomy. Each node is a cluster of
semantically coherent concept terms representing a conceptual topic.
Nodes shown: Computer Science (computer_science, computation_time,
algorithm, computation, computation_approach); Information Retrieval
(information_retrieval, ir, information_filtering, text_retrieval,
retrieval_effectiveness); Machine Learning (machine_learning,
learning_algorithms, clustering, reinforcement_learning,
classification); …]
corpus into a well-structured taxonomy allows users to quickly
navigate to topics of interest and easily acquire useful information.
As another example, many recommender systems involve
items with textual descriptions, and a taxonomy for these items
can help the system better understand user interests and make more
accurate recommendations [32].
Existing methods mostly generate a taxonomy wherein each
node is a single term representing an independent concept [13, 18].
They use pre-defined lexico-syntactic patterns (e.g., A such as B,
A is a B) to extract hypernym-hyponym term pairs, and then
organize these pairs into a concept taxonomy by considering each
term as a node. Although they can achieve high precision for the
extracted hypernym-hyponym pairs, considering each term as an
independent concept node causes three critical problems for the
taxonomy: (1) low coverage: since term correlations are not
considered, only the pairs exactly matching the pre-defined patterns
are extracted, which leads to low coverage of the resulting taxonomy;
(2) high redundancy: as one concept can be expressed in different
ways, the taxonomy is highly redundant because many nodes are
just different expressions of the same concept (e.g., ‘information
retrieval’ and ‘ir’); (3) limited informativeness: representing a node
with a single term provides limited information about the concept
and causes ambiguity.
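To make the pattern-based approach and its first two problems concrete, the following is a minimal sketch in Python. It is not the implementation of any cited system: the two regular expressions, the helper extract_pairs, and the toy corpus are all assumptions introduced here for illustration, and terms are assumed to be pre-tokenized into underscore-joined phrases as in Figure 1.

```python
import re

# Assumption: a term is one lowercase token, with multi-word phrases
# already joined by underscores (e.g., "information_retrieval").
TERM = r"[a-z_]+"

# Two simplified, hypothetical lexico-syntactic patterns.
# Each entry maps a regex match to a (hypernym, hyponym) pair.
PATTERNS = [
    # "A such as B": A is the hypernym, B the hyponym.
    (re.compile(rf"({TERM}) such as ({TERM})"),
     lambda m: (m.group(1), m.group(2))),
    # "A is a B": B is the hypernym, A the hyponym. Taking the token
    # right after "is a" as B is a crude simplification.
    (re.compile(rf"({TERM}) is an? ({TERM})"),
     lambda m: (m.group(2), m.group(1))),
]


def extract_pairs(sentences):
    """Return the set of (hypernym, hyponym) pairs matched by PATTERNS."""
    pairs = set()
    for sentence in sentences:
        text = sentence.lower()
        for pattern, to_pair in PATTERNS:
            for match in pattern.finditer(text):
                pairs.add(to_pair(match))
    return pairs


# Toy corpus (made up for this sketch).
corpus = [
    "Fields of computer_science such as information_retrieval are growing.",
    "Many areas of computer_science such as ir rely on indexing.",
    "reinforcement_learning is a machine_learning paradigm.",
    "text_retrieval belongs to the broader field of information_retrieval.",
]

pairs = extract_pairs(corpus)

# Every distinct term in an extracted pair becomes its own taxonomy node.
nodes = {term for pair in pairs for term in pair}

for hypernym, hyponym in sorted(pairs):
    print(f"{hypernym} -> {hyponym}")
```

In this toy run, the last sentence relates text_retrieval to information_retrieval but matches neither pattern, so text_retrieval never enters the taxonomy (low coverage). Meanwhile, information_retrieval and ir end up as two separate nodes under computer_science even though they name the same concept (high redundancy).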