A Novel Topic Model for Automatic Term Extraction
Sujian Li¹, Jiwei Li¹, Tao Song¹, Wenjie Li², Baobao Chang¹
¹ Key Laboratory of Computational Linguistics (Peking University), Ministry of Education,
School of Electronics Engineering and Computer Science, CHINA
² The Innovative Intelligent Computing Center,
The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, CHINA
{lisujian, bdlijiwei, stao, chbb}@pku.edu.cn; cswjli@comp.polyu.edu.hk
ABSTRACT
Automatic term extraction (ATE) aims at extracting domain-
specific terms from a corpus of a certain domain. Termhood is one
essential measure for judging whether a phrase is a term. Previous
research on termhood has mainly depended on word frequency
information. In this paper, we propose to compute termhood
based on semantic representation of words. A novel topic model,
namely i-SWB, is developed to map the domain corpus into a
latent semantic space, which is composed of some general topics,
a background topic and a documents-specific topic. Experiments
on four domains demonstrate that our approach outperforms the
state-of-the-art ATE approaches.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis
and Indexing – linguistic processing, Thesauruses.
General Terms
Algorithms, Experimentation.
Keywords
Term Extraction, Topic Model, Termhood.
1. INTRODUCTION
So far, most research on automatic term extraction has been
guided by two essential measures defined by [6], namely unithood
and termhood. Unithood examines syntactic formation of terms or
the degree (or significance) of the association among the term
constituents. Termhood, on the other hand, aims to capture the
semantic relatedness of a term to a domain concept. However,
there is no uniform definition of semantic relatedness, and
how to compute termhood remains an open problem.
Previous research has attempted to measure termhood by
applying several statistical measures within a domain or across
domains, such as TF-IDF, C-value/NC-value [5], co-occurrence
[4] and inter-domain entropy [2]. These statistical measures often
ignore the informative words with very high frequency or very
low frequency and do not take into account the semantics carried
by terms. Take the term “NRZ electrical input” in the electrical
engineering domain for example: “NRZ” occurs in only a few
documents, while “electrical” occurs frequently in many documents.
When TF-IDF is used to measure termhood, both “NRZ” and
“electrical” receive low scores, which in turn causes the term
“NRZ electrical input” to have low termhood. Clearly,
frequency-based measures shut many real terms out. In fact, a
domain is described semantically from various aspects. Again,
take the electrical engineering domain for example. Words like
“input” emphasize a specific topic in the domain, while words
like “electrical” provide the
background of that domain. There also exists a cluster of words
like “NRZ” which occur in the corpus infrequently, but tend to
occur in a few documents frequently. Such words can reflect
some special characteristics of the domain. Based on these
observations, we argue that three semantic aspects can be used in
the representation of words: Domain background words (e.g.
electrical) describe the domain in general. Domain topic words
(e.g. input) represent a certain topic in a given domain. Domain
documents-specific words (e.g. NRZ) are specific to a small
number of documents and exhibit the characteristics of the
domain. We assume that a term can be recognized by identifying
whether its constituent words belong to some of the three
semantic aspects.
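The high-frequency failure mode described above can be made concrete: a word appearing in every document receives an inverse document frequency of zero, no matter how domain-relevant it is. The following is a minimal sketch on a hypothetical toy corpus (it illustrates only the high-document-frequency side of the problem):

```python
import math
from collections import Counter

# Hypothetical toy corpus: "electrical" is a background word appearing in
# every document; "NRZ" is concentrated in a single document.
docs = [
    "electrical input signal design".split(),
    "electrical circuit design power".split(),
    "electrical power supply circuit".split(),
    "NRZ NRZ NRZ electrical input".split(),
]

N = len(docs)
tf = Counter()   # corpus-level term frequency
df = Counter()   # document frequency
for d in docs:
    tf.update(d)
    df.update(set(d))

def tfidf(w):
    # Corpus-level TF-IDF: total frequency weighted by inverse document frequency.
    return tf[w] * math.log(N / df[w])

# "electrical" occurs in all 4 documents, so idf = log(4/4) = 0 and its
# score is 0, even though it is an informative domain background word.
print(tfidf("electrical"))   # 0.0
print(tfidf("NRZ"))
```

Under this scoring, no amount of raw frequency can rescue a word that is spread across the whole corpus, which is exactly the background-word case the three-aspect view is meant to handle.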
As for semantic representation of words, unsupervised topic
models have shown their advantages [1] [3]. Latent Dirichlet
Allocation (LDA) is a well-known example of such models. It
posits that each document can be seen as a mixture of latent topics
and each topic as a distribution over a given vocabulary. To
trade off generality and specificity of words, Chemudugunta et al.
[3] further defined the special words with background (SWB)
model that allowed words to be modeled as originating from
general topics, or document-specific topics, or a corpus-wide
background topic. This existing work shows that topic models are
well suited to the semantic representation of words. However, to
our knowledge, no prior work has introduced such kind of
semantic representation to term extraction.
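The SWB-style generative story can be sketched as follows. This is an illustration, not the authors' implementation: a per-token switch variable chooses whether a word is drawn from a corpus-wide background distribution, from one of the general topics (via the document's topic mixture), or from a document-specific distribution. All distributions and the vocabulary here are hypothetical.

```python
import random

random.seed(7)

vocab = ["electrical", "input", "signal", "NRZ", "jitter"]
background = [0.6, 0.1, 0.1, 0.1, 0.1]           # corpus-wide background topic
topics = [[0.05, 0.45, 0.45, 0.025, 0.025]]      # general topics (one shown)
doc_specific = [0.05, 0.05, 0.05, 0.45, 0.4]     # this document's special words

def generate_token(theta, switch=(0.3, 0.5, 0.2)):
    # switch: probabilities of (background, general, document-specific)
    x = random.choices(["background", "general", "specific"], weights=switch)[0]
    if x == "background":
        dist = background
    elif x == "general":
        # Pick a general topic z from the document's mixture theta,
        # then draw the word from that topic's distribution.
        z = random.choices(range(len(topics)), weights=theta)[0]
        dist = topics[z]
    else:
        dist = doc_specific
    return random.choices(vocab, weights=dist)[0]

doc = [generate_token(theta=[1.0]) for _ in range(10)]
print(doc)
```

In SWB the switch is a simple multinomial per token; the i-SWB model described next replaces this with a mechanism conditioned on document frequency and topic information.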
Inspired by Chemudugunta’s idea of generality and specificity [3],
in this paper we propose a novel topic model, namely i-SWB, to
model the three suggested semantic aspects. In i-SWB, three
kinds of topics, namely a background topic, general topics, and
a documents-specific topic, are correspondingly constructed to
generate the words in a domain corpus. Compared with
Chemudugunta’s SWB model, there are two main improvements
in i-SWB that tailor it to term extraction. First, specificity in
i-SWB is modeled at the corpus level: a single documents-specific
topic is set to identify a cluster of idiosyncratic words from the
whole corpus. Thus, i-SWB avoids the computational burden of
SWB, where the number of document-specific topics
grows linearly with the number of documents. Second, i-SWB
makes use of both document frequency (DF) and topic
information to control the generation of words, while SWB only
uses a simple multinomial variable to control which topic a word
is generated from. This improvement comes from the following
findings that have been verified in the experiments: the words
occurring in many documents and distributed over many general
SIGIR’13, July 28–August 1, 2013, Dublin, Ireland.
Copyright © 2013 ACM 978-1-4503-2034-4/13/07…$15.00.