bottleneck in Eqn. 10. We see that we have exactly the same compression term for each latent factor, I(X : Y_j), but the relevance variables now correspond to Z ≡ X_i. In other words, CorEx has multiple relevance terms, one for each word in the vocabulary, so that CorEx will prefer representations that are relevant for as many words as possible.
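As a toy illustration of these per-word relevance terms, the sketch below estimates one empirical mutual information value I(X_i ; Y_j) per word type from binary word-presence indicators. This is an illustrative sketch only, not the CorEx estimator; the corpus, labels, and the `mutual_information` helper are fabricated for the example.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information I(x; y) in nats for two discrete arrays."""
    n = len(x)
    pxy = Counter(zip(x, y))   # joint counts
    px = Counter(x)            # marginal counts for x
    py = Counter(y)            # marginal counts for y
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        # p_ab * log(p_ab / (p_a * p_b)), with counts substituted in
        mi += p_ab * np.log(p_ab * n * n / (px[a] * py[b]))
    return mi

# Toy corpus: binary presence of 3 word types across 8 documents,
# and one latent topic label Y_j per document (both fabricated).
X = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1],
              [0, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# One relevance term I(X_i ; Y_j) per word type: CorEx prefers latent
# factors that keep these terms high for as many words as possible.
relevance = [mutual_information(X[:, i], y) for i in range(X.shape[1])]
```

In this toy data, word type 0 perfectly predicts the topic label, so its relevance term equals the label entropy, log 2.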
Inspired by the success of the bottleneck, we suggest that if we want to learn representations that are more relevant to specific keywords, we can simply anchor a word type X_i to topic Y_j by constraining our optimization so that α_{i,j} = β_{i,j}, where β_{i,j} ≥ 1 controls the anchor strength. Otherwise, the updates on α remain the same. This schema is a natural extension of the CorEx optimization and it is flexible,
allowing for multiple word types to be anchored to
one topic, for one word type to be anchored to multi-
ple topics, or for any combination of these anchoring
strategies. Furthermore, it combines supervised and
unsupervised learning by allowing us to leave some
topics without anchors.
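The anchoring constraint can be sketched as a simple override of the learned connection weights: after the usual α update, the entries for anchored (word, topic) pairs are pinned to the anchor strength β ≥ 1. This is a minimal NumPy sketch under our reading of the scheme, not the authors' implementation; `apply_anchors` and the random α matrix are illustrative placeholders.

```python
import numpy as np

def apply_anchors(alpha, anchors, beta=2.0):
    """Pin the connection weight alpha[i, j] to the anchor strength beta
    for each anchored (word i, topic j) pair. All other entries are left
    untouched, so the unanchored alpha updates proceed as usual."""
    alpha = alpha.copy()
    for i, j in anchors:
        alpha[i, j] = beta
    return alpha

# Toy example: 5 word types, 2 topics; alpha as produced by an
# (unspecified) CorEx update, here just random values in [0, 1).
rng = np.random.default_rng(0)
alpha = rng.uniform(0.0, 1.0, size=(5, 2))

# Anchor word 0 to topic 0, and word 3 to both topics -- illustrating
# that one word type may be anchored to multiple topics.
anchored = apply_anchors(alpha, anchors=[(0, 0), (3, 0), (3, 1)], beta=2.0)

print(anchored[0, 0], anchored[3, 0], anchored[3, 1])  # all 2.0
```

Topics with no entry in `anchors` are learned fully unsupervised, matching the semi-supervised flexibility described above.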
3 Related Work
With respect to integrating domain knowledge into
topic models, we draw inspiration from Arora et
al., who used anchor words in the context of non-
negative matrix factorization (2012). Under an assumption of separability, these anchor words act as high-precision markers of particular topics and,
thus, help discern the topics from one another. Although the original algorithm proposed by Arora et al. and subsequent improvements to their approach
find these anchor words automatically (Arora et al.,
2013; Lee and Mimno, 2014), recent adaptations
allow manual insertion of anchor words and other
metadata (Nguyen et al., 2014; Nguyen et al., 2015).
Our work is similar to the latter in that we treat anchor words as fuzzy logic markers and embed them into the topic model in a semi-supervised fashion. In
this sense, our work is closest to Halpern et al., who
have also made use of domain expertise and semi-
supervised anchored words in devising topic models
(2014; 2015).
There is an adjacent line of work that has focused
on incorporating word-level information into LDA-
based models. Jagarlamudi et al. proposed SeededLDA, a model that seeds words into given topics
and guides, but does not force, these topics towards
these integrated words (2012). Andrzejewski and
Zhu presented two flavors of semi-supervised topic
models. The first makes use of “z-labels,” words
that are known to pertain to specific topics and that
are restricted to appearing in some subset of all the
possible topics (2009). Although z-labels can be leveraged to place different senses of a word into different topics, additional effort is required to determine when these different senses occur. Our anchoring approach allows a user to more easily anchor one
word to multiple topics, allowing CorEx to naturally
find topics that revolve around different senses of a
word.
The second model from Andrzejewski and Zhu
allows specification of Must-Link and Cannot-Link
relationships between words that help partition oth-
erwise muddled topics (Andrzejewski et al., 2009).
These logical constraints help enforce topic separa-
bility, though these mechanisms less directly address
how to anchor a single word or set of words to help a
topic emerge. More generally, the Must/Cannot link
and z-label topic models have been expressed in a
powerful first-order-logic framework that allows the
specification of arbitrary domain knowledge through
logical rules (Andrzejewski et al., 2011). Others
have built off this first-order-logic approach to au-
tomatically learn rule weights (Mei et al., 2014)
and incorporate additional latent variable informa-
tion (Foulds et al., 2015).
Mathematically, CorEx topic models most closely
resemble topic models based on latent tree recon-
struction (Chen et al., 2015). In Chen et al.'s analysis, their own latent tree approach and CorEx both
report significantly better perplexity than hierarchi-
cal topic models based on the hierarchical Dirichlet
process and the Chinese restaurant process. CorEx
has also been investigated as a way to find “surpris-
ing” documents (Hodas et al., 2015).
4 Data and Evaluation Methods
4.1 Data
To understand how to improve topic modeling through domain knowledge, we use two challenging datasets with corresponding domain knowledge lexicons. Our first dataset consists of 504,000 humanitarian assistance and disaster relief (HA/DR) arti-