Learning Conceptual-Contextual Embeddings for Medical Text

Xiao Zhang^1∗, Dejing Dou^3,4, Ji Wu^1,2
^1 Department of Electronic Engineering, Tsinghua University
^2 Institute for Precision Medicine, Tsinghua University
^3 Department of Computer and Information Science, University of Oregon
^4 Baidu Research
xzhang19@mails.tsinghua.edu.cn, dou@cs.uoregon.edu, doudejing@baidu.com, wuji_ee@mail.tsinghua.edu.cn

∗ Work done while visiting the University of Oregon
Abstract
External knowledge is often useful for natural language understanding tasks. We introduce a contextual text representation model called Conceptual-Contextual (CC) embeddings, which incorporates structured knowledge into text representations. Unlike entity embedding methods, our approach encodes a knowledge graph into a context model. CC embeddings can be easily reused for a wide range of tasks in a similar fashion to pre-trained language models. Our model effectively encodes the large UMLS knowledge base by leveraging semantic generalizability. Experiments on electronic health records (EHRs) and medical text processing benchmarks show that our model gives a major boost to the performance of supervised medical NLP tasks.
Introduction
External knowledge is often useful for language understanding tasks. Especially in specialized domains like medicine, models are unlikely to attain human-level performance in text understanding without referring to external domain knowledge. Ontologies and knowledge graphs are the most common forms of domain knowledge, but because of their structured nature, it is not straightforward to incorporate them into representation-based neural models.
Current approaches usually bridge text and knowledge graphs with retrieval: triplets or entities are retrieved based on occurrences of the text tokens in the entity descriptions. After retrieval, triplets can be treated as text sequences and provided to the model as extra input (Mihaylov and Frank 2018). Another method is to use the corresponding entity embeddings from a graph embedding model trained on the knowledge graph (Huang et al. 2019); however, one still needs to address the alignment between entity embeddings and text representations.
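As a rough illustration of the first, retrieval-based strategy, the sketch below scores knowledge-graph triplets by token overlap with the input text and verbalizes them as extra model input. The overlap scoring and the triplet format are simplifying assumptions for exposition, not the cited authors' exact procedures.

```python
# Illustrative sketch of retrieval-based knowledge injection (hypothetical scoring).
# Triplets whose entity names share tokens with the input text are retrieved and
# verbalized so they can be appended to the model input as extra text.

def retrieve_triplets(text_tokens, knowledge_graph, top_k=3):
    """knowledge_graph: list of (head, relation, tail) with string entity names."""
    text_vocab = set(t.lower() for t in text_tokens)
    scored = []
    for head, relation, tail in knowledge_graph:
        # Score a triplet by token overlap between the text and its entity names.
        entity_tokens = set(head.lower().split()) | set(tail.lower().split())
        overlap = len(text_vocab & entity_tokens)
        if overlap > 0:
            scored.append((overlap, (head, relation, tail)))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [triplet for _, triplet in scored[:top_k]]

def verbalize(triplet):
    head, relation, tail = triplet
    return f"{head} {relation.replace('_', ' ')} {tail}"

# Example: append retrieved, verbalized knowledge to the model input.
kg = [("cortisone", "may_prevent", "rheumatoid arthritis")]
tokens = "cardiovascular involvement in rheumatoid arthritis".split()
extra_input = " ; ".join(verbalize(t) for t in retrieve_triplets(tokens, kg))
```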
In this paper, we take a novel approach that brings external knowledge into the realm of text representation learning. Word embedding models like skip-gram (Mikolov et al. 2013a) and contextual embedding models like BERT (Devlin et al. 2018) have demonstrated the crucial role of good text
representations in NLP tasks. Our model incorporates external knowledge into text representations, which makes the knowledge easy to apply and robust to variations in how concepts are expressed in text.
Our model, which we term Conceptual-Contextual (CC) embeddings, is a contextual text representation model similar to BERT. Instead of providing general text representations, CC embeddings are specifically designed to be “concept aware”: the model is trained to recognize concept and entity names in text and to produce representations of those concepts and entities. Knowledge from knowledge graphs is encoded in these representations, which can then be readily used in NLP tasks. Like other contextual representation models, the CC embedding model can be used to generate embeddings as features or fine-tuned for a supervised learning task.
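Both usage modes follow the standard encoder-plus-classifier pattern. The sketch below assumes a hypothetical `cc_model` encoder with a BERT-like interface (token ids in, contextual vectors out); the class and argument names are illustrative, not a released API.

```python
import torch.nn as nn

# Hypothetical interface: `cc_model` is a pretrained CC embedding encoder that maps
# token ids to contextual vectors of size `hidden`, analogous to a BERT encoder.

class TaskModel(nn.Module):
    def __init__(self, cc_model, hidden, num_labels, finetune=True):
        super().__init__()
        self.cc_model = cc_model
        if not finetune:
            # Feature-based use: freeze the CC encoder and train only the classifier.
            for p in self.cc_model.parameters():
                p.requires_grad = False
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, token_ids):
        states = self.cc_model(token_ids)   # (batch, seq_len, hidden)
        pooled = states.mean(dim=1)         # simple mean pooling over tokens
        return self.classifier(pooled)
```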
The rest of the paper is organized as follows: we first formulate our approach and discuss why it is particularly relevant to the medical domain. We then detail our model and the process of encoding a large knowledge graph into contextual representations. Finally, we evaluate on several tasks to validate the effectiveness of our CC embeddings.
Methodology
Model
[Figure 1: Encoding concept mentions in text. Mentions of “Cortisone” and “Rheumatoid Arthritis” in two example sentences are encoded into representations of UMLS concepts (C0003873, C0010137), which are connected by the relation may_prevent and composed as concept_a + relation = concept_b.]
The core component of the CC embedding model is an encoder that encodes structured knowledge. The encoder takes a written form of a concept as input and outputs a vector representation of that concept. The idea is illustrated in Figure 1.
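A minimal sketch of such a concept encoder is given below. The BiLSTM-with-mean-pooling architecture and the translation-style composition mentioned in the final comment are illustrative assumptions for exposition, not necessarily the architecture adopted later in the paper.

```python
import torch.nn as nn

# Minimal sketch of a concept encoder: the written form of a concept (e.g. "cortisone")
# is tokenized and encoded into a single vector. The BiLSTM + mean pooling here is an
# illustrative choice, not necessarily the model used in the paper.

class ConceptEncoder(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                    # (batch, seq_len)
        states, _ = self.rnn(self.embed(token_ids))  # (batch, seq_len, dim)
        return states.mean(dim=1)                    # one vector per concept mention

# As suggested by Figure 1, a translation-style objective (an assumption here) would
# relate encoded concepts through relation vectors, e.g. enc("cortisone") + r_may_prevent
# lying close to enc("rheumatoid arthritis").
```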