Knowledge-driven Encode, Retrieve, Paraphrase for
Medical Image Report Generation
Christy Y. Li∗1, Xiaodan Liang†2, Zhiting Hu2, Eric P. Xing3
1Duke University, 2Carnegie Mellon University, 3Petuum, Inc.
yl558@duke.edu, {xiaodan1,zhitingh}@cs.cmu.edu, eric.xing@petuum.com
Abstract
Generating long and semantically coherent reports to describe
medical images poses great challenges in bridging visual
and linguistic modalities, incorporating medical domain
knowledge, and generating realistic and accurate descriptions.
We propose a novel Knowledge-driven Encode, Retrieve,
Paraphrase (KERP) approach which reconciles traditional
knowledge- and retrieval-based methods with modern
learning-based methods for accurate and robust medical re-
port generation. Specifically, KERP decomposes medical re-
port generation into explicit medical abnormality graph learn-
ing and subsequent natural language modeling. KERP first
employs an Encode module that transforms visual features
into a structured abnormality graph by incorporating prior
medical knowledge; then a Retrieve module that retrieves text
templates based on the detected abnormalities; and lastly, a
Paraphrase module that rewrites the templates according to
specific cases. The core of KERP is a proposed generic imple-
mentation unit—Graph Transformer (GTR) that dynamically
transforms high-level semantics between graph-structured
data of multiple domains such as knowledge graphs, images
and sequences. Experiments show that the proposed approach
generates structured and robust reports supported by accurate
abnormality descriptions and explainable attentive regions,
achieving state-of-the-art results on two medical
report benchmarks, with the best medical abnormality and
disease classification accuracy and improved human evaluation
performance.
Introduction
Beyond the traditional image captioning task (Xu et al.
2015; Karpathy and Fei-Fei 2015; Rennie et al. 2017) that
produces single-sentence descriptions, generating long and
semantically coherent stories or reports to describe visual
contents (e.g., images, videos) has recently attracted increasing
research interest (Liang et al. 2017; Huang et al. 2016;
Krause et al. 2017), and poses a more challenging
and realistic goal towards bridging visual patterns with
human linguistic descriptions. In particular, an outstanding
challenge in modeling long narratives from visual content is
∗This work was done when Christy Y. Li was at Petuum, Inc.
†Corresponding author.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
to balance between knowledge discovery and language modeling
(Karpathy and Fei-Fei 2015). Current visual text generation
approaches tend to produce sentences that sound plausible
under the language model but are poorly grounded in the
visual content. Although some approaches have been
proposed to alleviate this problem (Lu et al. 2018; Anderson et
al. 2018; Liang et al. 2017), most of them ignore the inter-
nal knowledge structure of the task at hand. However, most
real-world data and problems exhibit complex and dynamic
structures such as intrinsic relations among discrete enti-
ties under nature’s law (Taskar, Guestrin, and Koller 2004;
Hu et al. 2016; Strubell et al. 2018). The knowledge graph,
one of the most powerful representations of dynamic graph-structured
knowledge (Mitchell et al. 2018; Bizer, Heath,
and Berners-Lee 2011), complements learning-based approaches
by explicitly modeling the domain-specific knowledge
structure and relational inductive bias. Knowledge
graphs also allow incorporating priors, which has proven
useful for tasks where universal knowledge is desired or
certain constraints must be met (Battaglia et al. 2017;
Liang, Hu, and Xing 2018; Hu et al. 2018; X. Wang 2018).
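As a toy illustration of such explicit structure (a hypothetical sketch: the entities, the "suggests" relation, and the `KnowledgeGraph` helper below are illustrative stand-ins chosen for this example, not taken from the paper or from any real medical ontology), a domain knowledge graph can be stored as a simple relational structure and queried for priors:

```python
# Toy knowledge graph linking findings to related diseases.
# All entities and edges are illustrative, not a real medical ontology.
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        # Each node maps to a set of (relation, neighbor) pairs.
        self.edges = defaultdict(set)

    def add(self, head, relation, tail):
        self.edges[head].add((relation, tail))

    def related(self, node, relation):
        """Return all neighbors of `node` reached via `relation`."""
        return sorted(t for r, t in self.edges[node] if r == relation)

kg = KnowledgeGraph()
kg.add("opacity", "suggests", "pneumonia")
kg.add("opacity", "suggests", "effusion")
kg.add("blunted costophrenic angle", "suggests", "effusion")
kg.add("enlarged heart", "suggests", "cardiomegaly")

# A detector that finds "opacity" can consult the graph as a prior
# over which diseases are worth describing in the report.
print(kg.related("opacity", "suggests"))  # -> ['effusion', 'pneumonia']
```

In this sketch the graph constrains generation the way a prior would: only diseases reachable from detected findings are considered, which is one simple way to impose the relational inductive bias discussed above.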
As an emerging long text generation task of practical
use, medical image report generation (Li et al. 2018;
Jing, Xie, and Xing 2018) must satisfy stricter protocols
and ensure correct usage of medical terminology.
As shown in Figure 1, a medical report consists of a
finding section describing medical observations in detail,
covering both normal and abnormal features; an impression or
conclusion sentence indicating the most prominent medical
observation; and peripheral sections such as patient information
and indications. Among these sections, the finding section
is considered the most important component and is expected
to 1) cover contents of key relevant aspects such as
heart size, lung opacity, and bone structure; 2) correctly de-
tect any abnormalities and support with details such as the
location and shape of the abnormality; 3) describe potential
diseases such as effusion, pneumothorax and consolidation.
It is often observed that, to write a medical image report,
radiologists first check a patient’s images for abnormal find-
ings, then write reports by following certain patterns and
templates, and adjusting statements in the templates for each
individual case when necessary (Hong and Kahn 2013). To
mimic this procedure, we propose to formulate medical
report writing as a knowledge-driven encode, retrieve, paraphrase