The MeSH-gram Neural Network Model: Extending Word Embedding Vectors with
MeSH Concepts for UMLS Semantic Similarity and Relatedness in the Biomedical
Domain
Saïd Abdeddaïm a, Sylvestre Vimard a, Lina F. Soualmia a
a Normandie Univ., UNIROUEN, UNIHAVRE, INSA Rouen, LITIS, F-76000, Rouen, France
Abstract
Eliciting semantic similarity between concepts remains a challenging task. Recent approaches founded on embedding vectors have
gained in popularity as they have proven able to efficiently capture semantic relationships. The underlying idea is that two words
that have close meanings occur in similar contexts. In this study, we propose a new neural network model named “MeSH-gram” which
relies on a straightforward approach that extends the skip-gram neural network model by considering MeSH (Medical Subject Headings)
descriptors instead of words. Trained on the publicly available PubMed/MEDLINE corpus, MeSH-gram is evaluated on reference
standards manually annotated for semantic similarity. MeSH-gram is first compared to skip-gram with vectors of size 300 and
several context window sizes. A deeper comparison is performed with twenty existing models. The Spearman’s rank correlations
between human scores and computed similarities show that MeSH-gram (i) outperforms the skip-gram model, and (ii) is comparable
to the best methods, which however require more computation and external resources.
Introduction
Eliciting semantic similarity and relatedness between concepts is a major issue in the biomedical domain. Different measures have
been proposed over the last decades [1]. Those measures quantify the degree to which two concepts are similar. They either rely on
knowledge-based approaches using ontologies and terminologies, or corpus-based approaches which are founded on distributional
statistics (e.g. literature-based drug discovery [2-5]). Several clinical applications of importance rely on semantic similarity and
relatedness [6], such as biomedical information extraction and retrieval, clinical decision support, or disease prediction. For instance,
biomedical information extraction and retrieval is improved by including semantically related terms and concepts [7-10]. The same
approaches are used in the task of summarizing Electronic Health Records [11,12] and in document clustering [13]. The prediction of
disease-causing genes and disease prediction from similar genes [14,15] rely on the identification of similar diseases [16] or genes
[17]. Other applications include drug re-purposing [18,19] and drug interaction [20].
The recent approaches that have given better results in semantic similarity and relatedness measures are founded on word embedding
vectors computed by neural networks. Indeed, such architectures implemented initially by word2vec [21], have gained in popularity in
the biomedical domain as they have proven able to efficiently capture semantic similarity and relatedness relationships between words and
concepts [22-27]. Word embedding is based on neural network language modeling, where words are mapped to fixed-dimension
vectors of real numbers. The similarity between words can thus be measured by the (cosine) similarity between vectors that are
constructed over a training corpus. All co-occurrences of a word and its neighbors (i.e. contexts) within a predefined window size are
considered. The idea behind those representation learning approaches is that two words that have close meaning have generally
similar contexts [28]. For example, the words “Epilepsy” and “Convulsion” will both have “Brain” and “Mind” as neighbors.
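The cosine-similarity computation mentioned above can be made concrete with a short sketch. The vector values below are made up for illustration; real embeddings are learned from a corpus and typically have hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional vectors (illustrative values, not trained embeddings)
epilepsy   = [0.8, 0.1, 0.6, 0.2]
convulsion = [0.7, 0.2, 0.5, 0.1]
fracture   = [0.1, 0.9, 0.0, 0.7]

print(cosine_similarity(epilepsy, convulsion))  # high: the vectors point the same way
print(cosine_similarity(epilepsy, fracture))    # much lower
```

Two words with similar contexts end up with vectors pointing in similar directions, hence a cosine close to 1.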
word2vec, developed by Mikolov et al. [21], is a neural network language model that learns word vectors by either maximizing the
probability of a word given its surrounding context, referred to as the CBOW (Continuous Bag Of Words) approach, or maximizing
the probability of the context given a word, referred to as the skip-gram approach.
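The (center word, context word) training pairs that skip-gram maximizes the probability of can be enumerated explicitly; a minimal sketch, leaving out the actual neural training step:

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate the (center, context) training pairs used by skip-gram:
    each word is paired with every neighbor within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "epilepsy is a brain disorder".split()
pairs = skipgram_pairs(sentence, window=1)
# With window=1, "epilepsy" is paired only with "is", "a" only with "is" and "brain", etc.
```

Widening the window trades syntactic precision for broader topical context, which is why the comparison in this study is run at several window sizes.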
In this study we propose a new method, named “MeSH-gram”, which relies on a straightforward approach: it computes the word
vectors by only using the MeSH (Medical Subject Headings) descriptors that are already included in the MEDLINE/PubMed corpus.
The MeSH-gram model extends the skip-gram neural network model used in word2vec [21] and fastText tools [29]. fastText is a
successful reimplementation of word2vec which is designed to compute the vector of each word using its neighbors. The extension
we propose in the MeSH-gram model replaces the neighbors by the MeSH descriptors of the abstract where each word occurs.
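Under our reading, the extension amounts to changing what counts as "context": instead of pairing each word with its window neighbors, every word of an abstract is paired with every MeSH descriptor indexers assigned to that abstract. A minimal sketch of this pair generation (`meshgram_pairs` is a hypothetical helper name; the actual model feeds such pairs through fastText's skip-gram machinery rather than materializing explicit lists):

```python
def meshgram_pairs(abstract_tokens, mesh_descriptors):
    """Pair each word of an abstract with every MeSH descriptor of that
    abstract, replacing skip-gram's window neighbors (illustrative sketch)."""
    return [(word, desc)
            for word in abstract_tokens
            for desc in mesh_descriptors]

# Toy abstract with the MeSH descriptors assigned to its citation
abstract = "seizures originate in the cortex".split()
mesh = ["Epilepsy", "Brain"]  # descriptors taken from the MEDLINE record
pairs = meshgram_pairs(abstract, mesh)
```

Because the descriptors are curated at the level of the whole abstract, every word in it is tied to the same controlled-vocabulary concepts, regardless of its position in the text.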
Related Works
Several semantic similarity and relatedness measures have been proposed over the last decades [27]. Many of them have been implemented
in the UMLS::Similarity package [30] available for the UMLS (Unified Medical Language System). They differ in the method used:
path-based, content-based, UMLS-based, corpus-based, and more recently, methods based on word vectors and concepts vectors.
Path-based measures [7] use the hierarchical structure of a taxonomy to measure similarity: concepts close to each other are more
similar. For instance, Sajadi et al. [31,32] developed a ranking algorithm based on Wikipedia graph metrics and used it to compare
biomedical concepts. Content-based information measures [33,34] quantify the amount of information a concept provides: the more