The MeSH-gram Neural Network Model: Extending Word Embedding Vectors with
MeSH Concepts for UMLS Semantic Similarity and Relatedness in the Biomedical
Domain
Saïd Abdeddaïm a, Sylvestre Vimard a, Lina F. Soualmia a
a Normandie Univ., UNIROUEN, UNIHAVRE, INSA Rouen, LITIS, F-76000, Rouen, France
Abstract
Eliciting semantic similarity between concepts remains a challenging task. Recent approaches founded on embedding vectors have
gained in popularity as they have proven able to efficiently capture semantic relationships. The underlying idea is that two words
that have close meanings occur in similar contexts. In this study, we propose a new neural network model named “MeSH-gram” which
relies on a straightforward approach that extends the skip-gram neural network model by considering MeSH (Medical Subject Headings)
descriptors instead of words. Trained on the publicly available PubMed/MEDLINE corpus, MeSH-gram is evaluated on reference
standards manually annotated for semantic similarity. MeSH-gram is first compared to skip-gram with vectors of size 300 and
several context window sizes. A deeper comparison is performed with twenty existing models. The Spearman’s rank correlations
between human scores and computed similarities show that MeSH-gram (i) outperforms the skip-gram model, and (ii) is comparable
to the best methods, which however require more computation and external resources.
Introduction
Eliciting semantic similarity and relatedness between concepts is a major issue in the biomedical domain. Different measures have
been proposed over the last decades [1]. Those measures quantify the degree to which two concepts are similar. They either rely on
knowledge-based approaches using ontologies and terminologies, or corpus-based approaches which are founded on distributional
statistics (e.g. literature-based drug discovery [2-5]). Several clinical applications of importance rely on semantic similarity and
relatedness [6], such as biomedical information extraction and retrieval, clinical decision support, or disease prediction. For instance,
biomedical information extraction and retrieval is improved by including semantically related terms and concepts [7-10]. The same
approaches are used in the task of summarizing Electronic Health Records [11,12] and in document clustering [13]. The prediction of
disease-causing genes and disease prediction from similar genes [14,15] rely on the identification of similar diseases [16] or genes
[17]. Other applications include drug re-purposing [18,19] and drug interaction [20].
The recent approaches that have given better results in semantic similarity and relatedness measures are founded on word embedding
vectors computed by neural networks. Indeed, such architectures implemented initially by word2vec [21], have gained in popularity in
the biomedical domain as they have proven able to efficiently capture semantic similarity and relatedness relationships between words and
concepts [22-27]. Word embedding is based on neural network language modeling, where words are mapped to fixed-dimension
vectors of real numbers. The similarity between words can thus be measured by the (cosine) similarity between vectors that are
constructed over a training corpus. All co-occurrences of a word and its neighbors (i.e. contexts) within a predefined window size are
considered. The idea behind those representation learning approaches is that two words that have close meaning have generally
similar contexts [28]. For example, the words “Epilepsy” and “Convulsion” will both have “Brain” and “Mind” as neighbors.
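The cosine-similarity computation mentioned above can be made concrete with a short sketch. The vector values below are made up for illustration; real embeddings are learned from a corpus and typically have hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional vectors (illustrative values, not trained embeddings)
epilepsy   = [0.8, 0.1, 0.6, 0.2]
convulsion = [0.7, 0.2, 0.5, 0.1]
fracture   = [0.1, 0.9, 0.0, 0.7]

print(cosine_similarity(epilepsy, convulsion))  # high: the vectors point the same way
print(cosine_similarity(epilepsy, fracture))    # much lower
```

Two words with similar contexts end up with vectors pointing in similar directions, hence a cosine close to 1.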
word2vec, developed by Mikolov et al. [21], is a neural network language model that learns word vectors by either maximizing the
probability of a word given its surrounding context, referred to as the CBOW (Continuous Bag Of Words) approach, or maximizing
the probability of the context given a word, referred to as the skip-gram approach.
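The (center word, context word) training pairs that skip-gram maximizes the probability of can be enumerated explicitly; a minimal sketch, leaving out the actual neural training step:

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate the (center, context) training pairs used by skip-gram:
    each word is paired with every neighbor within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "epilepsy is a brain disorder".split()
pairs = skipgram_pairs(sentence, window=1)
# With window=1, "epilepsy" is paired only with "is", "a" only with "is" and "brain", etc.
```

Widening the window trades syntactic precision for broader topical context, which is why the comparison in this study is run at several window sizes.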
In this study we propose a new method, named “MeSH-gram”, which relies on a straightforward approach: it computes the word
vectors by only using the MeSH (Medical Subject Headings) descriptors that are already included in the MEDLINE/PubMed corpus.
The MeSH-gram model extends the skip-gram neural network model used in word2vec [21] and fastText tools [29]. fastText is a
successful reimplementation of word2vec which is designed to compute the vector of each word using its neighbors. The extension
we propose in the MeSH-gram model replaces the neighbors by the MeSH descriptors of the abstract where each word occurs.
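Under our reading, the extension amounts to changing what counts as "context": instead of pairing each word with its window neighbors, every word of an abstract is paired with every MeSH descriptor indexers assigned to that abstract. A minimal sketch of this pair generation (`meshgram_pairs` is a hypothetical helper name; the actual model feeds such pairs through fastText's skip-gram machinery rather than materializing explicit lists):

```python
def meshgram_pairs(abstract_tokens, mesh_descriptors):
    """Pair each word of an abstract with every MeSH descriptor of that
    abstract, replacing skip-gram's window neighbors (illustrative sketch)."""
    return [(word, desc)
            for word in abstract_tokens
            for desc in mesh_descriptors]

# Toy abstract with the MeSH descriptors assigned to its citation
abstract = "seizures originate in the cortex".split()
mesh = ["Epilepsy", "Brain"]  # descriptors taken from the MEDLINE record
pairs = meshgram_pairs(abstract, mesh)
```

Because the descriptors are curated at the level of the whole abstract, every word in it is tied to the same controlled-vocabulary concepts, regardless of its position in the text.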
Related Works
Several semantic similarity and relatedness measures have been proposed over the last decades [27]. Many of them have been implemented
in the UMLS::Similarity package [30] available for the UMLS (Unified Medical Language System). They differ in the method used:
path-based, content-based, UMLS-based, corpus-based, and more recently, methods based on word vectors and concepts vectors.
Path-based measures [7] use the hierarchical structure of a taxonomy to measure similarity: concepts close to each other are more
similar. For instance, Sajadi et al. [31,32] developed a ranking algorithm based on Wikipedia graph metrics and used it to compare
biomedical concepts. Content-based information measures [33,34] quantify the amount of information a concept provides: the more