statistics. The GloVe model combines the advantages of the Word2vec model in learning representations based on context with those of matrix factorization methods in leveraging global co-occurrence statistics. The model is trained using a weighted least squares objective function such that the error between the model-predicted values and the global count statistics from the training corpus is minimized. The authors illustrated the importance of ratios of co-occurrence probabilities and proposed the base model as

$$F(u_i, u_j, v_k) = \frac{P_{ik}}{P_{jk}}$$

where $u_i$ and $u_j$ are focal word vectors, $v_k$ is the vector of the context word, and $P_{ik}$ and $P_{jk}$ represent the probabilities of words $i$ and $j$ co-occurring with word $k$.
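As an aside, the discriminative power of these probability ratios can be checked with a small numerical sketch. The counts below are invented toy values echoing the ice/steam illustration from the original GloVe paper, not figures from any corpus:

```python
import numpy as np

# Toy co-occurrence counts (illustrative values only).
# Rows: focal words i; columns: context words k.
words = ["ice", "steam"]
contexts = ["solid", "gas", "water", "fashion"]
X = np.array([
    [190,   7, 300, 2],   # counts of "ice"   with each context word
    [  4, 120, 310, 2],   # counts of "steam" with each context word
], dtype=float)

# P_ik = X_ik / X_i: probability that word k appears in the context of word i.
P = X / X.sum(axis=1, keepdims=True)

# The ratio P_ik / P_jk is large for contexts related only to "ice" ("solid"),
# small for contexts related only to "steam" ("gas"), and close to 1 for
# contexts related to both ("water") or to neither ("fashion").
for k, ctx in enumerate(contexts):
    print(f"{ctx:8s} P_ik/P_jk = {P[0, k] / P[1, k]:.2f}")
```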
To introduce linearity and avoid mixing vector dimensions, the authors introduced a vector difference and a dot product, respectively:

$$F\left((u_i - u_j)^T v_k\right) = \frac{P_{ik}}{P_{jk}} \tag{4}$$

Further, to account for the symmetry that word and context word are interchangeable in the co-occurrence matrix, the model takes the form

$$u_i^T v_k + b_i + b_k = \log(X_{ik})$$

where $X_{ik}$ represents the co-occurrence frequency of word $i$ with word $k$.
Finally, the vectors are learned with the weighted least squares objective function

$$J = \sum_{i,k=1}^{V} f(X_{ik})\left(u_i^T v_k + b_i + b_k - \log(X_{ik})\right)^2$$

where $u_i^T v_k + b_i + b_k$ represents the model-predicted value, $\log(X_{ik})$ represents the value calculated from the training corpus, and $V$ is the vocabulary size. Further, $f(x)$ is a weighting function included in the objective function so that rare or frequent co-occurrences are not overweighted, and it is defined as

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
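As a minimal sketch of how this objective could be evaluated, the numpy code below implements $f(x)$ and the weighted squared error. The values $x_{\max} = 100$ and $\alpha = 0.75$ are the defaults reported in the original GloVe paper; all function and variable names here are illustrative, not part of any reference implementation:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function: down-weights rare co-occurrences
    and caps the weight of very frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_objective(U, V_ctx, b, b_ctx, X):
    """Weighted least squares objective J.
    U, V_ctx: focal/context word vectors, shape (vocab, dim)
    b, b_ctx: focal/context biases, shape (vocab,)
    X: co-occurrence matrix, shape (vocab, vocab)"""
    pred = U @ V_ctx.T + b[:, None] + b_ctx[None, :]  # u_i^T v_k + b_i + b_k
    mask = X > 0  # log(X_ik) is only defined for nonzero counts
    err = np.where(mask, pred - np.log(np.where(mask, X, 1.0)), 0.0)
    return np.sum(f(X) * err ** 2)
```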
Table 2
Summary of various medical codes.

Schema | Description | Number of codes | Examples
ICD-10 (Diagnosis) | Prepared by the World Health Organization (WHO); contains codes for diseases, signs, symptoms, etc. | 68,000 | 'R070': Pain in throat; 'H612': Impacted cerumen
CPT (Procedures) | Prepared by the American Medical Association (AMA); contains codes for medical, surgical and diagnostic services. | 9,641 | '90658': Flu shot; '90716': Chicken pox vaccine
LOINC (Laboratory) | Prepared by the Regenstrief Institute, a US nonprofit medical research organization; contains codes for laboratory observations. | 80,868 | '8310-5': Body temperature; '5792-7': Glucose
RxNorm (Medications) | Prepared by the US National Library of Medicine and part of UMLS; contains codes for all the medications available in the US market. | 116,075 | '1191': Aspirin; '215256': Anacin
Table 3
Summary of embedding models.

Model | Architecture | Advantages | Disadvantages
CBOW [9] | Log Bilinear | Faster compared to the skipgram model; represents frequent words well. | Ignores morphological information as well as the polysemous nature of words; no embeddings for OOV, misspelled and rare words.
Skipgram [9] | Log Bilinear | Efficient with small training datasets; represents infrequent words well. | Ignores morphological information as well as the polysemous nature of words; no embeddings for OOV, misspelled and rare words.
PV-DM [23] | Log Bilinear | PV-DM alone gives good results for many tasks. | Compared to PV-DBOW, requires more memory as it needs to store the softmax weights as well as the word vectors.
PV-DBOW [23] | Log Bilinear | Needs to store only the word vectors and so requires less memory; simpler and faster compared to PV-DM. | Needs to be used along with PV-DM to give consistent results across tasks.
GloVe [10] | Log Bilinear | Combines the advantages of the word2vec model in learning representations based on context with those of matrix factorization methods in leveraging global co-occurrence statistics. | Ignores morphological information as well as the polysemous nature of words; no embeddings for OOV, misspelled and rare words.
FastText [11] | Log Bilinear | Encodes morphological information in word vectors; embeddings for OOV, misspelled and rare words; pretrained word vectors for 157 languages. | Computationally intensive, and memory requirements increase with the size of the corpus; ignores the polysemous nature of words.
ELMo [12] | BiLSTM | Generates context-dependent vector representations and hence accounts for the polysemous nature of words; embeddings for OOV, misspelled and rare words. | Computationally intensive and hence requires more training time.
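To make the OOV advantage of FastText noted in Table 3 concrete, the following sketch uses gensim's FastText class (parameter names follow gensim 3.x); the toy corpus and the misspelling 'glucse' are invented for illustration:

```python
from gensim.models import FastText

# Toy corpus for illustration only; real input would be a tokenized clinical corpus.
sentences = [["glucose", "level", "in", "blood"],
             ["body", "temperature", "measured"]]

model = FastText(sentences, size=100, window=5, min_count=1)

# FastText builds vectors from character n-grams, so even a misspelled,
# out-of-vocabulary token such as "glucse" still receives an embedding.
vector = model.wv["glucse"]
print(vector.shape)  # (100,)
```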
Table 4
Summary of hyperparameters in the Word2Vec model.

Parameter | Default value | Meaning
size | 100 | Dimension of each word vector.
window | 5 | Size of the context window.
min_count | 5 | Minimum frequency of a word to be included in the vocabulary.
workers | 3 | Number of threads used to train the model.
sg | 0 | 0 means the CBOW model is used; 1 means skipgram is used.
hs | 0 | 1 means hierarchical softmax is used; 0 together with a non-zero 'negative' means negative sampling is used.
negative | 5 | 0 means no negative sampling; a non-zero value means negative sampling is applied, and the value represents the number of noise words to be used.
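The hyperparameters in Table 4 map directly onto gensim's Word2Vec constructor, as the sketch below shows (parameter names follow gensim 3.x, where the vector dimension is called size; the two-sentence corpus is a toy example, and min_count is lowered to 1 so the example runs on it):

```python
from gensim.models import Word2Vec

# Hypothetical toy corpus; in practice this would be tokenized clinical text.
sentences = [["patient", "reports", "pain", "in", "throat"],
             ["glucose", "level", "within", "normal", "range"]]

model = Word2Vec(sentences,
                 size=100,      # dimension of each word vector
                 window=5,      # size of the context window
                 min_count=1,   # lowered from the default 5 for this toy corpus
                 workers=3,     # number of training threads
                 sg=0,          # 0 = CBOW, 1 = skipgram
                 hs=0,          # 0 with negative > 0 => negative sampling
                 negative=5)    # number of noise words for negative sampling

vector = model.wv["glucose"]   # 100-dimensional embedding
```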