statistics. The GloVe model combines the advantages of the Word2vec model in learning representations based on context with those of matrix factorization methods in leveraging global co-occurrence statistics. The model is trained using a weighted least squares objective function such that the error between the model-predicted values and the global count statistics from the training corpus is minimized. The authors illustrated the importance of ratios of co-occurrence probabilities and proposed the base model as

$$F(u_i, u_j, v_k) = \frac{P_{ik}}{P_{jk}}$$

where $u_i$ and $u_j$ are focal word vectors, $v_k$ is the vector of the context word, and $P_{ik}$ and $P_{jk}$ represent the probabilities of words $i$ and $j$ co-occurring with word $k$.
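As an aside, the discriminative power of these probability ratios can be checked with a small numerical sketch. The counts below are invented toy values echoing the ice/steam illustration from the original GloVe paper, not figures from any corpus:

```python
import numpy as np

# Toy co-occurrence counts (illustrative values only).
# Rows: focal words i; columns: context words k.
words = ["ice", "steam"]
contexts = ["solid", "gas", "water", "fashion"]
X = np.array([
    [190,   7, 300, 2],   # counts of "ice"   with each context word
    [  4, 120, 310, 2],   # counts of "steam" with each context word
], dtype=float)

# P_ik = X_ik / X_i: probability that word k appears in the context of word i.
P = X / X.sum(axis=1, keepdims=True)

# The ratio P_ik / P_jk is large for contexts related only to "ice" ("solid"),
# small for contexts related only to "steam" ("gas"), and close to 1 for
# contexts related to both ("water") or to neither ("fashion").
for k, ctx in enumerate(contexts):
    print(f"{ctx:8s} P_ik/P_jk = {P[0, k] / P[1, k]:.2f}")
```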
To introduce linearity and avoid mixing vector dimensions, the authors introduced a vector difference and a dot product, respectively:

$$F\left((u_i - u_j)^T v_k\right) = \frac{P_{ik}}{P_{jk}} \tag{4}$$

Further, to account for the symmetry that word and context word are interchangeable in the co-occurrence matrix, the model takes the form

$$u_i^T v_k + b_i + b_k = \log(X_{ik})$$

where $X_{ik}$ represents the co-occurrence frequency of word $i$ with word $k$.
Finally, the vectors are learned with the weighted least squares objective function

$$J = \sum_{i,k=1}^{V} f(X_{ik})\left(u_i^T v_k + b_i + b_k - \log(X_{ik})\right)^2$$

where $u_i^T v_k + b_i + b_k$ represents the model-predicted value, $\log(X_{ik})$ represents the value calculated from the training corpus, and $V$ is the vocabulary size. Further, $f(x)$ is a weighting function included in the objective function so that rare or frequent co-occurrences are not overweighted, and it is defined as

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
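As a minimal sketch of how this objective could be evaluated, the numpy code below implements $f(x)$ and the weighted squared error. The values $x_{\max} = 100$ and $\alpha = 0.75$ are the defaults reported in the original GloVe paper; all function and variable names here are illustrative, not part of any reference implementation:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function: down-weights rare co-occurrences
    and caps the weight of very frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_objective(U, V_ctx, b, b_ctx, X):
    """Weighted least squares objective J.
    U, V_ctx: focal/context word vectors, shape (vocab, dim)
    b, b_ctx: focal/context biases, shape (vocab,)
    X: co-occurrence matrix, shape (vocab, vocab)"""
    pred = U @ V_ctx.T + b[:, None] + b_ctx[None, :]  # u_i^T v_k + b_i + b_k
    mask = X > 0  # log(X_ik) is only defined for nonzero counts
    err = np.where(mask, pred - np.log(np.where(mask, X, 1.0)), 0.0)
    return np.sum(f(X) * err ** 2)
```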
Table 2
Summary of various medical codes.

Schema | Description | Number of codes | Examples
ICD-10 (Diagnosis) | Prepared by the World Health Organization (WHO); contains codes for diseases, signs, symptoms, etc. | 68,000 | 'R070': Pain in throat; 'H612': Impacted cerumen
CPT (Procedures) | Prepared by the American Medical Association (AMA); contains codes for medical, surgical and diagnostic services. | 9,641 | '90658': Flu shot; '90716': Chicken pox vaccine
LOINC (Laboratory) | Prepared by the Regenstrief Institute, a US nonprofit medical research organization; contains codes for laboratory observations. | 80,868 | '8310-5': Body temperature; '5792-7': Glucose
RxNorm (Medications) | Prepared by the US National Library of Medicine and part of UMLS; contains codes for all the medications available in the US market. | 116,075 | '1191': Aspirin; '215256': Anacin
Table 3
Summary of embedding models.

Model | Architecture | Advantages | Disadvantages
CBOW [9] | Log Bilinear | Faster compared to the skipgram model; represents frequent words well. | Ignores morphological information as well as the polysemous nature of words; no embeddings for OOV, misspelled and rare words.
Skipgram [9] | Log Bilinear | Efficient with small training datasets; represents infrequent words well. | Ignores morphological information as well as the polysemous nature of words; no embeddings for OOV, misspelled and rare words.
PV-DM [23] | Log Bilinear | PV-DM alone gives good results for many tasks. | Compared to PV-DBOW, requires more memory as it needs to store the softmax weights as well as the word vectors.
PV-DBOW [23] | Log Bilinear | Needs to store only the word vectors and so requires less memory; simpler and faster compared to PV-DM. | Needs to be used along with PV-DM to give consistent results across tasks.
GloVe [10] | Log Bilinear | Combines the advantages of the word2vec model in learning representations based on context with those of matrix factorization methods in leveraging global co-occurrence statistics. | Ignores morphological information as well as the polysemous nature of words; no embeddings for OOV, misspelled and rare words.
FastText [11] | Log Bilinear | Encodes morphological information in word vectors; embeddings for OOV, misspelled and rare words; pretrained word vectors for 157 languages. | Computationally intensive, and memory requirements increase with the size of the corpus; ignores the polysemous nature of words.
ELMo [12] | BiLSTM | Generates context-dependent vector representations and hence accounts for the polysemous nature of words; embeddings for OOV, misspelled and rare words. | Computationally intensive and hence requires more training time.
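To make the OOV advantage of FastText noted in Table 3 concrete, the following sketch uses gensim's FastText class (parameter names follow gensim 3.x); the toy corpus and the misspelling 'glucse' are invented for illustration:

```python
from gensim.models import FastText

# Toy corpus for illustration only; real input would be a tokenized clinical corpus.
sentences = [["glucose", "level", "in", "blood"],
             ["body", "temperature", "measured"]]

model = FastText(sentences, size=100, window=5, min_count=1)

# FastText builds vectors from character n-grams, so even a misspelled,
# out-of-vocabulary token such as "glucse" still receives an embedding.
vector = model.wv["glucse"]
print(vector.shape)  # (100,)
```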
Table 4
Summary of hyperparameters in the Word2Vec model.

Parameter | Default value | Meaning
size | 100 | Dimension of each word vector.
window | 5 | Size of the context window.
min_count | 5 | Minimum frequency of a word to be included in the vocabulary.
workers | 3 | Number of threads used to train the model.
sg | 0 | 0 means the CBOW model is used; 1 means skipgram is used.
hs | 0 | 1 means hierarchical softmax is used; 0 together with a non-zero 'negative' means negative sampling is used.
negative | 5 | 0 means no negative sampling; a non-zero value means negative sampling is applied, and the value represents the number of noise words to be used.
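The hyperparameters in Table 4 map directly onto gensim's Word2Vec constructor, as the sketch below shows (parameter names follow gensim 3.x, where the vector dimension is called size; the two-sentence corpus is a toy example, and min_count is lowered to 1 so the example runs on it):

```python
from gensim.models import Word2Vec

# Hypothetical toy corpus; in practice this would be tokenized clinical text.
sentences = [["patient", "reports", "pain", "in", "throat"],
             ["glucose", "level", "within", "normal", "range"]]

model = Word2Vec(sentences,
                 size=100,      # dimension of each word vector
                 window=5,      # size of the context window
                 min_count=1,   # lowered from the default 5 for this toy corpus
                 workers=3,     # number of training threads
                 sg=0,          # 0 = CBOW, 1 = skipgram
                 hs=0,          # 0 with negative > 0 => negative sampling
                 negative=5)    # number of noise words for negative sampling

vector = model.wv["glucose"]   # 100-dimensional embedding
```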