A Method of Unknown Words Processing
for Neural Machine Translation Using HowNet
Shaotong Li, JinAn Xu(✉), Yujie Zhang, and Yufeng Chen
School of Computer and Information Technology,
Beijing Jiaotong University, Beijing, China
{shaotongli,jaxu,yjzhang,chenyf}@bjtu.edu.cn
Abstract. An inherent weakness of neural machine translation (NMT) systems
is their inability to correctly translate unknown words. Traditional methods for
processing unknown words are usually based on word vectors trained on
large-scale monolingual corpora, and replace each unknown word according to
word vector similarity. However, this approach suffers from two weaknesses:
first, the resulting vectors of unknown words are not of high quality; second, it
is difficult to deal with polysemous words. This paper proposes an unknown word
processing method that integrates HowNet, using the concepts and sememes in
HowNet to find replacement words for unknown words. Experimental results
show that the proposed method not only improves the performance of NMT,
but also offers some advantages over traditional unknown word processing
methods.
Keywords: NMT · Unknown words · HowNet · Concept · Sememe
End-to-end NMT is a machine translation method proposed in recent years [1–4].
Most NMT systems are based on the encoder-decoder framework: the encoder
encodes the source sentence into a vector, and the decoder decodes the vector
into the target sentence. Compared with traditional statistical machine translation
(SMT), NMT has many advantages and has shown strong performance in many
translation tasks.
However, NMT still suffers from the unknown word problem, which is caused by its
limited vocabulary size. To control the time and memory costs of the model,
NMT usually uses small vocabularies on the source side and the target side [5].
Words that are not in the vocabulary are unknown words and are replaced by an
“UNK” symbol. A feasible way to address this problem is to find in-vocabulary
substitutes for the unknown words. Li et al. proposed a replacement method
based on word vector similarity [5]: unknown words are replaced by in-vocabulary
synonyms selected using the cosine distance between word vectors together with a
language model. However, this method has some unavoidable problems. First, the
vectors of rare words are difficult to train; second, the trained word vectors cannot
express the various senses of polysemous words and therefore cannot adapt the
replacement of polysemous words to different contexts.
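The similarity-based replacement described above can be illustrated with a minimal sketch. It is not the implementation of Li et al. [5]: it ranks candidates by cosine similarity only and omits the language model, and the function names, the toy vocabulary, and the two-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def replace_unknown(tokens, vocab, vectors, unk="UNK"):
    """Replace each out-of-vocabulary token with its most similar
    in-vocabulary word; fall back to the UNK symbol when the token
    has no vector or no candidate is available."""
    # Only in-vocabulary words that have a vector can serve as substitutes.
    candidates = [w for w in sorted(vocab) if w in vectors]
    result = []
    for tok in tokens:
        if tok in vocab:
            result.append(tok)
        elif tok in vectors and candidates:
            best = max(candidates, key=lambda w: cosine(vectors[tok], vectors[w]))
            result.append(best)
        else:
            result.append(unk)
    return result


# Toy data (hypothetical): "kitten" is out of vocabulary but has a vector,
# while "zzz" has no vector at all.
vocab = {"the", "cat", "sat", "dog"}
vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.1, 0.9],
    "kitten": [0.8, 0.2],
}

print(replace_unknown(["the", "kitten", "sat", "zzz"], vocab, vectors))
# prints ['the', 'cat', 'sat', 'UNK']
```

The sketch also makes the two weaknesses concrete: a rare word such as "kitten" gets a good substitute only if its vector is well trained, and because each word has a single vector, a polysemous word receives the same substitute in every context.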
To solve these problems, this paper proposes an unknown word processing
method based on HowNet. This method uses HowNet’s concepts and sememes as well
© Springer Nature Singapore Pte Ltd. 2017
D.F. Wong and D. Xiong (Eds.): CWMT 2017, CCIS 787, pp. 20–29, 2017.
https://doi.org/10.1007/978-981-10-7134-8_3