Learning Cross-lingual Word Embeddings via Matrix Co-factorization
Tianze Shi Zhiyuan Liu Yang Liu Maosong Sun
State Key Laboratory of Intelligent Technology and Systems
Tsinghua National Laboratory for Information Science and Technology
Department of Computer Science and Technology
Tsinghua University, Beijing 100084, China
stz11@mails.tsinghua.edu.cn
{liuzy, liuyang2011, sms}@tsinghua.edu.cn
Abstract
A joint-space model for cross-lingual
distributed representations generalizes
language-invariant semantic features.
In this paper, we present a matrix co-
factorization framework for learning
cross-lingual word embeddings. We
explicitly define monolingual training
objectives in the form of matrix de-
composition, and induce cross-lingual
constraints for simultaneously factorizing
monolingual matrices. The cross-lingual
constraints can be derived from parallel
corpora, with or without word alignments.
Empirical results on a cross-lingual document
classification task show that our method
effectively encodes cross-lingual knowledge
as constraints for learning cross-lingual
word embeddings.
1 Introduction
Word embeddings allow one to represent words in
a continuous vector space that characterizes the
lexico-semantic relations among words. In many
NLP tasks they have proved to be high-quality features;
successful applications include language
modelling (Bengio et al., 2003), sentiment analysis
(Socher et al., 2011) and word sense discrimination
(Huang et al., 2012).
Just as words have synonyms within the same
language, there are also word pairs across
languages that share similar semantic
properties. Mikolov et al. (2013a) observed a strong
similarity of the geometric arrangements of cor-
responding concepts between the vector spaces of
different languages, and suggested that a cross-
lingual mapping between the two vector spaces is
technically plausible. Meanwhile, joint-space
models for cross-lingual word embeddings
are highly desirable, because language-invariant
semantic features make it easy to transfer
models across languages. This is especially
important for low-resource languages, where it
allows one to develop accurate word representations
by exploiting the abundant textual resources
of a resource-rich language such as English.
Joint-space models are thus not only technically
plausible, but also useful for cross-lingual model
transfer. Further, studies have shown that exploiting
cross-lingual correlations can improve the quality
of word representations trained solely on
monolingual corpora (Faruqui and Dyer, 2014).
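As a concrete illustration of the linear mapping suggested by Mikolov et al. (2013a), the minimal sketch below fits a matrix W by least squares over a seed dictionary of translation pairs, so that Wx_i ≈ z_i for each pair. The toy data, dimensionality, and the nearest_target helper are illustrative assumptions, not the authors' actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d = 1000, 100
X = rng.standard_normal((n_pairs, d))  # source vectors of seed dictionary pairs (toy data)
Z = rng.standard_normal((n_pairs, d))  # their target-language translations (toy data)

# Ordinary least squares: find W^T minimizing ||X W^T - Z||_F^2.
W_T, _, _, _ = np.linalg.lstsq(X, Z, rcond=None)

def nearest_target(x_vec, W_T, Z_vocab):
    """Project a source vector into the target space and return the
    index of the most cosine-similar target vector."""
    z_hat = x_vec @ W_T
    sims = (Z_vocab @ z_hat) / (np.linalg.norm(Z_vocab, axis=1)
                                * np.linalg.norm(z_hat))
    return int(np.argmax(sims))
```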
Defining a cross-lingual learning objective lies
at the core of a joint-space model. Hermann
and Blunsom (2014) and Chandar A P et
al. (2014) computed parallel sentence (or
document) representations and minimized the
differences between semantically equivalent
pairs. These methods are useful in capturing
semantic information carried by high-level units
(such as phrases and beyond) and usually do not
rely on word alignments. However, they suffer
from reduced accuracy for representing rare to-
kens, whose semantic information may not be well
generalized. In these cases, finer-grained information
at the lexical level, such as aligned word pairs,
dictionaries, and word translation probabilities, is
considered helpful.
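To make the sentence-level objective concrete, the sketch below composes each sentence as the sum of its word vectors and takes one stochastic gradient step on the squared distance between an aligned pair. The additive composition, vocabulary sizes, and learning rate are simplifying assumptions rather than the exact models of Hermann and Blunsom (2014) or Chandar A P et al. (2014); in practice a noise-contrastive term is added to rule out the degenerate all-zero solution.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V_src, V_tgt = 50, 5000, 6000
E_src = 0.1 * rng.standard_normal((V_src, d))  # source embedding table
E_tgt = 0.1 * rng.standard_normal((V_tgt, d))  # target embedding table

def sgd_step(src_ids, tgt_ids, lr=0.01):
    """One SGD step on L = ||sum_i e_i - sum_j f_j||^2 for a single
    aligned sentence pair; note that no word alignments are needed."""
    diff = E_src[src_ids].sum(axis=0) - E_tgt[tgt_ids].sum(axis=0)
    for i in src_ids:                 # dL/de_i =  2 * diff
        E_src[i] -= lr * 2.0 * diff
    for j in tgt_ids:                 # dL/df_j = -2 * diff
        E_tgt[j] += lr * 2.0 * diff
    return float(diff @ diff)         # current loss, for monitoring

# An aligned pair given as lists of word indices:
loss = sgd_step([3, 17, 42], [5, 99, 4, 12])
```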
Kočiský et al. (2014) integrated the word alignment
process and word embedding into machine
translation models. This method makes full use
of parallel corpora and produces high-quality
word alignments. However, it is unable to exploit the richer
monolingual corpora. On the other hand, Zou et al.
(2013) and Faruqui and Dyer (2014) learnt word
embeddings of different languages in separate
spaces from monolingual corpora and projected the
embeddings into a joint space, but such projections
can only capture linear transformations.
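The projection step can be illustrated with canonical correlation analysis, which Faruqui and Dyer (2014) used to map independently trained embeddings into a shared space; the toy matrices and component count below are assumptions for illustration. Because each language receives a single linear map, non-linear correspondences between the spaces remain out of reach.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# Rows are vectors of translation-dictionary pairs taken from two
# independently trained monolingual embedding spaces (toy data).
X = rng.standard_normal((500, 80))  # source-language vectors
Z = rng.standard_normal((500, 80))  # target-language vectors

cca = CCA(n_components=40, max_iter=2000)
cca.fit(X, Z)
X_joint, Z_joint = cca.transform(X, Z)  # linearly projected joint space
```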
In this paper, we address the above challenges
with a framework of matrix co-factorization. We