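BJTU-NLP's Hybrid Transliteration Model: A Study of Chinese/English Named Entities

"This paper is the report of the Beijing Jiaotong University natural language processing (BJTU-NLP) team at the Fifth Named Entity Workshop. It explores a hybrid transliteration model for Chinese/English named entities. The model combines several features, uses Wikipedia data to expand the training set, and applies pre-processing and post-processing rules to improve performance."

The paper describes the hybrid transliteration model that the BJTU-NLP team presented at the Fifth Named Entity Workshop in 2015, which targets Chinese-to-English and English-to-Chinese named entity transliteration. Named entity recognition (NER) is an important area of natural language processing; it involves identifying proper nouns in text, such as the names of people, places, and organizations. In cross-lingual settings, accurate conversion of named entities is essential for information retrieval, machine translation, and semantic understanding.

The hybrid transliteration model is the core of the paper; it fuses several methods to improve accuracy. Such a model may include statistical machine translation (SMT), deep learning models (such as neural networks), and rule-based methods. By combining these techniques, the system can better capture the transliteration patterns of named entities while reducing errors.

The paper notes that, to further improve performance, the researchers used external data, in particular data extracted from Wikipedia, to expand the training set. This exposes the model to a wider variety of named entities and thereby improves generalization. Applying pre-processing and post-processing rules is another key step. Pre-processing may include text cleaning, normalization, and entity recognition, while post-processing may involve disambiguation, error correction, and contextual consistency checks.

The experimental results show that the final performance of the BJTU-NLP system on the test corpus was comparable to other state-of-the-art systems of the time, which supports the effectiveness of the hybrid model. The work presents an innovative approach to named entity transliteration and offers a useful reference for future research on cross-lingual information processing. Further study and refinement of this hybrid model can be expected to yield additional gains in the accuracy of named entity recognition and transliteration.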

Proceedings of the Fifth Named Entity Workshop, joint with 53rd ACL and the 7th IJCNLP, pages 67–71, Beijing, China, July 26-31, 2015.

2015 Association for Computational Linguistics

A Hybrid Transliteration Model for Chinese/English Named Entities

—BJTU-NLP Report for the 5th Named Entities Workshop

Dandan Wang, Xiaohui Yang, Jinan Xu, Yufeng Chen, Nan Wang, Bojia Liu, Jian Yang, Yujie Zhang

School of Computer and Information Technology

Beijing Jiaotong University

{13120427, xhyang, jaxu, chenyf, 14120428, 14125181, 13120441, yjzhang}@bjtu.edu.cn

Abstract

This paper presents our system (the BJTU-NLP system) for the NEWS2015 evaluation task of Chinese-to-English and English-to-Chinese named entity transliteration. Our system adopts a hybrid machine transliteration approach that combines several features. To further improve the results, we use external data extracted from Wikipedia to expand the training set. In addition, pre-processing and post-processing rules are applied to further improve performance. The final performance on the test corpus shows that our system achieves results comparable to other state-of-the-art systems.

1 Introduction

Machine transliteration automatically transforms the script of a word from a source language into that of a target language. Knight (1998) proposes a phoneme-based approach to transliteration between English names and Japanese katakana. The phoneme-based approach needs a pronunciation dictionary for one or both languages. Such dictionaries often do not exist or cannot cover all names. Jia (2009) views machine transliteration as a special case of machine translation and uses a phrase-based machine translation model to solve it. However, using English letters and Chinese characters as the basic mapping units introduces ambiguity in the alignment and translation steps. Huang (2011) proposes a novel nonparametric Bayesian approach that uses synchronous adaptor grammars to model grapheme-based transliteration.

This paper describes the machine transliteration system, abbreviated as BJTU-NLP, and the data processing measures we used to participate in the NEWS2015 evaluation. We participated in two transliteration tasks: Chinese-to-English and English-to-Chinese named entity transliteration. This report briefly introduces the implementation framework of our machine transliteration system and analyzes the experimental results on the evaluation data.

The rest of the paper is organized as follows. Section 2 briefly introduces the implementation framework of the transliteration system. Section 3 describes the experiments and data processing. Section 4 presents and analyzes the experimental results. Section 5 gives our conclusions and future work.

2 System Description

By treating transliteration as a translation problem, BJTU-NLP implements a machine transliteration system that combines multiple features in a log-linear model, and uses it to run the corresponding experiments on English-Chinese and Chinese-English name pairs. The whole transliteration system is described below.

2.1 A Log-linear Machine Transliteration Model

In this evaluation, our machine transliteration system is built on a tool that fuses multiple features. We adopt a log-linear model for transliteration (Koehn et al., 2007) that combines these features. The transliteration process can be described as follows: for a given source-language name s, find the optimal result ê among all possible results e,

which is computed by:

$$\hat{e} = \arg\max_{e} \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(e, s)\right)}{\sum_{e'} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(e', s)\right)} \tag{1}$$
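To make the decision rule in Equation (1) concrete, the following is a minimal Python sketch of a log-linear model in its standard form: each candidate e is scored by a weighted sum of feature functions, the scores are normalized over all candidates, and the highest-scoring candidate is selected. It is an illustration only, not the BJTU-NLP implementation. The feature functions `h_table` and `h_length`, the weights, and the toy probability table are all hypothetical and are not taken from the paper.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A feature function h_m(e, s) maps (candidate, source) to a real value.
FeatureFn = Callable[[str, str], float]

# Hypothetical toy table of (source, candidate) probabilities.
# The numbers are invented purely for illustration.
TOY_TABLE: Dict[Tuple[str, str], float] = {
    ("史密斯", "Smith"): 0.7,
    ("史密斯", "Smyth"): 0.2,
    ("史密斯", "Simisi"): 0.1,
}


def h_table(e: str, s: str) -> float:
    """Log-probability of the pair from the toy table (small floor if unseen)."""
    return math.log(TOY_TABLE.get((s, e), 1e-6))


def h_length(e: str, s: str) -> float:
    """Hypothetical length feature: penalize a large length mismatch."""
    return -float(abs(len(e) - len(s)))


def score(e: str, s: str, features: Sequence[FeatureFn],
          weights: Sequence[float]) -> float:
    """Weighted feature sum: sum over m of lambda_m * h_m(e, s)."""
    return sum(w * h(e, s) for h, w in zip(features, weights))


def posterior(candidates: List[str], s: str, features: Sequence[FeatureFn],
              weights: Sequence[float]) -> Dict[str, float]:
    """Normalized log-linear probabilities over the candidate set."""
    scores = [score(e, s, features, weights) for e in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in scores]
    z = sum(exps)
    return {e: x / z for e, x in zip(candidates, exps)}


def best_candidate(candidates: List[str], s: str,
                   features: Sequence[FeatureFn],
                   weights: Sequence[float]) -> str:
    """Arg max over candidates; the normalizer is constant, so it is skipped."""
    return max(candidates, key=lambda e: score(e, s, features, weights))


if __name__ == "__main__":
    cands = ["Smith", "Smyth", "Simisi"]
    feats = [h_table, h_length]
    lambdas = [1.0, 0.5]  # hypothetical weights
    print(posterior(cands, "史密斯", feats, lambdas))
    print(best_candidate(cands, "史密斯", feats, lambdas))
```

Because the denominator is the same for every candidate, the arg max can be taken over the unnormalized scores alone; the normalized probabilities are only needed when candidate scores must be compared or reported as probabilities.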
