MULTEXT-East Corpus Reader and POS Tagger in NLTK


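This project report was written by Alexander Böhm, Thomas Stieglmaier, and Thomas Ziegler of the Faculty of Computer Science and Mathematics at the University of Passau, Germany, and was submitted on 15 October 2015. Their work centers on a corpus reader and Part-of-Speech (POS) taggers for the MULTEXT-East corpus, developed for the Natural Language Toolkit (NLTK). The MULTEXT-East corpus is distinctive in that it contains translations of George Orwell's novel "1984" into multiple languages together with detailed part-of-speech annotation, which is unusual among the mostly monolingual corpora available in NLTK. The corpus is annotated with MSD (morphosyntactic description) tags. These tags encode fine-grained morphosyntactic features of each word, which makes the corpus a useful resource for cross-language analysis and comparison.

The core of the report is an evaluation of different POS taggers built on NLTK and scikit-learn. The authors focus on how tagger performance differs across languages and tagsets. They find that, overall, the taggers perform better on more abstract tagsets, possibly because such tagsets generalize better across languages and therefore allow higher accuracy in a multilingual setting.

Besides extending NLTK, the work gives researchers and practitioners in multilingual natural language processing a tool for using the MULTEXT-East corpus in cross-language text analysis and machine learning tasks. It also shows how the choice of tagset for a given task affects model performance in a multilingual setting.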

3.1. MULTEXT-East Integration in NLTK 3. Implementation

Figure 3.1: Overview of the architecture of the corpus reader for MULTEXT-East inside NLTK

3.1.1 The MTECorpusReader

The MTECorpusReader extends the TaggedCorpusReader class of NLTK. A constructor argument selects which languages of the MULTEXT-East corpus should be loaded. As with all other corpus readers, the words, sentences, paragraphs, tagged words, and tagged sentences can be retrieved. Additionally, there are methods that return the lemmatized words and sentences.

For all methods that return tagged items, a filter can be passed as a parameter so that, for example, only words whose tag matches #N--s are in the returned list. In the filter, every - is treated as unspecified and matches an arbitrary value; if the given filter is too short, it is padded internally with - to the required length. With the example filter applied to the English file, we would therefore retrieve all singular nouns, regardless of gender or of whether the noun is proper or common.
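
The filter semantics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the MTECorpusReader: the function name msd_matches is hypothetical, and padding the tag itself (not just the filter) is an assumption of this sketch.

```python
def msd_matches(tag, msd_filter):
    """Return True if an MSD tag satisfies a filter in which '-' means 'unspecified'."""
    # Drop the leading '#' that precedes MSD tags in the reader's output.
    tag = tag.lstrip("#")
    msd_filter = msd_filter.lstrip("#")
    # Pad the shorter string with '-' so both have the same length.
    # (Padding the tag as well is an assumption of this sketch.)
    width = max(len(tag), len(msd_filter))
    tag = tag.ljust(width, "-")
    msd_filter = msd_filter.ljust(width, "-")
    # Every position that the filter specifies must agree with the tag.
    return all(f == "-" or f == t for f, t in zip(msd_filter, tag))


# Hypothetical (word, tag) pairs in the style of the English file.
tagged = [("It", "#Pp3ns"), ("day", "#Ncns"), ("clocks", "#Ncnp"), ("Winston", "#Npms")]
singular_nouns = [word for word, tag in tagged if msd_matches(tag, "#N--s")]
print(singular_nouns)  # ['day', 'Winston']
```

A shorter filter such as #N is padded to the tag length and therefore matches every noun in this sketch.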

3.1.2 The MTETagConverter

The MTETagConverter class contains the method for converting MSD tags into the matching Universal tags. A mapping from MSD to Universal tags is predefined by us and can be used out of the box. If any other tagset is required, a mapping file that contains a mapping from the MSD tagset to the target tagset can be added.

In this case it is important that each MSD tag is mapped to no more than one target tag. Multiple MSD tags can still map to the same target tag.
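
The many-to-one constraint on a custom mapping can be illustrated with a small loader. This is a sketch only: the two-column "MSD tag, target tag" line format, the function names load_tag_mapping and convert_tags, and the fallback tag "X" are assumptions made for the example and are not taken from the MTETagConverter.

```python
def load_tag_mapping(lines):
    """Parse 'MSD_TAG TARGET_TAG' lines into a dict, rejecting ambiguous MSD tags."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        msd, target = line.split()
        # One MSD tag must not map to two different target tags.
        if msd in mapping and mapping[msd] != target:
            raise ValueError(f"{msd!r} maps to both {mapping[msd]!r} and {target!r}")
        mapping[msd] = target
    return mapping


def convert_tags(tagged_words, mapping, unknown="X"):
    """Replace each MSD tag by its target tag; unmapped tags get the fallback."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_words]


# Several MSD tags may share one target tag (here: both noun tags map to NOUN).
mapping = load_tag_mapping(["#Ncns\tNOUN", "#Npms\tNOUN", "#Pp3ns\tPRON"])
print(convert_tags([("It", "#Pp3ns"), ("day", "#Ncns")], mapping))
```

Storing the mapping as a plain dictionary makes the "at most one target per MSD tag" rule structural: a key can hold only one value, and the loader reports conflicting lines instead of silently overwriting them.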

3.1.3 The MTEDownloader

The MTEDownloader is a standalone download manager for obtaining the files when the NLTK download functionality cannot be used. It can be started either by executing the script as a Python program (it has a main method) or by calling MTEDownloader.download() directly. First, one chooses the installation directory; then the corpus is downloaded from clarin.si and extracted.

3.1.4 Sample Usage of our Corpus Reader Implementation

The following code shows some basic examples of how the corpus reader, the downloader, and the provided utility methods can be used:

> # at first import all necessary files
> import MTEDownloader, mte
> # then (if not yet done) download the Multext-East corpus
> # this could also be done via the nltk downloader nltk.download()
> MTEDownloader.download()
Where should the corpus be saved? [(0, '/home/stieglma/nltk_data'), (1, '/usr/share/nltk_data'), (2, '/usr/local/share/nltk_data'), (3, '/usr/lib/nltk_data'), (4, '/usr/local/lib/nltk_data'), (5, 'custom')] [0]: 0
Downloaded 14800805 of 14800805 bytes (100.00%)
Download finished
Extracting files...
Done
> # if you do not have an nltk version where our corpus reader is already
> # integrated, you have to manually create it
> # now we open the English version of the book 1984 with our reader
> reader = mte.MTECorpusReader(root="/path/to/multext/corpus/", fileids=['oana-en.xml'])
> # otherwise you can just do the following
> from nltk.corpus import multext_east as reader
> # and then we retrieve the first word in the first word/tag tuple of this file
> reader.tagged_sents(fileids="oana-en.xml")[0][0]
('It', '#Pp3ns')
> # the tag is now in the Multext-East (MSD) format, we want it to be
> # the more well known corresponding universal tag:
> reader.tagged_sents(fileids="oana-en.xml", tagset="universal")[0][0][1]
'PRON'
> # now we want to see something in the concordance view:
> from nltk import Text
> Text(reader.words(fileids="oana-en.xml")).concordance("brother")
Displaying 2 of 80 matches:
ollow you about when you move . Big Brother is watching you , the caption benea
se-front immediately opposite . Big Brother is watching you , the caption said


3.2 Part-of-Speech Taggers for MULTEXT-East

For the evaluation of our work we implemented Part-of-Speech taggers based on different algorithms. On the one hand we have Brill's algorithm, which is shipped with NLTK; on the other hand we have three machine learning algorithms that are shipped with scikit-learn. The following sections give an overview of the implementations and show how they are used.

3.2.1 The MTEBrillTagger

The implementation contains a wrapper around NLTK's implementation of the Brill Tagger. It builds a Brill Tagger on top of an initial tagger, which can be specified; by default a unigram tagger is used. Additionally, it contains a method to evaluate the tagger.

The Brill Tagger for the MULTEXT-East corpus can be configured like the standard NLTK Brill Tagger. Additionally, the set of templates can be specified. Either a function from nltk.tag.brill returning a list of templates (e.g. fntbl37) can be specified as a string, or a list of templates can be passed in the template parameter.

The MTEBrillTagger needs a set of tagged sentences to train the tagger. It is possible to tune the behavior of the tagger by giving additional parameters like the maximum number of rules or a different initial tagger.

The evaluate method takes a set of test sentences to evaluate the already trained tagger. It prints the accuracy.

The metrics method takes a set of sentences for testing and gives more detailed information about the evaluation: accuracy, precision, recall, F-score, and out-of-vocabulary words. Additionally, a confusion matrix can be generated.
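
To make these quantities concrete, the following sketch computes them from a gold and a predicted tagging. It is an assumption-laden illustration rather than the metrics method of the MTEBrillTagger: the function name tagging_metrics, the per-tag reporting of precision and recall, and the definition of the out-of-vocabulary rate as the share of test tokens unseen in training are all choices made for this example.

```python
from collections import Counter


def tagging_metrics(gold, predicted, known_words):
    """Compare two equally long lists of (word, tag) pairs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
        if gold_tag == pred_tag:
            tp[gold_tag] += 1
        else:
            fp[pred_tag] += 1  # the predicted tag was assigned wrongly
            fn[gold_tag] += 1  # the gold tag was missed
    per_tag = {}
    for tag in set(tp) | set(fp) | set(fn):
        p = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        r = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        per_tag[tag] = {"precision": p, "recall": r, "f_score": f}
    # Out-of-vocabulary: test tokens that never occurred in the training data.
    oov = sum(word not in known_words for word, _ in gold)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "per_tag": per_tag,
        "oov_rate": oov / len(gold),
    }


gold = [("It", "PRON"), ("was", "VERB"), ("a", "DET"), ("bright", "ADJ"), ("day", "NOUN")]
pred = [("It", "PRON"), ("was", "VERB"), ("a", "DET"), ("bright", "NOUN"), ("day", "NOUN")]
metrics = tagging_metrics(gold, pred, known_words={"It", "was", "a", "day"})
print(metrics["accuracy"], metrics["oov_rate"])
```

In this toy case the single error on the unseen word "bright" lowers the precision of NOUN and the recall of ADJ, which is the kind of detail that accuracy alone hides.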

To evaluate the Brill Tagger there is the class BrillTaggerEval. It takes a whole corpus tagged with MSD tags and performs the ten-fold cross-validation described in Section 4. It also takes a string specifying the tagset as well as a few parameters to tune the Brill Tagger. The evaluate method performs an n-fold cross-validation, where n defaults to ten. The output is the average of each value reported by the metrics method of the MTEBrillTagger. Confusion matrices are currently not supported by this method.
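
The general shape of an n-fold cross-validation over tagged sentences can be sketched as follows. This is not the BrillTaggerEval class: to stay self-contained, a most-frequent-tag baseline stands in for the Brill Tagger, and the interleaved fold assignment as well as the names train_baseline and cross_validate are assumptions of the example.

```python
from collections import Counter, defaultdict


def train_baseline(train_sents):
    """Most-frequent-tag-per-word baseline, standing in for a real tagger."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in train_sents:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]  # fallback for unseen words
    lookup = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    return lambda word: lookup.get(word, default)


def cross_validate(sents, n=10):
    """Average token accuracy over n folds; fold i holds every n-th sentence."""
    if n < 2 or len(sents) < n:
        raise ValueError("need at least two folds and one sentence per fold")
    accuracies = []
    for i in range(n):
        test = sents[i::n]
        train = [s for j, s in enumerate(sents) if j % n != i]
        tag = train_baseline(train)
        tokens = [(word, gold) for sent in test for word, gold in sent]
        correct = sum(tag(word) == gold for word, gold in tokens)
        accuracies.append(correct / len(tokens))
    return sum(accuracies) / n


sents = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
]
score = cross_validate(sents, n=2)
print(f"mean accuracy over 2 folds: {score:.3f}")
```

Each sentence is used for testing exactly once, and the reported value is the mean over the folds, which matches the averaging described above.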

3.2.2 Part-of-Speech-Taggers with scikit-learn

As a conclusion of the performance findings described in Section 2.4 we chose a multinomial naive Bayes, a support vector machine with a linear kernel and a perceptron
