3.1. MULTEXT-East Integration in NLTK 3. Implementation
a mapping from the MSD tagset to the target tagset can be added.
In this case, each MSD tag must be mapped to at most one target tag. Multiple
MSD tags may, however, map to the same target tag.
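The many-to-one constraint can be illustrated with a short sketch. The mapping entries and the helper names `map_msd_tag` and `build_mapping` are hypothetical and serve only as an illustration; they are not the table or the API of the actual corpus reader. A Python dictionary is a natural fit here, since each key holds exactly one value while several keys may share a value. Stripping a leading `#` from the tag is likewise an assumption.

```python
# Illustrative entries only: an assumption for this sketch, not the
# actual MSD-to-target table shipped with the corpus reader.
MSD_TO_TARGET = {
    "Pp3ns": "PRON",
    "Ncns": "NOUN",
    "Ncnp": "NOUN",  # several MSD tags may share one target tag
}


def map_msd_tag(msd_tag, mapping=MSD_TO_TARGET, unknown="X"):
    """Return the single target tag for an MSD tag, or `unknown`."""
    # Assumption: a leading '#' is not part of the lookup key.
    return mapping.get(msd_tag.lstrip("#"), unknown)


def build_mapping(pairs):
    """Build a mapping from (msd_tag, target_tag) pairs.

    Raises ValueError if one MSD tag is assigned two different target
    tags, which would violate the constraint described above.
    """
    mapping = {}
    for msd_tag, target in pairs:
        previous = mapping.setdefault(msd_tag, target)
        if previous != target:
            raise ValueError(
                "%s maps to both %s and %s" % (msd_tag, previous, target)
            )
    return mapping
```

Validating the pairs while the mapping is built reports a conflicting entry immediately, instead of silently letting the later entry overwrite the earlier one.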
3.1.3 The MTEDownloader
The MTEDownloader is a standalone download manager for obtaining the corpus files
when the NLTK download functionality cannot be used. It can be started either by
executing the script as a Python program (it has a main method) or by calling
MTEDownloader.download() directly. The user first chooses the installation directory;
the corpus is then downloaded from clarin.si and extracted.
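The three steps (choose a directory, download, extract) could be sketched roughly as follows. This is not the MTEDownloader code itself: the function names are hypothetical, the archive is assumed to be a zip file, and the download address is left as a parameter because the exact URL is not given here.

```python
import os
import urllib.request
import zipfile


def choose_directory(candidates, answer, custom=None):
    """Return the directory selected by an already parsed menu answer.

    The index one past the last candidate stands for a user-supplied
    path, similar to the 'custom' entry of the downloader prompt.
    """
    if answer == len(candidates):
        if not custom:
            raise ValueError("a custom path is required for this choice")
        return custom
    if not 0 <= answer < len(candidates):
        raise ValueError("invalid choice: %r" % (answer,))
    return candidates[answer]


def download(url, destination):
    """Download `url` to the file `destination` (blocking)."""
    urllib.request.urlretrieve(url, destination)
    return destination


def extract_archive(archive_path, target_dir):
    """Unpack a zip archive into `target_dir`, creating it if needed."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
    return target_dir
```

Keeping the directory choice separate from the network access makes the interactive part easy to test without a connection.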
3.1.4 Sample Usage of our Corpus Reader Implementation
The following code shows some basic examples of how the corpus reader, the
downloader and the provided utility methods can be used:
> # first import all necessary modules
> import MTEDownloader, mte
>
> # then (if not yet done) download the Multext-East corpus
> # this could also be done via the nltk downloader nltk.download()
> MTEDownloader.download()
Where should the corpus be saved? [(0, '/home/stieglma/nltk_data'), (1, '/usr/share/nltk_data'),
(2, '/usr/local/share/nltk_data'), (3, '/usr/lib/nltk_data'), (4, '/usr/local/lib/nltk_data'),
(5, 'custom')] [0]: 0
Downloaded 14800805 of 14800805 bytes (100.00%)
Download finished
Extracting files...
Done
>
> # if you do not have an nltk version where our corpus reader is already
> # integrated, you have to create it manually
> # now we open the English version of the book 1984 with our reader
> reader = mte.MTECorpusReader(root="/path/to/multext/corpus/", fileids=['oana-en.xml'])
>
> # otherwise you can just do the following
> from nltk.corpus import multext_east as reader
>
> # and then we retrieve the first word/tag tuple of the first sentence of this file
> reader.tagged_sents(fileids="oana-en.xml")[0][0]
('It', '#Pp3ns')
>
> # the tag is now in the Multext-East (MSD) format, we want it to be
> # the better known corresponding universal tag:
> reader.tagged_sents(fileids="oana-en.xml", tagset="universal")[0][0][1]
'PRON'
>
> # now we want to see something in the concordance view:
> from nltk import Text
> Text(reader.words(fileids="oana-en.xml")).concordance("brother")
Displaying 2 of 80 matches:
ollow you about when you move . Big Brother is watching you , the caption benea
se-front immediately opposite . Big Brother is watching you , the caption said