[Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?

Emad Nawfal (عماد نوفل) emadnawfal at gmail.com
Tue Mar 24 16:35:22 CET 2009


Evaluating Swahili Part of Speech Tagging. How can I write a Python script
for that?
# The information provided herein about Swahili may not be accurate
# it is just intended to illustrate the problem

Hi Tutors,
I would appreciate it if you gave me ideas about how to tackle this problem.


Assigning POS tags to words is a major step in many linguistic analyses.
POS tags give the grammatical category of words, for example:

The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

What we usually do is train a Part-of-Speech Tagger, and then test it on an
already tagged (gold standard) test set. After running the tagger, we get
something like this:

The Determiner    Determiner
man Noun    PresentVerb
who RelativePronoun    RelativePronoun
came Verb    Verb
to Preposition    Preposition
us AccusativePluralPronoun    AccusativePluralPronoun
is CopulaPresent    CopulaPresent
an Determiner    Determiner
engineer Noun    Noun

As can be seen above, the POS tagger assigned the wrong part of speech
to the word "man". Because the lines are aligned, calculating the tagger's
accuracy is easy: 8 out of 9 tags are correct (88.9%).
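
For the aligned case, the count above can be reproduced with a few lines of
Python. This is only a minimal sketch: it assumes each line holds exactly
three whitespace-separated fields (word, gold tag, predicted tag), and the
function name tagging_accuracy is my own, not part of any library.

```python
def tagging_accuracy(lines):
    """Return (correct, total) for lines of the form 'word gold_tag predicted_tag'."""
    correct = total = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        _word, gold_tag, predicted_tag = fields
        total += 1
        if gold_tag == predicted_tag:
            correct += 1
    return correct, total


# The tagger output shown above: word, gold tag, predicted tag.
sample = """\
The Determiner    Determiner
man Noun    PresentVerb
who RelativePronoun    RelativePronoun
came Verb    Verb
to Preposition    Preposition
us AccusativePluralPronoun    AccusativePluralPronoun
is CopulaPresent    CopulaPresent
an Determiner    Determiner
engineer Noun    Noun
""".splitlines()

correct, total = tagging_accuracy(sample)
print(f"{correct}/{total} = {correct / total:.1%}")  # prints "8/9 = 88.9%"
```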

Swahili is a morphologically complex language. The same sentence above is
usually written as:

theman whocametous isanengineer

This means that we should run a word segmenter before running the POS
tagger. The word segmenter of course makes mistakes, which will affect the
accuracy of the POS tagger.
We get an output like the following where the second word (sic) is
ill-segmented:

# Segmenter + POS Tagger output file
the Determiner
whocame Noun
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

Now, how can I measure the accuracy of this output file against the gold
standard file below, given that the line alignment is lost every time the
segmenter makes a mistake, which happens at a rate of 15 per 1,000 words?

# Gold Standard File
The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

Please note that the output file is usually in the range of 100,000 words.
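
One idea I can think of, sketched here only as a possibility: instead of
comparing line by line, give every token its character offsets in the
unsegmented text, and count a token as correct only when the same
(start, end, tag) triple appears in both files. A segmentation mistake then
costs only the tokens it touches, and the alignment recovers by itself
afterwards. The sketch assumes that the segmenter only moves word boundaries
and never adds or drops characters, so the sample output here merges "who"
and "came" into one token. The names tagged_spans and evaluate are
hypothetical, and I report precision, recall and F1 because the two files may
contain different numbers of tokens.

```python
def tagged_spans(lines):
    """Return ({(start, end, tag), ...}, total_length) for 'token tag' lines."""
    spans = set()
    position = 0
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) != 2:
            continue  # skip comments, blank lines and malformed lines
        token, tag = fields
        end = position + len(token)
        spans.add((position, end, tag))
        position = end
    return spans, position


def evaluate(gold_lines, output_lines):
    """Return (precision, recall, f1) of the output against the gold standard."""
    gold, gold_length = tagged_spans(gold_lines)
    output, output_length = tagged_spans(output_lines)
    if gold_length != output_length:
        # Assumption violated: the two files do not cover the same text.
        raise ValueError("gold and output cover different amounts of text")
    correct = len(gold & output)  # same boundaries AND same tag
    precision = correct / len(output) if output else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold_sample = """\
# Gold Standard File
The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun
""".splitlines()

# Illustrative output in which the segmenter failed to split "who" and "came".
output_sample = """\
# Segmenter + POS Tagger output file
the Determiner
man Noun
whocame Noun
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun
""".splitlines()

precision, recall, f1 = evaluate(gold_sample, output_sample)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Each file is read once, so this should scale to 100,000 words without
trouble. If the segmenter can also change characters, the offsets no longer
line up, and aligning the two token lists with difflib.SequenceMatcher from
the standard library might be an alternative, though I have not checked how
fast it is at that size.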

-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
--------------------------------------------------------