[Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?
Emad Nawfal (عماد نوفل)
emadnawfal at gmail.com
Tue Mar 24 16:35:22 CET 2009
# The information provided herein about Swahili may not be accurate
# it is just intended to illustrate the problem
I would appreciate it if you gave me ideas about how to tackle this problem.
Assigning POS tags to words is a major step in many linguistic analyses.
POS tags give the grammatical category of words, for example:
What we usually do is train a Part-of-Speech Tagger, and then test it on an
already tagged (gold standard) test set. After running the tagger, we get
something like this:
The Determiner Determiner
man Noun PresentVerb
who RelativePronoun RelativePronoun
came Verb Verb
to Preposition Preposition
us AccusativePluralPronoun AccusativePluralPronoun
is CopulaPresent CopulaPresent
an Determiner Determiner
engineer Noun Noun
As can be seen above, the POS tagger assigned the wrong Part of Speech
to the word "man". Because the lines are aligned word for word, the POS
tagger accuracy is easy to calculate: 8 out of 9 are correct (88.9%).
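For the aligned case, the calculation could be sketched as follows. This is only a minimal illustration: it assumes each line holds a word, then the gold tag, then the tagger's tag, separated by whitespace, and the function name tagging_accuracy is made up for this example.

```python
def tagging_accuracy(lines):
    """Return (correct, total) for lines of the form 'word gold_tag predicted_tag'.

    The column order is an assumption; swap the two tags if your file differs.
    """
    correct = 0
    total = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _word, gold_tag, predicted_tag = parts
        total += 1
        if gold_tag == predicted_tag:
            correct += 1
    return correct, total


sample = """\
The Determiner Determiner
man Noun PresentVerb
who RelativePronoun RelativePronoun
came Verb Verb
to Preposition Preposition
us AccusativePluralPronoun AccusativePluralPronoun
is CopulaPresent CopulaPresent
an Determiner Determiner
engineer Noun Noun
"""

correct, total = tagging_accuracy(sample.splitlines())
print(f"{correct} out of {total} correct ({100 * correct / total:.1f}%)")
# prints: 8 out of 9 correct (88.9%)
```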
Swahili is a morphologically complex language. The same sentence above is
usually written as:
theman whocametous isanengineer
This means that we should run a word segmenter before running the POS
tagger. The word segmenter of course makes mistakes which will affect the
accuracy of the POS tagger.
We get an output like the following where the second word (sic) is
# Segmenter + POS Tagger output file
Now, how can I measure the accuracy of this output file against the gold
standard file below given that the line alignment is lost every time the
segmenter makes a mistake, which happens at the rate of 15 per 1000 words:
# Gold Standard File
Please note that the output file is usually in the range of 100,000 words
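One possible way to recover the alignment is to align the two word sequences with difflib.SequenceMatcher from the standard library, and compare tags only where the words match. The sketch below is not a definitive method. It assumes both files have already been read into lists of (word, tag) pairs, it counts every gold word the segmenter got wrong as a tagging error, and the function name and sample data are invented for illustration.

```python
from difflib import SequenceMatcher


def aligned_accuracy(gold, predicted):
    """Return (correct, total_gold_words) after aligning on the word column.

    gold and predicted are lists of (word, tag) pairs. Gold words with no
    matching word in the predicted sequence are counted as errors.
    """
    gold_words = [word for word, _tag in gold]
    predicted_words = [word for word, _tag in predicted]
    # autojunk=False turns off the popularity heuristic, which would otherwise
    # treat very frequent words as junk in long sequences.
    matcher = SequenceMatcher(None, gold_words, predicted_words, autojunk=False)
    correct = 0
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            gold_tag = gold[block.a + offset][1]
            predicted_tag = predicted[block.b + offset][1]
            if gold_tag == predicted_tag:
                correct += 1
    return correct, len(gold)


# Hypothetical data: the segmenter failed to split "cameto".
gold = [
    ("The", "Determiner"),
    ("man", "Noun"),
    ("who", "RelativePronoun"),
    ("came", "Verb"),
    ("to", "Preposition"),
    ("us", "AccusativePluralPronoun"),
]
output = [
    ("The", "Determiner"),
    ("man", "Noun"),
    ("who", "RelativePronoun"),
    ("cameto", "Verb"),
    ("us", "AccusativePluralPronoun"),
]

correct, total = aligned_accuracy(gold, output)
print(f"{correct} out of {total} correct ({100 * correct / total:.1f}%)")
```

With the sample data, four of the six gold words are matched and tagged correctly. SequenceMatcher can be slow on very long sequences, so for a file of this size it may be more practical to align one sentence at a time, assuming sentence boundaries are preserved in both files.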
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
"No victim has ever been more repressed and alienated than the truth"
Emad Soliman Nawfal
Indiana University, Bloomington