[Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?

Emad Nawfal (عماد نوفل) emadnawfal at gmail.com
Tue Mar 24 18:21:07 CET 2009


2009/3/24 Emad Nawfal (عماد نوفل) <emadnawfal at gmail.com>

>
>
> 2009/3/24 Alan Gauld <alan.gauld at btinternet.com>
>
>> Hi,
>> That was an interesting post, but I'm not sure what you want help with.
>> Is it the word splitting?
>> Is it writing the POS tagger?
>> Is it comparing the POS tagger to the standard?
>> Or all of these?
>>
>> Alan G.
>>
>> "Emad Nawfal (عماد نوفل)" <emadnawfal at gmail.com> wrote in message
>> news:652641e90903240835o610d013dsd6a81f4675c47c67 at mail.gmail.com...
>>
>> Evaluating Swahili Part of Speech Tagging. How can I write a Python script
>> for that?
>> # The information provided herein about Swahili may not be accurate
>> # it is just intended to illustrate the problem
>>
>> Hi Tutors,
>> I would appreciate it if you gave me ideas about how to tackle this
>> problem.
>>
>>
>> Assigning POS tags to words is a major step in many linguistic analyses.
>> POS tags give the grammatical category of words, for example:
>>
>> The Determiner
>> man Noun
>> who RelativePronoun
>> came Verb
>> to Preposition
>> us AccusativePluralPronoun
>> is CopulaPresent
>> an Determiner
>> engineer Noun
>>
>> What we usually do is train a Part-of-Speech Tagger, and then test it on an
>> already tagged (gold standard) test set. After running the tagger, we get
>> something like this:
>>
>> The Determiner    Determiner
>> man Noun    PresentVerb
>> who RelativePronoun    RelativePronoun
>> came Verb    Verb
>> to Preposition    Preposition
>> us AccusativePluralPronoun    AccusativePluralPronoun
>> is CopulaPresent    CopulaPresent
>> an Determiner    Determiner
>> engineer Noun    Noun
>>
>> As can be seen above, the POS tagger assigned the wrong Part of Speech to
>> the word "man". Because the lines are aligned, the POS tagger accuracy is
>> easy to calculate: 8 out of 9 tags are correct (88.9%).
>>
>> Swahili is a morphologically complex language. The same sentence above is
>> usually written as:
>>
>> theman whocametous isanengineer
>>
>> This means that we should run a word segmenter before running the POS
>> tagger. The word segmenter of course makes mistakes which will affect the
>> accuracy of the POS tagger.
>> We get an output like the following where the second word (sic) is
>> ill-segmented:
>>
>> # Segmenter + POS Tagger output file
>> the Determiner
>> whocame Noun
>> to Preposition
>> us AccusativePluralPronoun
>> is CopulaPresent
>> an Determiner
>> engineer Noun
>>
>> Now, how can I measure the accuracy of this output file against the gold
>> standard file below given that the line alignment is lost every time the
>> segmenter makes a mistake, which happens at the rate of 15 per 1000 words:
>>
>> # Gold Standard File
>> The Determiner
>> man Noun
>> who RelativePronoun
>> came Verb
>> to Preposition
>> us AccusativePluralPronoun
>> is CopulaPresent
>> an Determiner
>> engineer Noun
>>
>> Please note that the output file is usually in the range of 100,000 words.
>>
> Hi Alan,
> Comparing the POS tagger output to the standard is what I want. I can do
> it if I combine the segments into words and the segment tags into complex
> tags, which is possible.
> BUT I'm wondering whether this can be done just using the segments.
>
>>
Hi Alan,
Comparing the POS tagger output to the standard is what I want. I can do it
if I combine the segments into words and the segment tags into complex tags,
which is possible.
BUT I'm wondering whether this can be done just using the segments and their
respective simple tags.
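
One way this might be done with just the segments is sketched below. It is an
untested idea, not a finished evaluator. Each segment is mapped to its character
offsets in the unsegmented text, so a segmentation error only affects the
segments it touches and the rest of the file stays comparable. A predicted
segment counts as correct only if the gold file has the same offsets with the
same tag. Precision, recall and F1 then replace plain accuracy, because the two
files no longer have the same number of lines. The function names (read_pairs,
to_spans, evaluate) are made up for illustration, and the sample data is adapted
from the example above so that both files cover the same characters.

```python
# Sketch only: read_pairs, to_spans and evaluate are hypothetical helper names.
# Assumption: joining the segments of either file yields the same text.

def read_pairs(lines):
    """Parse 'segment TAG' lines, skipping blank lines and '#' comments."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        segment, tag = line.split()
        pairs.append((segment, tag))
    return pairs


def to_spans(pairs):
    """Turn (segment, tag) pairs into a set of (start, end, tag) triples."""
    spans = set()
    position = 0
    for segment, tag in pairs:
        spans.add((position, position + len(segment), tag))
        position += len(segment)
    return spans


def evaluate(gold_pairs, predicted_pairs):
    """Return (precision, recall, f1) over character spans with tags."""
    gold_text = "".join(seg for seg, _ in gold_pairs).lower()
    predicted_text = "".join(seg for seg, _ in predicted_pairs).lower()
    if not gold_text or gold_text != predicted_text:
        raise ValueError("files must be non-empty and cover the same text")
    gold = to_spans(gold_pairs)
    predicted = to_spans(predicted_pairs)
    correct = len(gold & predicted)  # same offsets and same tag
    precision = correct / len(predicted)
    recall = correct / len(gold)
    f1 = 2 * correct / (len(predicted) + len(gold))
    return precision, recall, f1


# Illustrative data adapted from the example: "who" and "came" are merged.
GOLD = """\
# Gold Standard File
The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun
"""

PREDICTED = """\
# Segmenter + POS Tagger output file
the Determiner
man Noun
whocame Noun
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun
"""

gold_pairs = read_pairs(GOLD.splitlines())
predicted_pairs = read_pairs(PREDICTED.splitlines())
precision, recall, f1 = evaluate(gold_pairs, predicted_pairs)
print("precision=%.3f recall=%.3f f1=%.3f" % (precision, recall, f1))
```

With this data, 7 segments match out of 8 predicted and 9 gold ones, so the one
bad segment costs only the spans it covers. For a real 100,000-word file the
lines would be read from the two files instead of from strings, and the check
that both files spell out the same text would need to hold.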

>
>> --
>> لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
>> الغزالي
>> "No victim has ever been more repressed and alienated than the truth"
>>
>> Emad Soliman Nawfal
>> Indiana University, Bloomington
>> --------------------------------------------------------
>>  _______________________________________________
>>> Tutor maillist  -  Tutor at python.org
>>> http://mail.python.org/mailman/listinfo/tutor
>



-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
--------------------------------------------------------