[Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?
Emad Nawfal (عماد نوفل)
emadnawfal at gmail.com
Tue Mar 24 18:19:14 CET 2009
2009/3/24 Alan Gauld <alan.gauld at btinternet.com>
> Hi,
> That was an interesting post, but I'm not sure what you want help with.
> Is it the word splitting?
> Is it writing the POS tagger?
> Is it comparing the POS tagger to the standard?
> Or all of these?
>
> Alan G.
>
> "Emad Nawfal (عماد نوفل)" <emadnawfal at gmail.com> wrote in message
> news:652641e90903240835o610d013dsd6a81f4675c47c67 at mail.gmail.com...
>
> Evaluating Swahili Part of Speech Tagging. How can I write a Python script
> for that?
> # The information provided herein about Swahili may not be accurate
> # it is just intended to illustrate the problem
>
> Hi Tutors,
> I would appreciate it if you gave me ideas about how to tackle this
> problem.
>
>
> Assigning POS tags to words is a major step in many linguistic analyses.
> POS tags give the grammatical category of words, for example:
>
> The Determiner
> man Noun
> who RelativePronoun
> came Verb
> to Preposition
> us AccusativePluralPronoun
> is CopulaPresent
> an Determiner
> engineer Noun
>
> What we usually do is train a Part-of-Speech Tagger, and then test it on an
> already tagged (gold standard) test set. After running the tagger, we get
> something like this:
>
> The Determiner Determiner
> man Noun PresentVerb
> who RelativePronoun RelativePronoun
> came Verb Verb
> to Preposition Preposition
> us AccusativePluralPronoun AccusativePluralPronoun
> is CopulaPresent CopulaPresent
> an Determiner Determiner
> engineer Noun Noun
>
> As can be seen above, the POS tagger assigned the wrong part of speech
> to the word "man". This makes it easy to calculate the POS tagger's
> accuracy: 8 out of 9 tags are correct (88.9%).
>
> Swahili is a morphologically complex language. The same sentence above is
> usually written as:
>
> theman whocametous isanengineer
>
> This means that we should run a word segmenter before running the POS
> tagger. The word segmenter of course makes mistakes, and these affect the
> accuracy of the POS tagger.
> We get an output like the following where the second word (sic) is
> ill-segmented:
>
> # Segmenter + POS Tagger output file
> the Determiner
> whocame Noun
> to Preposition
> us AccusativePluralPronoun
> is CopulaPresent
> an Determiner
> engineer Noun
>
> Now, how can I measure the accuracy of this output file against the gold
> standard file below, given that the line alignment is lost every time the
> segmenter makes a mistake, which happens at the rate of 15 per 1000 words?
>
> # Gold Standard File
> The Determiner
> man Noun
> who RelativePronoun
> came Verb
> to Preposition
> us AccusativePluralPronoun
> is CopulaPresent
> an Determiner
> engineer Noun
>
> Please note that the output file is usually in the range of 100,000 words.
>
Hi Alan,
Comparing the POS tagger output to the standard is what I want. I can do it
if I combine the segments into words and the segment tags into complex tags,
which is possible.
But I'm wondering whether this can be done using just the segments.
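
One possible way to score the segments directly, offered only as a sketch:
turn every segment into a character span (start offset, end offset) over the
unsegmented text, and count a predicted segment as correct only when a gold
segment has the same span and the same tag. Line alignment then no longer
matters. This assumes both files contain exactly the same characters (same
case, nothing dropped) and one "segment tag" pair per line. The names parse,
spans and evaluate, and the sample data, are hypothetical and only adapted
from the example above. Because the two files can have different numbers of
segments, the result is precision, recall and F1 rather than a single
accuracy figure.

```python
def parse(lines):
    """Read 'segment tag' lines, skipping blank lines and '#' comments."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        segment, tag = line.split()
        pairs.append((segment, tag))
    return pairs


def spans(pairs):
    """Map (segment, tag) pairs to a set of (start, end, tag) character spans."""
    result = set()
    position = 0
    for segment, tag in pairs:
        result.add((position, position + len(segment), tag))
        position += len(segment)
    return result


def evaluate(gold, pred):
    """Return (precision, recall, f1) for segments correct in span and tag."""
    # Assumption: the segmenter only moves boundaries, it never changes text.
    if "".join(s for s, _ in gold) != "".join(s for s, _ in pred):
        raise ValueError("gold and predicted files cover different text")
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * correct / (len(gold_spans) + len(pred_spans)) if correct else 0.0
    return precision, recall, f1


# Hypothetical sample data: the segmenter failed to split "who" from "came".
gold = parse("""
# Gold Standard File
the Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun
""".splitlines())

pred = parse("""
# Segmenter + POS Tagger output file
the Determiner
man Noun
whocame Noun
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun
""".splitlines())

precision, recall, f1 = evaluate(gold, pred)
print("precision=%.3f recall=%.3f F1=%.3f" % (precision, recall, f1))
```

On this sample, 7 of the 8 predicted segments match a gold segment in both
span and tag, and 7 of the 9 gold segments are recovered, so the script
prints precision=0.875 recall=0.778 F1=0.824. Each file is read once and the
comparison is a set intersection, so 100,000 words should not be a problem.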
>
> --
> لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
> الغزالي
> "No victim has ever been more repressed and alienated than the truth"
>
> Emad Soliman Nawfal
> Indiana University, Bloomington
> --------------------------------------------------------