[Tutor] [part-of-speech tagging / montytagger / penn treebank]

Sat Dec 7 18:05:02 2002

> For example: could the split command [split(s[, sep[, maxsplit]])] be
> modified to accept more than one 'sep' argument? That odd suggestion
> reflects my goal (generating an index for my log files): I don't see any
> simple software method of distinguishing nouns & adjectives in my logs -
> but splitting on the basis of connectives & articles (to, the, in, etc.)
> might leave the noun - adjective relationship intact (more meaningful
> index entries I hope).

Hmmm... I just did a quick check, and ran into the following:

    http://web.media.mit.edu/~hugo/research/montytagger.html

In Natural Language Processing (NLP), a common task is to take a sentence
and attach a part-of-speech tag to each word.

Here's a brief run through the program:

###
dyoo@coffeetable:~/montytagger-1.0/python$ python MontyTagger.py

***** INITIALIZING ******
Lexicon OK!
LexicalRuleParser OK!
ContextualRuleParser OK!
*************************

MontyTagger v1.0
--send bug reports to hugo@media.mit.edu--

> This is a test of the emergency broadcast system

This/DT is/VBZ a/DT test/NN of/IN the/DT emergency/NN broadcast/NN
system/NN
-- monty took 0.02 seconds. --

> In a hole, there lived a hobbit.

In/IN a/DT hole/NN ,/, there/EX lived/VBD a/DT hobbit/NN ./.
-- monty took 0.19 seconds.
###

Wow!  This is pretty neat!

This program takes a sentence, and tries its best to attach part-of-speech
tags to each word.  Here are the meanings of some of those tags:

    DT  --> determiner
    IN  --> preposition or subordinating conjunction
    VBZ --> verb, 3rd person singular present
    NN  --> noun, singular or mass
    EX  --> existential there
    VBD --> verb, past tense
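
Since the goal is index entries that keep nouns and adjectives together,
here is a rough sketch of how the tagged output might be post-processed.
It assumes the tagger's output has been captured as a string in the
word/TAG format shown above; the function names parse_tagged() and
noun_phrases() are made up for this example and are not part of
MontyTagger.  It also relies on JJ being the Penn Treebank tag for
adjectives and on all the noun tags starting with NN.  The code is written
for Python 3.

```python
def parse_tagged(line):
    """Turn 'This/DT is/VBZ' into [('This', 'DT'), ('is', 'VBZ')]."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so './.' gives ('.', '.').
        word, sep, tag = token.rpartition('/')
        if sep:                      # skip anything that has no tag at all
            pairs.append((word, tag))
    return pairs


def noun_phrases(pairs):
    """Collect runs of adjective/noun words that contain at least one noun."""
    phrases, run, has_noun = [], [], False
    # The ('', 'END') sentinel forces the final run to be flushed.
    for word, tag in pairs + [('', 'END')]:
        if tag == 'JJ' or tag.startswith('NN'):
            run.append(word)
            has_noun = has_noun or tag.startswith('NN')
        else:
            if has_noun:
                phrases.append(' '.join(run))
            run, has_noun = [], False
    return phrases


tagged = ("This/DT is/VBZ a/DT test/NN of/IN the/DT emergency/NN "
          "broadcast/NN system/NN")
print(noun_phrases(parse_tagged(tagged)))
```

For the first sentence in the session above, this picks out "test" and
"emergency broadcast system" as candidate index entries.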

I do not know a single one of these tags yet.  *grin* But there is a good
list of them in the Penn Treebank Project:

    http://www.cis.upenn.edu/~treebank/
    ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz
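
Coming back to the original question about split(): the string split()
only takes one separator, but the re module's split() takes a pattern, so
several separators can be combined with '|'.  Below is a small sketch of
that idea.  The STOP_WORDS list is only a guess (it starts from the "to,
the, in" in the question) and would need tuning for real log files.

```python
import re

# Hypothetical stop-word list; extend it to suit the log files.
STOP_WORDS = ['to', 'the', 'in', 'a', 'an', 'of', 'and', 'is']

# The \b anchors keep 'in' from matching inside a word like 'index'.
# Common punctuation marks are treated as separators too.
SEPARATOR = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, STOP_WORDS)) + r')\b|[.,;:!?]',
    re.IGNORECASE)


def split_on_stop_words(text):
    """Split text wherever a stop word or punctuation mark appears."""
    chunks = SEPARATOR.split(text)
    # Normalize whitespace and drop the empty pieces left between separators.
    return [' '.join(chunk.split()) for chunk in chunks if chunk.strip()]


print(split_on_stop_words("In a hole, there lived a hobbit."))
```

This is much cruder than a tagger, since it knows nothing about grammar,
but it needs nothing outside the standard library.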

Good luck to you!