[Tutor] Learning natural language processing and Python? [learning about "stemming" words]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue, 17 Sep 2002 22:52:40 -0700 (PDT)


Hi everyone,

I'm starting to learn "Natural Language Processing", which tries to use a
computer to tease meaning out of natural language.  I was wondering if
anyone was interested in this sort of thing?

I've started to find some resources out there.  One big one is the "NLTK"
toolkit, which acts as an umbrella for a lot of NLP software:

    http://nltk.sourceforge.net/


One task that appears to be fairly well understood is getting the "stem"
of a word.

    http://www.comp.lancs.ac.uk/computing/research/stemming/general/index.htm


Stemming is used to toss out most of the variation in word endings, to get
at the very "stem" of a word.  There's an algorithm called the "Porter
Stemming Algorithm" that does interesting things to English words:

    http://www.tartarus.org/~martin/PorterStemmer/index.html

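It tries to remove the common morphological and inflectional endings of
English words, so it's similar to the root-word function that's in WordNet.
It works by applying suffix-stripping rules rather than by looking words up
in a dictionary, though, so the stem it returns isn't always a real word.
The page above gives a Python implementation of the algorithm!  Here's an
example of what it does: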

###
>>> import porter
>>> p = porter.PorterStemmer()
>>> def stem(word): return p.stem(word, 0, len(word)-1)
...
>>> stem('superfly')
'superfli'
>>> stem('learning')
'learn'
>>> stem('pythonic')
'python'
>>> stem('programmer')
'programm'
>>> stem('elbereth')
'elbereth'
###

So I think it's something of a super plural-remover: it strips off not just
plurals, but endings like '-ing', '-ic', and '-er' too.  I'm glad to see
that 'Elbereth' came out untainted.
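
To get a feel for the "plural remover" part, here's a minimal sketch of
just the first group of rules (step 1a), as I understand it from Porter's
description of the algorithm.  The name step_1a is my own and is not part
of the porter module, and the real stemmer goes on to apply several more
steps after this one, so treat this as an illustration only:

```python
def step_1a(word):
    """Apply only the plural-stripping rules of Porter's step 1a.

    This is an illustrative sketch, not the full Porter stemmer;
    step_1a is a made-up name, not something the porter module provides.
    """
    # Rules are tried longest suffix first, and the first match wins.
    rules = [
        ("sses", "ss"),   # caresses -> caress
        ("ies", "i"),     # ponies   -> poni
        ("ss", "ss"),     # caress   -> caress (left alone)
        ("s", ""),        # cats     -> cat
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    # No rule applies, so the word passes through unchanged.
    return word


for w in ["caresses", "ponies", "caress", "cats", "elbereth"]:
    print(w, "->", step_1a(w))
```

The "ss" rule looks like it does nothing, but it's there to stop the plain
"s" rule from chopping words like 'caress' down to 'cares'.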

Anyway, sorry for posting about a random topic; I'm hoping that someone
else will be interested enough to help me learn this stuff... *grin*