Pythonic Porter stemmers (Was: Re: Word frequencies -- Python or Perl for performance?)
sjmachin at lexicon.net
Sun Mar 17 22:13:39 CET 2002
"Van Gale" <cgale1 at cox.net> wrote in message news:<Bn%k8.4574$J54.497662 at news1.west.cox.net>...
> W.B. Frakes and R. Baeza-Yates. 1992. "Information Retrieval: Data
> Structures and Algorithms," Prentice-Hall, describes the Porter algorithm as
> well as a few other stemming algorithms. The reference for the algorithm
> Porter, M. F. 1980. "An Algorithm for Suffix Stripping." Program, 14(3),
Martin Porter has a home page for his stemming algorithm.
Read all the way through to the last line.
> Frakes mentions nothing about a patent on the Porter algorithm, and I'd be
> surprised if there were since it was pretty rare back in the "good old
Check out Porter's personal home-page. Given the comment about his
family buying him a comb after he first put his photo on the web, I
get the impression not of patent-royalty-rich but of
> I worked on a huge indexing project for a legal publisher, and we developed
> our own stemming algorithm. It was much simpler than Porter, basically
> being the most obvious conflations (like remove "s" and "ies") which covered
> the vast majority of English words, and then a list of "exceptions". Of
> course we had the advantage of 50+ editorial staff capable of proofreading
> the index finding new exceptions, but I still think that's a better way to
> go than trying to stem completely by algorithm. As hard as the Porter
> algorithm tries it still makes a *lot* of mistakes.
And of course once you had an exception dictionary, for a performance
boost you'd consider dumping into it the 10^n most frequent words and
their "correct" stems, whether or not the stemming algorithm gave the
"correct" result -- wouldn't you?