[Spambayes] Another optimization

Greg Ward gward@python.net
Wed, 18 Sep 2002 13:25:07 -0400


On 18 September 2002, T. Alexander Popiel said:
> * Graham was very specific in describing his tokenizer...
>   and you folks seem to have ignored that description.
>   Instead, you're using split-on-whitespace augmented by
>   a few handcrafted hacks for URLs, addresses, and the like.
>   This puzzles me, since I seem to get better results using
>   the tokenization that Graham suggested.

If you haven't read the archive for this list yet, do so!  If you're
really keen, check the python-dev archive for the week or so before this
list was created -- there was a fair amount of discussion there.

Anyway, in the early days of this project, Tim Peters experimented a
lot with various tokenization schemes.  The current scheme is the one
that did the best on his corpus.  I suspect that suggested tweaks to the
tokenization algorithm will only be entertained if you back them up with
solid experimental evidence that they improve things.
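For anyone who hasn't read Graham's "A Plan for Spam", here's a rough
sketch of the difference Alex is pointing at.  This is just my own
illustration, NOT the actual spambayes tokenizer; the second function
approximates Graham's rule (alphanumerics, dashes, apostrophes, and
dollar signs form tokens, everything else separates them, and all-digit
tokens are dropped):

    import re

    def split_tokens(text):
        # Naive split-on-whitespace -- roughly the starting point of
        # the current scheme, before the URL/address special-casing.
        return text.split()

    # Token characters per Graham: alphanumerics, dash, apostrophe, '$'.
    _graham_re = re.compile(r"[A-Za-z0-9$'-]+")

    def graham_tokens(text):
        # Everything outside the character class is a separator;
        # purely numeric tokens are discarded.
        return [tok for tok in _graham_re.findall(text)
                if not tok.isdigit()]

    sample = "Buy now!!! Visit http://example.com/free-offer ($9.99)"
    print(split_tokens(sample))
    # ['Buy', 'now!!!', 'Visit', 'http://example.com/free-offer', '($9.99)']
    print(graham_tokens(sample))
    # ['Buy', 'now', 'Visit', 'http', 'example', 'com', 'free-offer', '$9']

Note how the two schemes disagree on punctuation-heavy tokens and URLs,
which is exactly where the handcrafted hacks in the current tokenizer
come in.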

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
I'm a lumberjack and I'm OK / I sleep all night and I work all day