[Spambayes] Another optimization
Greg Ward
gward@python.net
Wed, 18 Sep 2002 13:25:07 -0400
On 18 September 2002, T. Alexander Popiel said:
> * Graham was very specific in describing his tokenizer...
> and you folks seem to have ignored that description.
> Instead, you're using split-on-whitespace augmented by
> a few handcrafted hacks for URLs, addresses, and the like.
> This puzzles me, since I seem to get better results using
> the tokenization that Graham suggested.
If you haven't read the archive for this list yet, do so! If you're
really keen, check the python-dev archive for the week or so before this
list was created -- there was a fair amount of discussion there.
Anyway, in the early days of this project, Tim Peters experimented a
lot with various tokenization schemes. The current scheme is the one
that did the best on his corpus. I suspect that suggested tweaks to the
tokenization algorithm will only be entertained if you back them up with
solid experimental evidence that they improve things.
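For anyone new to the list, the "split-on-whitespace plus handcrafted
hacks" approach mentioned above could look something like this sketch.
This is illustrative only, not the actual spambayes tokenizer; the
`url:` token prefix and the hostname-splitting hack are assumptions
made up for the example.

```python
import re

def tokenize(text):
    """Illustrative split-on-whitespace tokenizer with a special-case
    hack for URLs (NOT the real spambayes tokenizer)."""
    tokens = []
    for word in text.split():
        # Hypothetical URL hack: break the hostname into pieces and
        # emit each piece as a synthetic 'url:' token.
        m = re.match(r'https?://([^/\s]+)', word)
        if m:
            tokens.extend('url:' + part for part in m.group(1).split('.'))
        else:
            tokens.append(word.lower())
    return tokens
```

For example, `tokenize("Visit http://www.python.org now")` yields
`['visit', 'url:www', 'url:python', 'url:org', 'now']` -- the point
being that a plain whitespace split would treat the whole URL as one
opaque token, while the hack exposes the hostname components as
separate evidence for the classifier.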
Greg
--
Greg Ward <gward@python.net> http://www.gerg.ca/
I'm a lumberjack and I'm OK / I sleep all night and I work all day