[Spambayes] Another optimization

T. Alexander Popiel popiel@wolfskeep.com
Wed, 18 Sep 2002 11:32:02 -0700


In message:  <20020918172507.GA16930@cthulhu.gerg.ca>
             Greg Ward <gward@python.net> writes:
>
>If you haven't read the archive for this list yet, do so!

Skimmed it before posting, but didn't see reference to the basic
split-on-whitespace choice.  I did not think to check on python-dev
for prior conversation, since I had no way of knowing it would be
there (as opposed to some other unrelated forum). ;-)

>If you're really keen, check the python-dev archive for the week or
>so before this list was created -- there was a fair amount of discussion
>there.

I'll check it out.

>Anyways, in the early days of this project, Tim Peters experimented a
>lot with various tokenization schemes.  The current scheme is the one
>that did the best on his corpus.  I suspect that suggested tweaks to the
>tokenization algorithm will only be entertained if you back them up with
>solid experimental evidence that they improve things.

Yeah... I could.  I've got a decent-sized (1-month) corpus of data
without the bias of trashing much of my non-spam stream, plus a 5-year
corpus with that bias (I was keeping all spam, various mailing lists,
and all mail from various 'important' people, but trashing a lot of
the incidentals).  But since I'm not using the Python implementation,
the effort wouldn't buy me much, and I don't really care that much.
It's also good to have variant implementation schemes out in the
wild... it gives the spammers more contradictory targets to fail to
tune to properly. ;-)

Unfortunately, my corpora have enough private data in them that
I'm not really willing to give them to the world.  Sorry.

I was just surprised to see such a basic choice made without any
apparent thought or discussion (unlike much of the token cracking,
which is backed up with solid data in the source code comments).
Also, on another mailing list I'm on, one of the other people who
implemented the filter did it with split-on-whitespace instead of
Graham's tokenizer, and was getting results an order of magnitude
worse than mine... but we never compared our corpora in any rigorous
way, so the difference could have come from variations in the
classifiability of our particular corpora.

- Alex