[Spambayes] A couple of small tokenizer experiments.

Anthony Baxter anthony@interlink.com.au
Wed Nov 13 13:24:57 2002

>>> Tim Peters wrote
> Well, regardless of reason, the best cost got worse, and it did on my c.l.py
> test too, but also by a trivial amount.  I fiddled the tokenization of this
> field until it did better again, so please make sure I didn't screw you too
> badly <wink>.

Seems fine.

In this case, the trivial amount worse was kinda necessary (imho) to 
allow us to get a whole lot of other cheap wins.

> Something that helped:  it now generates log-count "no real name" metatokens
> too for address headers without real-name parts.
>         'from:no real name:2**0' 0.933186

I'll give this a go, see how it helps me.
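
For anyone following along, here is a minimal sketch of what a log-count
"no real name" metatoken could look like. It is not the actual tokenizer
code: the function name no_real_name_tokens and its arguments are
hypothetical, and I am assuming the count is rounded to the nearest power
of two, which is what the 2**0 in Tim's example token suggests.

```python
import math
from email.utils import getaddresses


def no_real_name_tokens(field, header_values):
    """Yield one log-count metatoken if any address lacks a real-name part.

    field         -- lowercase header name, e.g. "from" (hypothetical argument)
    header_values -- list of raw header strings for that field
    """
    noname_count = 0
    for name, addr in getaddresses(header_values):
        # Count addresses that parsed to an address but no display name.
        if addr and not name:
            noname_count += 1
    if noname_count:
        # Bucket the count by powers of two so 1, 2, 4, ... map to
        # 2**0, 2**1, 2**2, ... (assumed bucketing scheme).
        bucket = round(math.log2(noname_count))
        yield "%s:no real name:2**%d" % (field, bucket)
```

A single bare address in a From header would give the same token as in
Tim's example, while "Real Name <addr>" style headers give no token at all.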

> BTW, you should make sure the seeds aren't close together.  For example,
> using seed 123 one time, and 124 the next, will give a lot of msg overlap.

I think I tend to use 12345 and 23456 - should be far enough apart.
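
A quick way to check this rather than guess is to measure how many
messages two seeds have in common. The sketch below assumes the test
messages are picked by sampling indices with a seeded generator; the
names pick_subset and overlap_fraction are made up for illustration and
are not the real test driver's. How much overlap nearby seeds actually
give depends on the random number generator in use, so this only shows
how to measure it.

```python
import random


def pick_subset(seed, population_size, n):
    """Return the set of message indices chosen for a given seed."""
    rng = random.Random(seed)
    return set(rng.sample(range(population_size), n))


def overlap_fraction(seed_a, seed_b, population_size=10000, n=500):
    """Fraction of the n chosen messages that both seeds have in common."""
    a = pick_subset(seed_a, population_size, n)
    b = pick_subset(seed_b, population_size, n)
    return len(a & b) / n
```

Comparing overlap_fraction(123, 124) against overlap_fraction(12345, 23456)
on the real corpus sizes would show whether a given pair of seeds is safe.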

> > toemail:python.org and toemail:zope.org both show up in
> > my 'best discriminators' list as _very_ strong ham clues
> > (not surprising, given the mailing lists I'm on).
> Well, that's also going to make the spam that slips thru that much harder to
> catch.  Of course, after Greg deploys this system, there won't be any more
> spam slipping thru <wink>.

That's the theory, yes. Of course, if Greg doesn't deploy this, then all
the sophisticated new techniques that spammers will be forced to try will
leave poor old SpamAssassin terribly confused, and the amount of spam
getting through it will settle the question for us :)


