[Spambayes] degeneration

Tue Jan 21 21:47:12 EST 2003

[Neale Pickett]
> So one of the more interesting things I left the spam conference with
> was Paul Graham's notion of "degeneration".  The idea is simple.  If you
> tokenize "FREE!!!!", but that's not in your wordlist, try the following
> until you get a match:
>
>   FREE!!!!
>   Free!!!!
>   free!!!!
>   FREE!!!
>   Free!!!
>   free!!!
>   FREE!!
>   Free!!
>   free!!
>   FREE!
>   Free!
>   free!
>   FREE
>   Free
>   free

We fold case, so it's easier for us: just 5 possibilities (free!!!! down
through free).
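
For concreteness, here's a minimal sketch of the fallback order described in
the quoted list, with and without case folding. The names degenerations() and
lookup() are made up for illustration (nothing like them is in the spambayes
code), and it assumes the only punctuation being stripped is trailing "!".

```python
def degenerations(token, fold_case=True):
    """Yield candidate spellings of token, most specific first.

    Hypothetical helper: strips one trailing "!" per step, and at each
    step tries either the case-folded form only (fold_case=True) or the
    original, Capitalized, and lowercase forms (fold_case=False).
    """
    core = token.rstrip("!")
    if not core:
        # Token is nothing but "!" characters; don't degenerate to "".
        yield token
        return
    bangs = len(token) - len(core)
    for n in range(bangs, -1, -1):
        suffix = "!" * n
        if fold_case:
            yield core.lower() + suffix
        else:
            seen = []
            for variant in (core, core.capitalize(), core.lower()):
                if variant not in seen:  # skip duplicates, e.g. "free"
                    seen.append(variant)
                    yield variant + suffix


def lookup(token, wordlist, fold_case=True):
    """Return the first degenerate form found in wordlist, else None."""
    for candidate in degenerations(token, fold_case):
        if candidate in wordlist:
            return candidate
    return None
```

With fold_case=True, "FREE!!!!" produces the 5 candidates; with
fold_case=False it produces the 15 in the quoted list, in the same order.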

> He claims this helps a lot.

Ya, but he's still artificially boosting ham counts by a factor of 2 -- it's
small wonder then that some other gimmick is needed to counteract the bias.

> I'm currently in the midst of getting hammiefilter to integrate more
> cleanly with Gnus and Mutt, and merging mboxtrain and hammiebulk.  But
> this should be relatively easy to implement and test.  Any takers?

It wouldn't be hard to implement, and I agree it's interesting.  So far as
testing goes, I don't have any test data that *can* show an improvement
anymore, so I lost interest in tweaking the algorithms.  Do you have test
sets that could show improvements?  If so, you can eyeball the mistakes and
usually make a good guess as to whether a specific new gimmick would help
them.  What you can't usually guess is whether the gimmick would hurt the
correctly classified msgs more than it helps the mistakes.
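
One way to measure that instead of guessing is to score the same test set
with and without the gimmick and count the changes in both directions. A
rough sketch, assuming you already have parallel lists of true labels and of
each run's decisions (compare_runs() is a made-up name, not part of the
spambayes test drivers):

```python
def compare_runs(labels, before, after):
    """Count msgs a change fixed and msgs it broke.

    labels, before, and after are parallel sequences: the true class of
    each msg, the decision without the gimmick, and the decision with it.
    """
    fixed = broke = 0
    for truth, old, new in zip(labels, before, after):
        if old != truth and new == truth:
            fixed += 1   # was a mistake, now correct
        elif old == truth and new != truth:
            broke += 1   # was correct, now a mistake
    return fixed, broke
```

A gimmick is only worth keeping if fixed reliably beats broke across several
test sets.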