[Spambayes] degeneration

Tim Peters tim_one at email.msn.com
Tue Jan 21 21:47:12 EST 2003


[Neale Pickett]
> So one of the more interesting things I left the spam conference with
> was Paul Graham's notion of "degeneration".  The idea is simple.  If you
> tokenize "FREE!!!!", but that's not in your wordlist, try the following
> until you get a match:
>
>   FREE!!!!
>   Free!!!!
>   free!!!!
>   FREE!!!
>   Free!!!
>   free!!!
>   FREE!!
>   Free!!
>   free!!
>   FREE!
>   Free!
>   free!
>   FREE
>   Free
>   free

We fold case, so it's easier for us: just 5 possibilities ("free!!!!" down
through "free").
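
A minimal sketch of the fallback order quoted above, assuming that only
trailing "!" characters get stripped and that the wordlist is a plain set.
The names degenerations, lookup and fold_case are hypothetical; this is not
Spambayes code, and Graham's actual scheme may differ in detail.

```python
def degenerations(token, fold_case=False):
    """Yield fallback forms of token, most specific first.

    Assumption: the only punctuation handled is a run of trailing '!'.
    """
    core = token.rstrip("!")
    bangs = len(token) - len(core)
    seen = set()
    # Drop one trailing '!' at a time, down to none.
    for n in range(bangs, -1, -1):
        suffix = "!" * n
        if fold_case:
            # Case-folding tokenizer: one form per punctuation level.
            forms = [core.lower()]
        else:
            # As given, then Capitalized, then lowercase.
            forms = [core, core.capitalize(), core.lower()]
        for form in forms:
            candidate = form + suffix
            if candidate not in seen:  # skip repeats, e.g. for "free!"
                seen.add(candidate)
                yield candidate


def lookup(token, wordlist, fold_case=False):
    """Return the first degenerate form found in wordlist, else None."""
    for candidate in degenerations(token, fold_case):
        if candidate in wordlist:
            return candidate
    return None
```

With fold_case left off, "FREE!!!!" produces the 15 forms listed above; with
it on, only the 5 lowercase forms remain.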

> He claims this helps a lot.

Ya, but he's still artificially boosting ham counts by a factor of 2 -- it's
small wonder then that some other gimmick is needed to counteract the bias.

> I'm currently in the midst of getting hammiefilter to integrate more
> cleanly with Gnus and Mutt, and merging mboxtrain and hammiebulk.  But
> this should be relatively easy to implement and test.  Any takers?

It wouldn't be hard to implement, and I agree it's interesting.  So far as
testing goes, I don't have any test data that *can* show an improvement
anymore, so I lost interest in tweaking the algorithms.  Do you have test
sets that could show improvements?  If so, you can eyeball the mistakes and
usually make a good guess as to whether a specific new gimmick would help
them.  What you can't usually guess is whether the gimmick would hurt the
correctly classified msgs more than it helps the mistakes.



