[Spambayes] Here's why "generate_long_skips: False" worked...

Tim Peters tim.one@comcast.net
Mon, 30 Sep 2002 20:27:41 -0400


[Skip Montanaro]
> I figured out why the false positive I saw was interpreted as
> text.  I had been incorrectly forwarding mail from the
> itineraries@mojam.com command processor alias (for probably five
> years or more).  This wasn't a big deal in the past because I am
> the only person who receives such messages, but it was incorrect
> nonetheless.  Instead of sending the original message out with
> Resent-*: headers prepended, I sent a new message with the
> original message as the body, e.g.:

[and the original headers "look like body text", ditto the MIME
 decorations]

> I just fixed that piece of code over the weekend.  Since I won't
> be getting any new mail like the above note in the future, I suppose
> I should purge them from my collection or adjust those messages to
> have the correct format.

Out of curiosity, what percentage of your corpus consisted of such msgs?
And were they all ham?

> So, should I pull the generate_long_skips option back out?

I'm neutral, but if you leave it in please change the comment (it's
misleading now).  I believe that whenever a skip token does some good, it's
indicating a weakness in the tokenizer (this is nearly tautological:  when
skip does some good, it says there's useful info in "very long words"!).
Over time, I hope people are inspired to find out just what good it is that
we're getting by crudely summarizing via "skip" tokens, and extract it
purposefully.  An easy example is Asian spam, where the lack of whitespace
ends up generating oodles of skip tokens (and '8bit%' tokens), but there
must be a more effective way to generate useful tokens for that without
bloating the database beyond reason.  So I hope that skip-generation will
eventually become worthless.
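
For readers unfamiliar with the mechanism, here is a minimal sketch of how a
"very long word" can be crudely summarized as a skip token and an '8bit%'
token.  The function name, the 12-character cutoff, and the exact token
formats are assumptions made for illustration; this is not presented as the
actual Spambayes tokenizer code.

```python
def tokenize_word(word, max_word_size=12, generate_long_skips=True):
    """Yield tokens for one whitespace-delimited word (illustrative sketch).

    max_word_size and the token formats are assumed values for this example.
    """
    n = len(word)
    if 3 <= n <= max_word_size:
        # Ordinary word: use it as a token unchanged.
        yield word
    elif n > max_word_size:
        # Too long to keep verbatim.  Summarize it by its first character
        # and its length rounded down to a multiple of 10.
        if generate_long_skips:
            yield "skip:%c %d" % (word[0], n // 10 * 10)
        # Separately record what percentage of the characters are non-ASCII.
        hicount = sum(1 for ch in word if ord(ch) >= 128)
        if hicount:
            yield "8bit%%:%d" % round(hicount * 100.0 / n)
    # Words shorter than 3 characters produce no tokens in this sketch.


if __name__ == "__main__":
    for w in ["hello", "a" * 25, "\u4e2d" * 20]:
        print(repr(w), list(tokenize_word(w)))
```

In this sketch a 25-character run with no whitespace collapses to the single
token "skip:a 20", which shows both why the summary is cheap and how much
information it throws away.  With generate_long_skips=False the same word
produces no token at all.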