[Spambayes] full o' spaces
Tim Stone - Four Stones Expressions
tim at fourstonesExpressions.com
Fri Mar 7 11:59:19 EST 2003
3/7/2003 11:19:02 AM, Charles Cazabon <python-spambayes at discworld.dyndns.org> wrote:
>Tim Stone - Four Stones Expressions <tim at fourstonesExpressions.com> wrote:
>> Ya, I noticed that same thing yesterday. Maybe an "excessive whitespace"
>> clue, or "many single character words" clue, or something like that?
>Ratio of number of spaces to number of non-spaces in the body, perhaps? Add
>metatoken if this exceeds 0.25 or something like that.
Any threshold we use for anything like this has to be configurable. Otherwise
the spammers will simply make sure they don't exceed the threshold...
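A minimal sketch of the ratio idea with a configurable cutoff. The function name, the metatoken string, and the 0.25 default (taken from the suggestion quoted above) are illustrative assumptions, not part of the actual Spambayes tokenizer.

```python
def whitespace_ratio_tokens(body, threshold=0.25):
    """Return a metatoken list if whitespace/non-whitespace exceeds threshold.

    `threshold` is meant to be read from user configuration rather than
    hard-coded, so that there is no single value for spammers to target.
    """
    spaces = sum(1 for ch in body if ch.isspace())
    non_spaces = len(body) - spaces
    if non_spaces == 0:
        # Empty or all-whitespace body: no meaningful ratio to report.
        return []
    if spaces / non_spaces > threshold:
        return ["meta:excessive-whitespace"]  # hypothetical token name
    return []
```

Each user could then set their own value, e.g. whitespace_ratio_tokens(body, threshold=0.3).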
In normal (English) language usage, there is probably a relatively well
understood distribution of unigrams, bigrams, trigrams, and longer words (that
is, words of one, two, three, or more characters). Any 'severe' departure from
this distribution could be a very good spam clue. For example, I could use the
following to defeat whitespace and unigram counting:
Bu y m ore st u ff t h an yo u EVE R tho ug ht you c ou l d h and le.
It's a bit harder to read than regular text, but the human brain is amazingly
adaptive to stuff like this. This kind of trickery is likely to be one avenue
that spammers lean on heavily to defeat us (the other being malformed mail,
imo).
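A rough sketch of the "departure from the normal word-length distribution" clue. Everything here is an assumption for illustration: the function names, the token string, the 0.5 default cutoff, and the choice of total variation distance as the measure of departure. The reference distribution would have to be built from a user's own known-good mail, not from fixed numbers.

```python
from collections import Counter

def word_length_distribution(text, max_len=10):
    """Fraction of words at each length; lengths >= max_len share one bucket."""
    lengths = [min(len(word), max_len) for word in text.split()]
    if not lengths:
        return {}
    total = len(lengths)
    return {n: count / total for n, count in Counter(lengths).items()}

def total_variation(p, q):
    """Half the L1 distance between two distributions (0 = same, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def departure_tokens(body, reference, threshold=0.5):
    """Return a metatoken list if body's word lengths stray far from reference."""
    dist = word_length_distribution(body)
    if not dist:
        return []
    if total_variation(dist, reference) > threshold:
        return ["meta:odd-word-lengths"]  # hypothetical token name
    return []
```

Because this compares the whole shape of the distribution rather than one count, the chopped-up sentence above still stands out even though it mixes one-, two-, and three-character fragments.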
Oh, and btw, don't believe for a second that spammers don't subscribe to this
list.
c'est moi - TimS