[Spambayes] full o' spaces

Tim Stone - Four Stones Expressions tim at fourstonesExpressions.com
Fri Mar 7 11:59:19 EST 2003

3/7/2003 11:19:02 AM, Charles Cazabon <python-spambayes at discworld.dyndns.org> wrote:

>Tim Stone - Four Stones Expressions <tim at fourstonesExpressions.com> wrote:
>> Ya, I noticed that same thing yesterday.  Maybe an "excessive whitespace" 
>> clue, or "many single character words" clue, or something like that?
>Ratio of number of spaces to number of non-spaces in the body, perhaps?  Add 
>metatoken if this exceeds 0.25 or something like that.

Any threshold we use for anything like this has to be configurable.  Otherwise 
the spammers will simply make sure they don't exceed the threshold...
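
As a rough sketch of what Charles's ratio clue might look like with the
threshold exposed as a parameter (the function names and the metatoken string
here are made up for illustration and are not the actual Spambayes tokenizer
API; 0.25 is just the figure floated above):

```python
def whitespace_ratio(body):
    """Return (number of spaces) / (number of non-space characters).

    Only the plain space character is counted as a "space" here; that is
    an assumption, and tabs or newlines could be counted as well.
    """
    spaces = body.count(" ")
    non_spaces = len(body) - spaces
    if non_spaces == 0:
        return 0.0  # empty or all-space body: nothing to compare against
    return spaces / non_spaces


def whitespace_tokens(body, threshold=0.25):
    """Yield a (hypothetical) metatoken when the ratio exceeds threshold."""
    if whitespace_ratio(body) > threshold:
        yield "meta:excessive-whitespace"
```

A site would pass its own threshold, e.g. whitespace_tokens(body,
threshold=0.4), so there is no single published number for spammers to aim
under.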

In normal (English) language usage, there is probably a relatively well 
understood distribution of unigrams, bigrams, trigrams, and longer words.  Any 
'severe' departure from this distribution could be a very good spam clue.  For 
example, I could use the following to defeat a whitespace and unigram counting 
clue:

Bu y  m ore  st u ff  t h an  yo u  EVE R  tho ug ht  you  c ou l d  h and le.

It's a bit harder to read than regular text, but the human brain is amazingly 
adaptive to stuff like this.  This kind of trickery is likely to be one avenue 
that spammers lean on heavily to try to defeat us (the other being malformed 
mail, imo).
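
To make the distribution idea a bit more concrete, here is a minimal sketch,
assuming "unigrams, bigrams, trigrams" means one-, two-, and three-letter
words.  It compares the word-length distribution of a message against that of
some known-good reference text supplied by the caller.  The function names,
the choice of total variation distance, and the default threshold of 0.5 are
all my own assumptions, not anything Spambayes does today:

```python
from collections import Counter


def length_distribution(text, max_len=8):
    """Map word length (capped at max_len) to its share of the words in text."""
    lengths = [min(len(word), max_len) for word in text.split()]
    if not lengths:
        return {}
    total = len(lengths)
    return {length: n / total for length, n in Counter(lengths).items()}


def distribution_distance(p, q):
    """Total variation distance: 0.0 for identical, 1.0 for disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def odd_word_lengths(body, reference_text, threshold=0.5):
    """True if body's word lengths depart 'severely' from the reference."""
    observed = length_distribution(body)
    if not observed:
        return False  # no words at all, so nothing to judge
    expected = length_distribution(reference_text)
    return distribution_distance(observed, expected) > threshold
```

Measured against its normally spaced version, the chopped-up line above is
nearly all one- to three-letter words and gets flagged at the default
threshold.  The threshold stays a parameter for the same reason as before.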

Oh, and btw, don't believe for a second that spammers don't subscribe to this 
list :)

c'est moi - TimS

