[Spambayes] full o' spaces
Phillip J. Eby
pje at telecommunity.com
Fri Mar 7 16:01:18 EST 2003
At 02:29 PM 3/7/03 -0600, Tim Stone - Four Stones Expressions wrote:
> >Can you look at percentage of unigrams, bigrams, trigrams, and ngrams?
>I'm thinking that, for English anyway, nu < nb < nt < nn is the rule. If
>that rule is violated, then that's a spam indicator. I sure don't know if that's
>the case with other languages, though...
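The ordering check quoted above could be sketched roughly like this, assuming it refers to counts of distinct *character* n-grams (the function names and the choice of max_n here are illustrative, not anything in Spambayes):

```python
def ngram_counts(text, max_n=4):
    """Number of distinct character n-grams in text, for n = 1..max_n."""
    return [len({text[i:i + n] for i in range(len(text) - n + 1)})
            for n in range(1, max_n + 1)]

def violates_ordering(text):
    """True if the distinct n-gram counts are not strictly increasing
    (nu < nb < nt < nn), which the heuristic treats as a noise indicator."""
    counts = ngram_counts(text)
    return any(a >= b for a, b in zip(counts, counts[1:]))
```

Heavily space-padded text collapses the distinct-bigram count (almost every bigram is "letter, space" or "space, letter"), so something like "a a a a a a" trips the check while ordinary running text generally does not.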
There may be a simple way to deal with the entire range of possible
"character noise" techniques, be it whitespace, letter->number
substitution, etc. What if we simply create a meta-token which is driven
by the ratio of recognized to unrecognized (non-meta) tokens? In this way,
the more noise a spammer adds to their message, the greater the probability
that the message will be considered "noisy spam". Repeats of the same
message after training would result in the message being "recognized spam";
repeats before training would be spotted by their being "noisy".
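A minimal sketch of that meta-token idea, assuming we can ask the classifier which tokens it has seen in training (the names, the "noise:" prefix, and the bucket edges below are all illustrative, not Spambayes API):

```python
def noise_meta_token(tokens, known_tokens):
    """Return a meta-token summarizing the fraction of tokens NOT found
    in the trained vocabulary, e.g. 'noise:<=25'."""
    if not tokens:
        return "noise:none"
    unrecognized = sum(1 for t in tokens if t not in known_tokens)
    pct = 100 * unrecognized // len(tokens)
    # Coarse buckets so the meta-token generalizes across messages
    # instead of producing a distinct token per exact ratio.
    for hi in (10, 25, 50, 75):
        if pct <= hi:
            return "noise:<=%d" % hi
    return "noise:>75"
```

The classifier would then train on this meta-token like any other, so the more unrecognized noise a spammer injects, the more hammered the "noise:>75" token's spam probability gets.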
The natural spammer countermove to this is that they'll have to add lots of
boilerplate "hammy" English text to bump themselves back into the "unsure"
range, and/or begin adding noise only to highly spammy words. I already
get tons of spam about "seks" and "r4pe" and similar things. I'm not sure
what to do about these countermoves, but at least it puts us back on level
ground with the spammers again. I'm afraid that adding "bulk noise" like
whitespace and punctuation to messages would be a too-easily automated
anti-bayes move for spammers to adopt in general.