[Spambayes] full o' spaces
Phillip J. Eby
pje at telecommunity.com
Fri Mar 7 16:01:18 EST 2003
At 02:29 PM 3/7/03 -0600, Tim Stone - Four Stones Expressions wrote:
> >Can you look at percentage of unigrams, bigrams, trigrams, and ngrams?
>I'm thinking that, for English anyway, nu < nb < nt < nn is the rule. If
>that rule is violated, then that's a spam indicator. I sure don't know if that's
>the case with other languages, though...
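The ordering check quoted above could be sketched roughly like this, assuming it refers to counts of distinct *character* n-grams (the function names and the choice of max_n here are illustrative, not anything in Spambayes):

```python
def ngram_counts(text, max_n=4):
    """Number of distinct character n-grams in text, for n = 1..max_n."""
    return [len({text[i:i + n] for i in range(len(text) - n + 1)})
            for n in range(1, max_n + 1)]

def violates_ordering(text):
    """True if the distinct n-gram counts are not strictly increasing
    (nu < nb < nt < nn), which the heuristic treats as a noise indicator."""
    counts = ngram_counts(text)
    return any(a >= b for a, b in zip(counts, counts[1:]))
```

Heavily space-padded text collapses the distinct-bigram count (almost every bigram is "letter, space" or "space, letter"), so something like "a a a a a a" trips the check while ordinary running text generally does not.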
There may be a simple way to deal with the entire range of possible
"character noise" techniques, be it whitespace, letter->number
substitution, etc. What if we simply create a meta-token which is driven
by the ratio of recognized to unrecognized (non-meta) tokens? In this way,
the more noise a spammer adds to their message, the greater the probability
that the message will be considered "noisy spam". Repeats of the same
message after training would result in the message being "recognized spam";
repeats before training would be spotted by their being "noisy".
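A minimal sketch of that meta-token idea, assuming we can ask the classifier which tokens it has seen in training (the names, the "noise:" prefix, and the bucket edges below are all illustrative, not Spambayes API):

```python
def noise_meta_token(tokens, known_tokens):
    """Return a meta-token summarizing the fraction of tokens NOT found
    in the trained vocabulary, e.g. 'noise:<=25'."""
    if not tokens:
        return "noise:none"
    unrecognized = sum(1 for t in tokens if t not in known_tokens)
    pct = 100 * unrecognized // len(tokens)
    # Coarse buckets so the meta-token generalizes across messages
    # instead of producing a distinct token per exact ratio.
    for hi in (10, 25, 50, 75):
        if pct <= hi:
            return "noise:<=%d" % hi
    return "noise:>75"
```

The classifier would then train on this meta-token like any other, so the more unrecognized noise a spammer injects, the more hammered the "noise:>75" token's spam probability gets.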
The natural spammer countermove to this is that they'll have to add lots of
boilerplate "hammy" English text to bump themselves back into the "unsure"
range, and/or begin adding noise only to highly spammy words. I already
get tons of spam about "seks" and "r4pe" and similar things. I'm not sure
what to do about these countermoves, but at least it puts us back on level
ground with the spammers again. I'm afraid that adding "bulk noise" like
whitespace and punctuation to messages would be a too-easily automated
anti-bayes move for spammers to adopt in general.