[Spambayes] To think like a spammer...

Josiah Carlson jcarlson@uci.edu
Sat, 28 Sep 2002 19:54:18 -0700


> Even if you back up and allow single character tokens, you're only going
> to recognize a handful of those.  A spammer could stuff the end of the
> message with ham words to overcome the effect of the single char. tokens.
> Do these space-words have to be collapsed to defeat the effect?
> 
> Actually, protection against stuffing the end of a spam w/ ham words is
> an angle we have to be careful about anyway.

If enough spammers used such tricks, their messages would definitely
end up in the spam corpus.  If spambayes then started allowing
1-character tokens, most single letters would likely be flagged as more
spammy than hammy, because most words are longer than one letter.  The
exceptions would be letters that happen to show up in ham, as initials
in sigs or otherwise.

The problem with removing spaces is that the software would need to have
a listing of valid words, and one would need to check all the possible
concatenations of tokens that could _potentially_ create a word.  You
could look at it in terms of a bitmap.

Given a list of bits, each representing a different token, you can use 1
to represent a continuation of the previous word (concatenated) and a 0
to represent the beginning of a new word. Current non-concatenation
gives us all 0's.  You can create a secondary bitmap to show which of
the tokens/bits can be a continuation.
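
To make the bitmap idea concrete, here is a minimal sketch in Python.
The names apply_bitmap and continuation_mask are made up for
illustration, and the rule "only single-character tokens may be glued
onto the previous word" is an assumption, not something spambayes does.

```python
def apply_bitmap(tokens, bits):
    """Join tokens into words using one bit per token.

    A 1 bit glues the token onto the previous word; a 0 bit starts a
    new word.  All 0's reproduces the current (non-concatenating) behavior.
    """
    if len(tokens) != len(bits):
        raise ValueError("need exactly one bit per token")
    words = []
    for token, bit in zip(tokens, bits):
        if bit and words:
            words[-1] += token      # continuation of the previous word
        else:
            words.append(token)     # beginning of a new word
    return words


def continuation_mask(tokens):
    """Secondary bitmap: which tokens are allowed to be a continuation.

    Assumption for this sketch: only single-character tokens qualify,
    and the first token can never continue anything.
    """
    return [i > 0 and len(t) == 1 for i, t in enumerate(tokens)]


tokens = ["f", "r", "e", "e", "m", "o", "n", "e", "y"]
print(apply_bitmap(tokens, [0] * len(tokens)))            # no concatenation
print(apply_bitmap(tokens, [0, 1, 1, 1, 0, 1, 1, 1, 1]))  # ['free', 'money']
print(continuation_mask(["buy", "f", "r"]))               # [False, True, True]
```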

If you take a brute-force approach to it and try every bitmap, you're
looking at 2**(number of single-character entries) ways the message can
be concatenated.  That's REALLY bad.
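
A rough sketch of what trying every combination looks like, just to show
the blow-up.  The function name all_joinings is hypothetical.  Since the
first token always starts a word, only n - 1 of the n bits are actually
free, so the count is 2**(n - 1); the growth is exponential either way.

```python
from itertools import product


def all_joinings(tokens):
    """Yield every way of gluing adjacent tokens together into words."""
    if not tokens:
        return
    # One bit per token after the first: 1 = glue onto previous word, 0 = new word.
    for bits in product((0, 1), repeat=len(tokens) - 1):
        words = [tokens[0]]
        for token, bit in zip(tokens[1:], bits):
            if bit:
                words[-1] += token
            else:
                words.append(token)
        yield words


tokens = list("free")
print(sum(1 for _ in all_joinings(tokens)))   # 8, i.e. 2**3

# A run of 30 single-character tokens is already out of reach:
print(2 ** (30 - 1))                          # 536870912 candidate joinings
```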

However, you could do a DFS on a tree of valid words, which would
produce unbelievably fast results in C, and even all right results using
Python dictionaries (something that I feel like coding up this evening;
it'll be neat).  But you would still have problems with purposeful
misspellings.
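
Something along these lines is what I have in mind for the dictionary
version.  This is only a sketch: build_trie and segmentations are
made-up names, the word list is a toy one, and nothing here is existing
spambayes code.  The tree is nested dicts keyed by character, and the
DFS gives up on a branch as soon as the glued-together tokens stop
being a prefix of any valid word.

```python
_END = object()   # marker key: a complete word ends at this node


def build_trie(words):
    """Build a tree of valid words out of nested dictionaries."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def segmentations(tokens, trie):
    """Return every way to glue the tokens into words that are all valid."""
    results = []

    def dfs(i, node, current, words):
        if i == len(tokens):
            if not current:                  # ended exactly on a word boundary
                results.append(list(words))
            return
        for ch in tokens[i]:                 # walk the tree along token i
            node = node.get(ch)
            if node is None:
                return                       # no valid word has this prefix: prune
        current += tokens[i]
        if _END in node:                     # option 1: close the word here
            words.append(current)
            dfs(i + 1, trie, "", words)
            words.pop()
        dfs(i + 1, node, current, words)     # option 2: keep extending the word

    dfs(0, trie, "", [])
    return results


trie = build_trie(["free", "money", "me", "on"])
print(segmentations(list("freemoney"), trie))   # [['free', 'money']]
print(segmentations(list("fr3emoney"), trie))   # [] (the misspelling defeats it)
```

The second call shows the misspelling problem: one substituted
character and the lookup finds nothing.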

Good point though,
 - Josiah