[spambayes-dev] spammy subject lines

Tim Peters tim.one at comcast.net
Fri Oct 10 23:10:46 EDT 2003


[Paul Sorenson]
> I am getting quite a bit of spam with subject lines like the
> following:
>
> subject: Lon.g an^d Str;ong al)l Nigh_t j-jcgzies
> subject: Ch-eck ou=t ou-r sel)ection _of grea)t R_X -emgffj
>
> Looking at the tokenizer code for subject lines I was wondering if
> there was value in stripping punctuation then doing the usual word
> tokenisation.

If done in addition to the Subject line gimmicks already there, I expect
this wouldn't hurt and might help.  We've had good results in the past by
increasing attention paid to the subject line, probably because the subject
line is one of the best ways for an email to grab attention.
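
A minimal sketch of the idea, for illustration only. The helper name and
the "subject-nopunct:" token prefix are made up here and are not part of
the existing tokenizer; it just strips punctuation (including
underscores) from the Subject line and then splits on whitespace.

```python
import re

# Hypothetical helper, not the actual SpamBayes tokenizer code.
# Matches any character that is neither a word character nor whitespace,
# plus the underscore (which \w would otherwise keep).
_PUNCT = re.compile(r"[^\w\s]|_")

def depunctuated_subject_tokens(subject):
    """Yield extra tokens from a Subject line with punctuation removed."""
    stripped = _PUNCT.sub("", subject)
    for word in stripped.lower().split():
        # The prefix is an assumption, chosen to keep these tokens distinct.
        yield "subject-nopunct:" + word

if __name__ == "__main__":
    for tok in depunctuated_subject_tokens(
            "Lon.g an^d Str;ong al)l Nigh_t j-jcgzies"):
        print(tok)
```

On the first example above this turns "Lon.g" into "long" and "Nigh_t"
into "night", while the trailing gibberish word survives as "jjcgzies".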

> It seems there are other special cases taken into account for the
> subject line, so care would need to be taken not to break those.

Add new code and it won't break -- just be straightforward.  For example, a
later stage weeds out duplicates, so don't worry about generating a token
that's already been generated by the other stuff.
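
To illustrate why duplicates are harmless, here is a rough sketch. Both
generators and the "subject:" prefix are stand-ins invented for this
example, and the set() call only assumes that the later stage behaves
like a set; it is not the real code path.

```python
import re

def existing_tokens(subject):
    # Stand-in for whatever the tokenizer already does (hypothetical).
    for word in subject.lower().split():
        yield "subject:" + word

def depunctuated_tokens(subject):
    # New code: strip punctuation and underscores, then split on whitespace.
    stripped = re.sub(r"[^\w\s]|_", "", subject)
    for word in stripped.lower().split():
        yield "subject:" + word

def unique_subject_tokens(subject):
    # Assumption: the later stage acts like a set, so a token produced
    # by both generators is only counted once.
    tokens = list(existing_tokens(subject)) + list(depunctuated_tokens(subject))
    return set(tokens)

if __name__ == "__main__":
    print(sorted(unique_subject_tokens("Free o.ffer")))
```

Here "subject:free" is produced by both generators but appears once in
the result, alongside "subject:o.ffer" and the new "subject:offer".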

> I would be happy to have a crack at a patch.  I just wanted to float
> the idea first, given that I am unfamiliar with the existing codebase
> and unsure whether this has already been tried.

I don't think this particular trick was tried before.  The inclusion of
punctuation gibberish in Subject lines seems to be a relatively new
obfuscation gimmick.  "Invisible" obfuscation tricks in HTML get reused
forever, but tricks that obscure what the user sees tend to be faddish (if
you're a spammer, you don't want to reduce your response rate, and
obfuscating what potential customers see is generally bad advertising; the
novelty of punctuation-laden words may grab more attention for a short
while, but it gets old very fast).



