[Spambayes] spam tokens

Atom 'Smasher'
Mon Dec 1 21:27:50 EST 2003

two things i've noticed about spam. i'm not sure if either of them is
taken into account by SB, but maybe someone can look into this
further...  or maybe someone already has and can tell me why these
don't work...

1) so many spams have a *lot* of spaces (and tabs?) in the subject line.
(like above {taken from real spam}).

i know... multiple spaces aren't tokens, they *separate* tokens... but
when there are 20+ in a row, in the subject line, that usually means spam.
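
a rough sketch of what i mean, in python. this is not how SB actually
tokenizes subjects; the function name, the "subject:" prefix, the
synthetic token name and the threshold of 20 are all made up for
illustration. the idea is just to emit one extra token when a long run
of spaces/tabs shows up, so the classifier can learn whether it matters:

```python
import re

# hypothetical threshold, taken from the "20+ in a row" observation
RUN_THRESHOLD = 20


def subject_tokens(subject):
    """Yield ordinary word tokens, plus one synthetic token if the
    subject contains a long run of spaces and/or tabs."""
    # normal tokens: whitespace still just separates words
    for word in subject.split():
        yield "subject:" + word.lower()

    # length of the longest run of spaces/tabs (0 if there is none)
    longest = max(
        (len(m.group()) for m in re.finditer(r"[ \t]+", subject)),
        default=0,
    )
    if longest >= RUN_THRESHOLD:
        # made-up token name; the classifier would score it like any word
        yield "subject:long-whitespace-run"
```

the extra token gets trained like any other, so if long runs also show
up in legit mail, its score should just drift back toward neutral.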

2) so many spams are filled with nonsense and random strings
	rldvlzgj coldokiue i q wfup cadrhs r cqufqc e p fnlcgv fipv
which probably don't appear in legit email.

can these be used to detect spam? are they used?

my understanding of bayesian filtering is that if it has never before
encountered the word "rldvlzgj", then it scores 0.5 (or something fairly
neutral). well, after i've trained it on a few hundred or a few thousand
emails, i think it should have a good handle on my vocabulary and maybe be
less forgiving with words i haven't seen before.
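
to make the question concrete, here's a small python sketch of the kind
of per-word scoring i have in mind. it uses the robinson-style
adjustment, where a word's score is pulled toward a prior depending on
how often it has been seen. i'm assuming that's roughly what SB does;
the function and parameter names are mine, and the defaults (0.5 for
the unknown-word prior, 0.45 for its strength) are what i believe SB
uses, so treat them as assumptions:

```python
def word_score(spam_count, ham_count, nspam, nham,
               unknown_prob=0.5, strength=0.45):
    """Score one word: near 0.0 looks like ham, near 1.0 looks like spam.

    spam_count / ham_count: trained messages of each kind containing the word
    nspam / nham: total trained messages of each kind
    unknown_prob, strength: assumed prior for words with little or no data
    """
    n = spam_count + ham_count
    if n == 0:
        # never seen in training: fall back to the prior
        return unknown_prob

    # fraction of each corpus containing the word (avoid dividing by zero)
    spam_ratio = spam_count / max(nspam, 1)
    ham_ratio = ham_count / max(nham, 1)
    p = spam_ratio / (spam_ratio + ham_ratio)

    # the more often the word was seen, the less the prior matters
    return (strength * unknown_prob + n * p) / (strength + n)
```

in these terms, being "less forgiving" with unseen words would mean
passing an unknown_prob above 0.5, e.g. word_score(0, 0, 1000, 1000,
unknown_prob=0.6). whether that helps or hurts is exactly what i'm
asking about.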

i fully understand that the nature of bayesian filtering is often
counter-intuitive when it comes to what to look at and what to ignore, so
i'm fully prepared for someone to tell me exactly why these things don't
work the way my brain thinks they should.


