[Spambayes] classifying tokens

Tim Peters tim.one at comcast.net
Tue Dec 9 20:19:37 EST 2003


[Atom 'Smasher']
> when scoring, i noticed that some tokens seem to be classified based
> on where (or how) they're found...

When testing showed that was helpful, yes, tokens get tagged.  Tokens coming
from an email header line are generally tagged with the name of the header
line ("subject:...", "date:..."), and pieces coming from embedded URLs are
tagged with "url:".  There are some others.
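To make the tagging idea concrete, here is a minimal sketch. It is not the real tokenizer: the function names, the URL regex, and the way URLs are split into pieces are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for spotting embedded URLs; the real tokenizer
# is considerably more careful than this.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def tag_header(name, value):
    """Yield each word of a header value, prefixed with the header name."""
    for word in value.lower().split():
        yield name.lower() + ":" + word

def tag_urls(body):
    """Yield the pieces of each embedded URL, prefixed with 'url:'."""
    for url in URL_RE.findall(body):
        # Split on common URL punctuation and drop empty pieces.
        for piece in re.split(r"[/:.?&=]+", url.lower()):
            if piece:
                yield "url:" + piece
```

The point is only that the same word can become a different token depending on where it was found, so "cheap" in a Subject line and "cheap" in the body are counted separately.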

> most of these are self-explanatory, but what about "virus"?? is there
> part of an email that lets SB know it's a virus?

No, but there are certain tokens that appear to be *associated* with
viruses.  That doesn't mean an email containing one of those *is* a virus;
it's just one more clue to throw into the pot.  Remember that SpamBayes has
no preconceived notions of what ham or spam are.  The appearance of

    height=0

in an email will get tagged with a "virus:" prefix by SpamBayes, but in
*your* data it might be a ham clue.  SpamBayes doesn't pre-judge that.  I'll
add that height=0 and width=0 in HTML are almost always used to hide
*something* from you, and that is a common trick in virus email.  I'm not
sure I've ever seen a legitimate use for them that I recognized, but my
(currently small) database must contain some, since a few of these tokens
have nonzero ham counts:

    'virus:width=0'   spamcount: 3 hamcount: 0
    'virus:</iframe'  spamcount: 1 hamcount: 0
    'virus:src="cid:' spamcount: 8 hamcount: 2
    'virus:height="0' spamcount: 1 hamcount: 3
    'virus:height=0'  spamcount: 3 hamcount: 0
    "virus:src='cid:" spamcount: 1 hamcount: 0
    'virus:width="0'  spamcount: 1 hamcount: 3
    'virus:src=cid:'  spamcount: 1 hamcount: 0
    'virus:</script'  spamcount: 5 hamcount: 3
    'virus:<script'   spamcount: 5 hamcount: 3
    'virus:<iframe'   spamcount: 1 hamcount: 0
    "virus:height='0" spamcount: 1 hamcount: 0
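The token names in that listing suggest the kind of pattern that produces them. The following is a sketch under that assumption, not a copy of the SpamBayes source: the regex and the function name virus_clues are illustrative, chosen so that the output matches the token shapes shown above.

```python
import re

# Assumed pattern, written to reproduce the token shapes in the listing.
VIRUS_CLUE_RE = re.compile(r"""
    < /? \s* (?: script | iframe ) \b     # opening or closing script or iframe tag
  | \b src= ['"]? cid:                    # source that points at an attached part
  | \b (?: height | width ) = ['"]? 0     # zero height or width, optionally quoted
""", re.VERBOSE | re.IGNORECASE)

def virus_clues(html):
    """Yield a 'virus:'-prefixed token for each suspicious HTML fragment."""
    for match in VIRUS_CLUE_RE.finditer(html):
        yield "virus:" + match.group(0).lower()
```

For example, feeding it `<IFRAME src="cid:abc" height=0 width="0"></iframe>` yields `virus:<iframe`, `virus:src="cid:`, `virus:height=0`, `virus:width="0` and `virus:</iframe`.  Nothing here decides that the message is a virus; the tokens are just counted like any others.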

In particular, the presence of embedded code (the "script" tags) is only
mildly spammy in my database.  That's almost certainly an artifact of
mistake-based training, though:  most spam and viruses with embedded script
already get nailed as spam, so I never train on most of them.  OTOH, I get a
couple of ham HTML newsletters with embedded script, and have trained on a
few of those because they looked pretty darned spammy otherwise.  If I
trained on everything instead, I'm pretty sure all of those would look
strongly spammy.
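To see why 5 spam vs. 3 ham counts as only mildly spammy, here is a rough sketch of turning counts into a per-token spam probability.  It assumes a Robinson-style smoothed estimate with strength s=0.45 and prior x=0.5; those values, and the totals of 100 trained ham and 100 trained spam used in the comments, are assumptions for illustration and do not come from the database above.

```python
def spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Smoothed estimate of how spammy a token is, given its counts.

    nspam and nham are the total numbers of spam and ham messages
    trained on; s and x are assumed smoothing parameters.
    """
    n = spamcount + hamcount
    if n == 0:
        return x                      # never seen: fall back to the prior
    spamratio = spamcount / nspam     # fraction of spam containing the token
    hamratio = hamcount / nham        # fraction of ham containing the token
    raw = spamratio / (spamratio + hamratio)
    # Pull the raw ratio toward x; the pull weakens as n grows.
    return (s * x + n * raw) / (s + n)

# With hypothetical totals of 100 ham and 100 spam:
#   spamprob(5, 3, 100, 100) is about 0.62  (like 'virus:<script')
#   spamprob(3, 0, 100, 100) is about 0.93  (like 'virus:width=0')
```

With unequal totals the numbers shift, which is one reason the same counts can mean different things in different databases.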

> what about "skip"?
>
> i haven't found any documentation on this...

That's because there isn't any <wink>.  The internals of the database aren't
documented, and there's no promise that they'll remain the same.  If you
really want to know, that's cool:  the way to do it is to get the source
code and study it.  All tokens are produced by the tokenizer.py module.
Between the code and the comments in that, there's a long and detailed
explanation about what "skip:" tokens mean and why they're generated.  It's
hard to explain more briefly than that, because it's one of the features
SpamBayes generates "for no reason at all" -- as the comments say, I don't
know *why* it helps, I only know that testing showed that it did help.
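As a starting point for reading that code, here is a simplified sketch of the idea, reconstructed from memory rather than copied from tokenizer.py.  The 12-character cutoff, the 3-character minimum, and the exact token format are assumptions here, and the real function handles other cases (such as embedded email addresses) that are left out; check the source before relying on any of it.

```python
def tokenize_word(word, maxword=12):
    """Yield the word itself, or a 'skip:' token if it is too long.

    maxword is an assumed cutoff; very short words yield nothing.
    """
    n = len(word)
    if 3 <= n <= maxword:
        yield word
    elif n > maxword:
        # Record only the first character and the length rounded
        # down to a multiple of 10, not the word itself.
        yield "skip:%c %d" % (word[0], n // 10 * 10)
```

Under these assumptions a 25-character run of "x" would produce the single token "skip:x 20", so long strings of junk collapse into a small number of tokens that can still pick up spam or ham counts.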



