[Spambayes] Gibberish as tokens?

Tim Peters tim.one at comcast.net
Mon May 24 21:21:12 EDT 2004


[Becker, Jim]
> This is similar to request #817813 (Consider bad spelling a sign of spam).
> Partial quote of 817813: "If more than xx% of the message is misspelled
> (esp the subject), consider it to be spam."
>
> I frequently find that messages in the possible spam category are full of
> gibberish HTML (randomly generated characters). Many of these also include
> large numbers of gibberish words in the text as well, but {messages with
> gibberish text} seems to be a subset of {messages with gibberish HTML}

But are those what contribute to the "middling scores"?  No:  if SpamBayes
hasn't seen a token before, it has no effect on the score.  Gibberish is
neutral.  So it would probably be more fruitful to study the clues and
determine which tokens make these messages appear hammy to your classifier
(*something* about them must "look hammy", else they wouldn't score as
Unsure).
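
To make "gibberish is neutral" concrete, here's a minimal sketch of the
mechanism (a paraphrase for illustration, not the actual SpamBayes source;
the constant names are assumptions):

    UNKNOWN_WORD_PROB = 0.5    # prior for tokens never seen in training
    MIN_PROB_STRENGTH = 0.1    # distance from 0.5 needed to count as a clue

    def spamprob(token, db):
        # db maps token -> learned probability; unseen tokens get the
        # neutral prior.
        return db.get(token, UNKNOWN_WORD_PROB)

    def strong_clues(tokens, db):
        # Only tokens far enough from neutral survive to be combined,
        # so a never-seen gibberish token can't move the score at all.
        return [t for t in tokens
                if abs(spamprob(t, db) - 0.5) >= MIN_PROB_STRENGTH]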

> SpamBayes has tended to give these messages middling scores. I do the
> incremental training, and SpamBayes thereby acquires a lot of what I call
> "0/1" tokens -- tokens that have appeared in 0 ham, 1 spam, but will
> probably never appear again.

The grown-up <wink> term is "hapax" (a feature that appears only once in a
corpus).  SpamBayes does generate a lot of those, even if you have no
gibberish in your input.  That's generally true across all kinds of computer
text indexing applications, by the way (everything from typos to
computer-generated message ids contributes to this phenomenon).
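
For illustration, assuming a token database that maps each token to a
(ham_count, spam_count) pair (the storage layout here is an assumption),
Jim's "0/1" tokens, together with their "1/0" cousins, are exactly the
entries whose counts sum to one:

    def count_hapaxes(db):
        # Hapaxes: tokens seen exactly once across the whole corpus,
        # i.e. (0 ham, 1 spam) or (1 ham, 0 spam).
        return sum(1 for ham, spam in db.values() if ham + spam == 1)

Gibberish just adds to a pile that's already large.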

However, note that SpamBayes throws away all HTML tags before tokenization,
so any gibberish hapax you see comes from the body of the message.
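
Roughly what that stripping looks like (the real tokenizer's regex is more
careful, so treat this as a sketch):

    import re

    html_re = re.compile(r"<[^>]*>")   # crude: anything tag-shaped

    def tokenize(text):
        # Tags vanish before word-splitting, so "<xqzzy>" contributes
        # nothing; gibberish *between* tags survives as body tokens.
        return html_re.sub(" ", text).split()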

> Maybe SpamBayes could make a token out of the number of unrecognized HTML
> tag names.

Right now, that would be all of them, legitimate or not:  SpamBayes keeps
no list of known HTML tag names to check against.

> Obviously, this means there'd need to be a dictionary of known HTML
> words. Also obviously, the dictionary would fall out of date over
> time.  But at least an HTML dictionary would be easier to update and
> search than a generalized multilingual dictionary.

That's so.
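
If someone did want to experiment with it, a sketch might look like this
(entirely hypothetical; nothing like it exists in SpamBayes, and the tag
list below is deliberately partial):

    import re

    KNOWN_TAGS = {"a", "b", "body", "br", "div", "font", "head", "html",
                  "i", "img", "p", "span", "table", "td", "tr"}
    tag_name_re = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)")

    def unknown_tag_token(text):
        names = (m.group(1).lower() for m in tag_name_re.finditer(text))
        n = sum(1 for name in names if name not in KNOWN_TAGS)
        # Bucket the count so this meta-token doesn't become a hapax
        # factory itself.
        return "unknown-html-tags:%d" % min(n, 10)

The bucketing matters:  emitting the raw count would just mint more 0/1
tokens.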

> Has this been considered?

Probably, but not by me <wink>.  Gibberish words don't seem to give my
classifier any trouble, so if such a gimmick helped, I wouldn't be able to
notice it.  More fundamentally, a single token can't determine the outcome
no matter how spammy (or hammy) it scores.  It would just be another piece
of evidence, treated like all others:  no token of any kind has special
significance in this system, and if any did, that would give spammers clear
targets to attack.  Besides, adding one new token type hasn't made a
statistically significant difference in test results since the very early
days of the project.
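
To see why one token can't swing things, here's the chi-squared combining
in sketch form (chi2Q is the standard series for even degrees of freedom;
the example numbers are approximate):

    from math import exp, log

    def chi2Q(x2, v):
        # P(chi-squared with v (even) degrees of freedom >= x2).
        m = x2 / 2.0
        total = term = exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def score(probs):
        n = len(probs)
        S = chi2Q(-2 * sum(log(p) for p in probs), 2 * n)      # spam evidence
        H = chi2Q(-2 * sum(log(1 - p) for p in probs), 2 * n)  # ham evidence
        return (S + 1.0 - H) / 2.0

    mildly_hammy = [0.4] * 20
    print(score(mildly_hammy))           # roughly 0.31:  Unsure
    print(score(mildly_hammy + [0.99]))  # roughly 0.39:  still Unsure

One maximally spammy clue dropped into twenty mildly hammy ones nudges the
score, but nowhere near the spam cutoff.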

It would be easy to try, but it doesn't sound promising enough to me to be
worth the effort (coding it is simple, but testing is a real bother).




