This is similar to request #817813 (Consider bad spelling a sign of spam).
Partial quote of #817813: "If more than xx% of the message is misspelled (esp
the subject), consider it to be spam."

I frequently find that messages in the possible spam category are full of
gibberish HTML (randomly generated characters). Many of these also include
large numbers of gibberish words in the text, but {messages with
gibberish text} seems to be a subset of {messages with gibberish HTML}.

SpamBayes tends to give these messages middling scores. I train
incrementally, so SpamBayes acquires a lot of what I call
"0/1" tokens -- tokens that have appeared in 0 ham and 1 spam but will
probably never appear again.

Maybe SpamBayes could generate a token from the number of unrecognized HTML
tag names. Obviously, this would require a dictionary of known HTML tag
names, and just as obviously, that dictionary would fall out of date over
time. But at least an HTML dictionary would be easier to update and search
than a generalized multilingual dictionary.
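
To make the idea concrete, here is a rough sketch of what I have in mind. It
is not SpamBayes code: KNOWN_TAGS, TagCollector, unknown_tag_token, and the
token text itself are all names I made up for illustration, and the tag list
is deliberately incomplete. It only uses Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete dictionary of known HTML tag names.
# A real list would be longer and would need occasional updating.
KNOWN_TAGS = frozenset({
    "a", "b", "body", "br", "div", "font", "head", "html", "i", "img",
    "li", "p", "span", "table", "td", "tr", "title", "ul",
})


class TagCollector(HTMLParser):
    """Record the name of every start tag in the markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        self.tags.append(tag)


def unknown_tag_token(html_text):
    """Return one synthetic token describing how many start tags are unknown."""
    collector = TagCollector()
    collector.feed(html_text)
    collector.close()
    unknown = sum(1 for tag in collector.tags if tag not in KNOWN_TAGS)

    # Bucket the count so the token recurs across messages; a raw count
    # would just create more tokens that are rarely seen twice.
    if unknown == 0:
        bucket = "0"
    elif unknown <= 2:
        bucket = "1-2"
    elif unknown <= 9:
        bucket = "3-9"
    else:
        bucket = "10+"
    return "html unknown tags:" + bucket
```

The point of bucketing is that the gibberish tag names themselves are exactly
the "0/1" tokens described above, while a single bucketed token could show up
in many spam messages and so actually accumulate evidence. The bucket
boundaries are arbitrary guesses.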

Has this been considered?

Jim Becker

