[Spambayes] Watch out for this
Skip Montanaro
skip at pobox.com
Wed Sep 10 11:18:43 EDT 2003
Balazs> would like to add one idea of my own: as you know in html pages
Balazs> characters can be written as &#<number>;, where <number>
Balazs> represents the ASCII (or maybe UNICODE - I'm not sure) code of
Balazs> the character. Now, if you don't convert these characters back
Balazs> to their corresponding values, a spammer could use almost
Balazs> infinite variations of words (writing hel&#<code for l>;o or
Balazs> &#<code for h>;ello or replacing multiple
Balazs> characters with their codes). So my suggestion would be to
Balazs> convert them back.
Good suggestion. I'm not sure if the tokenizer does this already, but a
quick grep for '&#[0-9];' through my current training database (about 3
million lines) suggests this is still fairly infrequently used. Only about
2100 of those lines (around 0.07%) contained a numeric
entity. If/when the spammers start using such techniques and they turn out
to cause problems for the classifier, it should be fairly easy to extend the
tokenizer to make the necessary substitutions.
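
For illustration, here is a minimal sketch of the kind of substitution
discussed above, written in Python. It is not the Spambayes tokenizer's code:
the names NUMERIC_ENTITY and decode_numeric_entities are made up for this
example, and treating hexadecimal references (&#x68;) the same way as decimal
ones is an assumption that goes beyond what the thread mentions.

```python
import re

# Hypothetical pattern: matches decimal (&#108;) and hexadecimal (&#x6C;)
# numeric character references. Not taken from the Spambayes source.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text):
    """Replace numeric character references with the characters they encode."""

    def substitute(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(codepoint)
        except (ValueError, OverflowError):
            # Code is outside the valid range: leave the text untouched.
            return match.group(0)

    return NUMERIC_ENTITY.sub(substitute, text)


if __name__ == "__main__":
    print(decode_numeric_entities("hel&#108;o"))
    print(decode_numeric_entities("&#104;ello"))
```

Running the substitution before the text is split into tokens would make
"hel&#108;o" and "hello" yield the same token. Recent Python versions also
ship html.unescape in the standard library, which additionally handles named
entities such as &amp;.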
Skip