[Spambayes] Watch out for this

Skip Montanaro skip at pobox.com
Wed Sep 10 11:18:43 EDT 2003


    Balazs> would like to add one idea of my own: as you know in html pages
    Balazs> characters can be written as &#<number>;, where <number>
    Balazs> represents the ASCII (or maybe UNICODE - I'm not sure) code of
    Balazs> the character. Now, if you don't convert these characters back
    Balazs> to their corresponding values, a spammer could use almost
    Balazs> infinite variations of words (writing hel&#<code for l>;o or
    Balazs> &#<code for h>;ello or replacing multiple
    Balazs> characters with their codes). So my suggestion would be to
    Balazs> convert them back.

Good suggestion.  I'm not sure if the tokenizer does this already, but a
quick grep for '&#[0-9];' through my current training database (about 3
million lines) suggests this technique is still used fairly infrequently.
Only about 2100 lines (around 0.07%) contained a numeric entity.  If/when
the spammers start using such techniques and they turn out to cause
problems for the classifier, it should be fairly easy to extend the
tokenizer to make the necessary substitutions.
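
For illustration only, here is a rough sketch of what such a substitution
could look like as a preprocessing step.  This is not the Spambayes
tokenizer's code; the function name decode_numeric_entities and the regular
expression are made up for this example, and it assumes the text is already
a decoded string.  It handles both decimal (&#108;) and hexadecimal (&#x6C;)
references and leaves anything it cannot convert untouched.

```python
import re

# Hypothetical helper, not part of Spambayes.
# Matches decimal (&#108;) and hexadecimal (&#x6C;) numeric character references.
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text):
    """Replace numeric character references with the characters they encode."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Code point out of range: keep the original text as-is.
            return match.group(0)

    return _NUMERIC_ENTITY.sub(_replace, text)


if __name__ == "__main__":
    print(decode_numeric_entities("hel&#108;o"))
    print(decode_numeric_entities("&#x68;ello"))
```

Running the decoder before tokenizing would make "hel&#108;o" and "hello"
produce the same token.  In current Python versions the standard library's
html.unescape covers numeric references as well as named entities, so a
hand-rolled regular expression like this one is mainly useful when only the
numeric forms should be touched.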

Skip


