[Spambayes] Watch out for this
Balazs Attila Mihaly
cdman at coder.hu
Wed Sep 10 07:59:31 EDT 2003
Hello.
Sorry for writing without registering on any mailing list or at sf.net,
but I'm kind of lazy. Anyway, I've become interested in text
classification after reading some of the papers here. For me spam isn't
a big issue yet (getting 2-5 spam mails a day is tolerable), but
I still hope you are progressing well. Maybe later on I'll install it
too.
Now the issue I'm writing about: there has been a lot of talk about the
tokenizing part (should it include the HTML tags or not, etc.), but I
would like to add one idea of my own. As you know, in HTML pages
characters can be written as &#<number>;, where <number> represents the
ASCII (or maybe Unicode - I'm not sure) code of the character. Now, if
you don't convert these references back to the characters they stand
for, a spammer could use an almost unlimited number of variations of a
word (writing hel&#<code for l>;o or &#<code for h>;ello, or replacing
multiple characters with their codes). So my suggestion would be to
convert them back. PHP already has such a function
(html_entity_decode), so you could take a look at its source code for
help.
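
To make the suggestion concrete, here is a minimal sketch in Python. It
assumes Python 3's standard html.unescape function is available; the
names decode_entities and tokenize are made up for illustration and are
not taken from the Spambayes tokenizer.

```python
import html


def decode_entities(text):
    """Turn character references such as &#108;, &#x6c; or &amp;
    back into the characters they stand for."""
    return html.unescape(text)


def tokenize(text):
    # Hypothetical, deliberately naive tokenizer: decode first, then
    # lowercase and split on whitespace. A real tokenizer does far more.
    return decode_entities(text).lower().split()


if __name__ == "__main__":
    # Both obfuscated spellings collapse to the same token.
    print(tokenize("hel&#108;o"))
    print(tokenize("&#104;ello"))
```

Decoding before splitting means every obfuscated spelling of a word ends
up as the same token, so the classifier sees one word instead of many
rare variants.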
-Cd-MaN
P.S. Sorry if this issue has already been brought up.