[Spambayes] Watch out for this

Balazs Attila Mihaly cdman at coder.hu
Wed Sep 10 07:59:31 EDT 2003


Hello.




Sorry for writing without registering on any mailing list or at sf.net,
but I'm kind of lazy. Anyway, I've become interested in text
classification after reading some of the papers here. For me spam isn't
a big issue yet (getting 2-5 spam mails daily is tolerable), but
still I hope you are progressing well. Maybe later on I'll install it
too.




Now the issue I'm writing about: there has been a lot of talk about the
tokenizing part (should it include the HTML tags or not, etc.), but I
would like to add one idea of my own: as you know, in HTML pages
characters can be written as &#<number>;, where <number> represents the
ASCII (or maybe Unicode - I'm not sure) code of the character. Now, if
you don't convert these references back to the characters they stand
for, a spammer could use almost infinite variations of words (writing
hel&#<code for l>;o or &#<code for h>;ello, or replacing multiple
characters with their codes). So my suggestion would be to convert them
back. PHP already has such a function (html_entity_decode), so you could
take a look at its source code for help.
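
To make the suggestion concrete, here is a minimal sketch in Python of
decoding character references before tokenizing. It is not the Spambayes
tokenizer: decode_char_refs and simple_tokens are hypothetical names made
up for this illustration, and the word-splitting rule is an assumption.
It relies on html.unescape from the Python 3 standard library, which
handles decimal (&#108;), hexadecimal (&#x6c;) and named (&amp;)
references. The numbers are Unicode code points, which coincide with
ASCII for values below 128.

```python
import html
import re


def decode_char_refs(text):
    """Replace HTML character references with the characters they stand for."""
    return html.unescape(text)


def simple_tokens(text):
    """Hypothetical stand-in tokenizer: decode, lowercase, split into words."""
    decoded = decode_char_refs(text).lower()
    return re.findall(r"[a-z0-9']+", decoded)


# Three obfuscated spellings of the same word.
obfuscated = "hel&#108;o and &#104;ello and &#x68;ello"

print(decode_char_refs(obfuscated))
print(simple_tokens(obfuscated))
```

After decoding, all three spellings produce the same token, so the
classifier sees one word instead of three unrelated ones.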




-Cd-MaN




P.S. Sorry if this issue has already been brought up.









