[Spambayes] Watch out for this

Wed Sep 10 07:59:31 EDT 2003

Hello.

Sorry for writing without registering to any mailing list or at sf.net, 

but I'm kind of lazy. Anyway, I've become kind of interested in text 

classification after reading some of the papers here. For me spam isn't 

a big issue yet (getting 2-5 spam mails daily is tolerable), but 

still I hope you are progressing well. Maybe later on I'll install it 

too.

Now the issue I'm writing about: there has been a lot of talk about the 

tokenizing part (should it include the html tags or not, etc.), but I 

would like to add one idea of my own: as you know in html pages 

characters can be written as &#<number>;, where <number> represents the 

Unicode code point of the character (the same as ASCII below 128). Now, if 

you don't convert these characters back to their corresponding values, a 

spammer could use almost infinite variations of words (writing 

hel&#<code for l>;o or &#<code for h>;ello or replacing multiple 

characters with their codes). So my suggestion would be to convert them 

back. PHP already has such a function (html_entity_decode), so you could 

take a look at their source code for help.
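
To make the suggestion concrete, here is a rough sketch in Python. It assumes the standard library's html.unescape (Python 3.4 and later) is an acceptable stand-in for PHP's html_entity_decode; the name decode_entities and the sample strings are made up for illustration, and this is not how the Spambayes tokenizer actually works.

```python
import html


def decode_entities(text):
    """Turn HTML character references back into plain characters.

    html.unescape handles decimal (&#108;), hexadecimal (&#x6c;)
    and named (&amp;) references.
    """
    return html.unescape(text)


# Hypothetical obfuscated spellings a spammer might use.
samples = ["hel&#108;o", "&#104;ello", "&#x76;iagra"]

for s in samples:
    # After decoding, every variant collapses to one plain-text token.
    print(s, "->", decode_entities(s))
```

If the decoding runs before tokenizing, all the obfuscated spellings of a word end up as the same token, so the classifier sees one word instead of many rare variants.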

-Cd-MaN

P.S. Sorry if this issue has already been brought up.