[Spambayes] html checking

Meyer, Tony T.A.Meyer at massey.ac.nz
Tue Jun 17 19:22:03 EDT 2003


> Instead of ignoring html completely, how about checking for 
> invalid html tags and having that be an indicator for spam? 
> More and more spam that I receive now has invalid html tags 
> imbedded in the "keywords" to make them invisible to 
> scanners: pen<qwerty>is or via<hxpyqz>gra. They are 
> "discarded" and the message appears normally.

The latest code in CVS (not alpha2 or the current Outlook binary) would
change these to "penis" and "viagra", respectively.  This means that the
words are no longer hidden and the message is appropriately scored.
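
As a rough illustration of that behaviour (a minimal sketch only, not the
actual Spambayes tokenizer code; the regular expression and the name
strip_tags are assumptions made up for this example), deleting anything
that looks like a tag rejoins the split word:

```python
import re

# Hypothetical pattern: anything between '<' and '>' is treated as a tag.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(text):
    """Delete tag-like spans so the text on either side joins up."""
    return TAG_RE.sub("", text)

print(strip_tags("pen<qwerty>is or via<hxpyqz>gra"))
```

Because the tag is replaced with an empty string rather than a space, the
two halves of the word end up as a single token.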

If messages are still being incorrectly classified, then you could
easily enough modify the code to generate an additional token for these
tags (although you would either need to generate a token for every html
tag, or keep a list of valid ones).  It seems unlikely that messages
would still be misclassified, however.
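
A minimal sketch of the "list of valid ones" variant might look like the
following.  Everything here is an assumption for illustration: the
VALID_TAGS set is deliberately incomplete, and the function name and the
token format are made up rather than taken from the Spambayes code.

```python
import re

# Illustrative, deliberately incomplete set of valid tag names (assumption).
VALID_TAGS = {
    "a", "b", "body", "br", "div", "font", "head", "html",
    "i", "img", "p", "span", "table", "td", "tr",
}

# Captures the name of an opening or closing tag.
TAG_NAME_RE = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def invalid_tag_tokens(text):
    """Yield one synthetic token per tag whose name is not in VALID_TAGS."""
    for match in TAG_NAME_RE.finditer(text):
        name = match.group(1).lower()
        if name not in VALID_TAGS:
            # Hypothetical token format, chosen only for this example.
            yield "invalid-html-tag:" + name

print(list(invalid_tag_tokens("<b>buy</b> via<hxpyqz>gra")))
```

Emitting one token per tag name is only one option; a single generic
token for any unrecognised tag would be the other obvious choice, and
only testing would show which (if either) helps.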

Feel free to give it a go yourself though, if you are programmatically
minded.  If you post a patch, others may run tests as well and
contribute their results.  If it turns out that this does have an impact
on classification results, then it would be added to the core code.

If you're not programmatically minded, then you have a much harder task!
You need to convince someone who is to do the above.  The easiest way
(unless someone else responds to this) is to open a feature request via
the sf system (http://sf.net/projects/spambayes).  Note that you'll have
to convince people that it may make a difference, though.

=Tony Meyer


