[Spambayes] how spambayes handles image-only spams

Tue Sep 9 17:40:56 EDT 2003

>>> "Tim Peters" wrote
> spambayes was developed against many people's test corpora, although, as I
> said before, I don't think any of them had a significant quantity of HTML
> ham (and I don't think Bill's did either).  

Not true - my mungo-testset of 30K or so items had a significant amount of
HTML - this is why I jumped up and down on the HTML stuff. Tim's original
python-list archive had almost no HTML ham.

> In spambayes I'd be more inclined to write special code to identify the
> img-src-http dance, and synthesize a token for that.  It's only one token,
> though, and all tokens carry the same weight here -- it may still not be
> enough to give "a typical" short message of this ilk a strong enough score
> to nail it.  The only way to know is to try it.
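
A rough sketch of what synthesising one token for the img-src-http case
might look like. The function name, the regex and the token string are all
made up for illustration; this is not the actual spambayes tokenizer code,
and a real version would want proper HTML parsing rather than a regex.

```python
import re

# Hypothetical pattern: an <img> tag whose src attribute is an http(s) URL.
IMG_SRC_HTTP = re.compile(r'<img\b[^>]*\bsrc\s*=\s*["\']?https?://',
                          re.IGNORECASE)

def synthesize_img_tokens(body):
    """Yield a single synthetic token if the body references a remote image.

    Only one token is produced no matter how many images there are, so the
    clue is counted once per message.
    """
    if IMG_SRC_HTTP.search(body):
        yield "synthetic:img-src-http"
```

Whether that one extra token moves the score enough is exactly the thing
that would have to be measured against real corpora.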

Note also that synthesising a bunch of (highly correlated) clues to attempt
to fix the problem of the single-URL spams often leads to unexpected (bad)
consequences when scoring other email. In general, highly correlated clues
are bad: the classifier treats each clue as independent evidence, so one
underlying feature ends up counted several times. This is why it's necessary
to test additions, to make sure that in fixing one problem you're not adding
4 others. Someone should go through the list archives and work out the ratio
of attempted tokeniser tricks that made things worse to ones that actually
improved the situation. I'd guess that it's something like 4 rejected for
every one that went in...
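
To make the correlated-clue problem concrete, here is a toy sketch. It uses
a plain naive-Bayes-style product to combine clues, which is a
simplification and not necessarily the combining scheme spambayes uses, and
the per-token probabilities are invented for the example.

```python
from math import prod

def combine(probs):
    """Combine per-clue spam probabilities, assuming clues are independent."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Hypothetical per-token spam probabilities for one message.
base = [0.8, 0.4, 0.4]

# The same message after adding three more tokens that all fire on the
# same underlying feature as the 0.8 clue (i.e. perfectly correlated).
with_correlated = [0.8] * 4 + [0.4, 0.4]

print(combine(base))
print(combine(with_correlated))
```

The second score is pushed much closer to 1.0 even though the extra tokens
carry no new information, and the same inflation applies to any ham that
happens to trip the same feature.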

Anthony
-- 
Anthony Baxter     <anthony at interlink.com.au>   
It's never too late to have a happy childhood.