[Spambayes] how spambayes handles image-only spams

Mon Sep 8 16:09:55 EDT 2003

    Ryan> Most, but not all, of the useful content. For instance, in my
    Ryan> corpora, HTML comments are only present in spam. Wouldn't an
    Ryan> html:comment token provide more discriminating information than
    Ryan> skipping the token altogether? 

Maybe.  Try it and see.  I know you begged off in an earlier message because
you're a Python beginner, but the tokenizer is pretty straightforward stuff.
If you check out the code from CVS, the worst that can happen is that you
screw it up so badly you have to delete the broken file(s) and execute
"cvs up" to get back to a stable baseline.
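
To give a rough idea of how small such an experiment can be, here is a
minimal sketch of a generator that emits one marker token when a message body
contains an HTML comment.  It is not taken from tokenizer.py: the function
name, the regular expression, and the token spelling "html:comment" (borrowed
from Ryan's suggestion) are all assumptions for illustration.

```python
import re

# Hypothetical helper for illustration only; not part of SpamBayes.
# Matches an HTML comment, including comments that span several lines.
html_comment_re = re.compile(r"<!--.*?-->", re.DOTALL)

def tokenize_html_comments(text):
    """Yield one marker token if text contains at least one HTML comment."""
    if html_comment_re.search(text):
        # Token name is an assumption, taken from the suggestion above.
        yield "html:comment"
```

Whether a token like this helps can only be settled by running it against a
corpus that actually contains HTML ham as well as HTML spam.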

    Ryan> Also, looking for a FONT tag with a COLOR=<white> would help
    Ryan> discriminate the "image-only" spams I see. The only reason I
    Ryan> raised this issue is a few "image-only" spams with white-on-white
    Ryan> random text have gotten by my SpamBayes (007 Plug-in) filter.

Again, you have to try it and see.  I think we've established fairly well
that most of the SpamBayes developers get very little legitimate HTML email.
If we were to try any of your suggestions using our existing training
databases, the results would be inconclusive at best.
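
For anyone who does have a suitable corpus, the FONT idea could be tried with
something like the sketch below.  Everything in it is an assumption made for
illustration: the function name, the token name, the regular expression, and
the set of values treated as "white".  It also only inspects FONT tags, so it
would miss white text set through CSS or other attributes.

```python
import re

# Hypothetical helper for illustration only; not part of SpamBayes.
# Color values treated as white here; this set is an assumption.
WHITE_VALUES = {"white", "#fff", "#ffffff"}

# Captures the value of a COLOR attribute inside a FONT tag, quoted or not.
font_color_re = re.compile(
    r'<font\b[^>]*?\bcolor\s*=\s*["\']?\s*([#\w]+)', re.IGNORECASE)

def tokenize_white_fonts(text):
    """Yield one marker token per FONT tag whose COLOR is white."""
    for match in font_color_re.finditer(text):
        if match.group(1).lower() in WHITE_VALUES:
            # Token name is made up for this sketch.
            yield "html:font-color-white"
```

As with any new token, the only meaningful test is a before-and-after run
over ham and spam that both contain real HTML.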

    Ryan> That said, I'm getting a 96.3% capture rate now, with zero false
    Ryan> positives to date. I'm very happy with SpamBayes, I just want to
    Ryan> help make it a little bit better if possible. My intuition tells
    Ryan> me ignoring HTML tags is ignoring useful content, but I could be
    Ryan> totally wrong.

Intuition is a trap which was laid (apparently by Bayes himself) for
everyone whose SourceForge userid is associated with the tokenizer
module. ;-)

    Ryan> I'm going to figure out how to add these tokens to a customized
    Ryan> parser on my own, and report on the results. I'll see if they help
    Ryan> at all.

Why do you need a customized parser?  You'd probably reach your end goal
faster by reading and modifying tokenizer.py.  If you have questions about
it, post to spambayes-dev at python.org.  A few rudimentary Python questions
not directly related to SpamBayes would probably be tolerated, at least in
the context of a SpamBayes-related post, but if you have a lot of them, you'd
be better off posting them to help at python.org or joining the
tutor at python.org mailing list.

Skip