[Spambayes] how spambayes handles image-only spams
skip at pobox.com
Mon Sep 8 16:09:55 EDT 2003
Ryan> Most, but not all, of the useful content. For instance, in my
Ryan> corpora, HTML comments are only present in spam. Wouldn't an
Ryan> html:comment token provide more discriminating information than
Ryan> skipping the token altogether?
Maybe. Try it and see. I know you begged off in an earlier message because
of your Python beginner status, but the tokenizer is pretty straightforward
stuff. If you check out the code from cvs, the worst that
can happen is you screw it up so badly you have to delete the broken file(s)
and execute "cvs up" to get back to a stable baseline.
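To give a feel for how small such an experiment could be, here is a rough sketch of the html:comment idea. It is not taken from tokenizer.py: the function name comment_tokens and the regular expression are made up for illustration, and the real tokenizer is organized differently. The sketch strips HTML comments from the text and emits a single "html:comment" token if any were found.

```python
import re

# Hypothetical helper for illustration only; not part of SpamBayes' tokenizer.py.
# Matches an HTML comment, including comments that span several lines.
HTML_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def comment_tokens(text):
    """Return (stripped_text, tokens).

    Each comment is replaced by a single space, and one 'html:comment'
    token is produced if the text contained at least one comment.
    """
    stripped, count = HTML_COMMENT_RE.subn(" ", text)
    tokens = ["html:comment"] if count else []
    return stripped, tokens
```

Whether the extra token helps or hurts is exactly the kind of thing only a test run against your own corpora can answer.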
Ryan> Also, looking for a FONT tag with a COLOR=<white> would help
Ryan> discriminate the "image-only" spams I see. The only reason I
Ryan> raised this issue is a few "image-only" spams with white-on-white
Ryan> random text have gotten by my SpamBayes (007 Plug-in) filter.
Again, you have to try it and see. I think we've established fairly well
that most of the SpamBayes developers get very little valid HTML email. If
we were to try any of your suggestions using our existing training databases
the results would be inconclusive, at best.
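For anyone who does have a corpus with enough HTML mail in it, a white-font check might look something like the sketch below. Everything in it is an assumption made for illustration: the name white_font_tokens, the token string "html:font-white", and the short list of values treated as white are not from SpamBayes. It only looks at the COLOR attribute of FONT tags and does not compare against the page background.

```python
import re

# Hypothetical names for illustration only; none of this is SpamBayes code.
FONT_TAG_RE = re.compile(r"<font\b[^>]*>", re.IGNORECASE)
COLOR_ATTR_RE = re.compile(r"""\bcolor\s*=\s*["']?\s*([#\w]+)""", re.IGNORECASE)

# Assumed spellings of "white"; a real experiment might want a longer list.
WHITE_VALUES = {"white", "#fff", "#ffffff"}

def white_font_tokens(html):
    """Return ['html:font-white'] if any FONT tag sets COLOR to white, else []."""
    for tag in FONT_TAG_RE.findall(html):
        match = COLOR_ATTR_RE.search(tag)
        if match and match.group(1).lower() in WHITE_VALUES:
            return ["html:font-white"]
    return []
```

Near-white colors such as #FEFEFE would slip past this check, so it is a starting point for an experiment rather than a finished clue generator.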
Ryan> That said, I'm getting a 96.3% capture rate now, with zero false
Ryan> positives to date. I'm very happy with SpamBayes, I just want to
Ryan> help make it a little bit better if possible. My intuition tells
Ryan> me ignoring HTML tags is ignoring useful content, but I could be
Ryan> totally wrong.
Intuition is a trap which was laid (apparently by Bayes himself) for
everyone whose SourceForge userids are associated with the tokenizer.
Ryan> I'm going to figure out how to add these tokens to a customized
Ryan> parser on my own, and report on the results. I'll see if they help
Ryan> at all.
Why do you need a customized parser? You'd probably reach your end goal
faster by reading and modifying tokenizer.py. If you have questions about
it, post to spambayes-dev at python.org. A few rudimentary Python questions
not directly related to SpamBayes would probably be tolerated, at least in
the context of a SpamBayes-related post, but if you have a lot of them,
you'd be better off posting such missives to help at python.org or joining
the tutor at python.org mailing list.