[Spambayes] RE: Trapping Spam messages that contain images...

Tue Oct 19 06:06:43 CEST 2004

> However these days I am receiving a new kind of Spam that sneaks
> through my defences and Spambayes cannot trap.

Try enabling some of the experimental options.  In particular, try:

 [Classifier] x-use_bigrams
 [Tokenizer] x-pick_apart_urls
 [Tokenizer] x-fancy_url_recognition
 [URLRetriever] x-slurp_urls

To try these with the Outlook plug-in, open (or create) the file
default_bayes_customize.ini in your data directory, and add the option(s),
like this:

[Classifier]
x-use_bigrams:True
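
If you would rather script the change than edit the file by hand, here is a minimal sketch using Python's standard configparser module. The function name and the idea of setting all four options to True are assumptions for illustration (the example above only shows x-use_bigrams:True), and the path to your data directory is something you need to supply yourself.

```python
import configparser

# Options named in this message, grouped by section. Setting every one of
# them to True is an assumption; only x-use_bigrams:True is shown above.
EXPERIMENTAL_OPTIONS = {
    "Classifier": ["x-use_bigrams"],
    "Tokenizer": ["x-pick_apart_urls", "x-fancy_url_recognition"],
    "URLRetriever": ["x-slurp_urls"],
}


def enable_experimental_options(ini_path):
    """Add the experimental options to ini_path, keeping existing settings.

    Hypothetical helper, not part of SpamBayes.
    """
    # Listing ":" first makes the parser write "name:value", matching the
    # style used in the example above; "=" is still accepted when reading.
    parser = configparser.ConfigParser(delimiters=(":", "="),
                                       interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read(ini_path)     # a missing file is silently skipped
    for section, names in EXPERIMENTAL_OPTIONS.items():
        if not parser.has_section(section):
            parser.add_section(section)
        for name in names:
            parser.set(section, name, "True")
    with open(ini_path, "w") as handle:
        parser.write(handle, space_around_delimiters=False)
```

Call it with the full path to default_bayes_customize.ini. Note that configparser drops any comments in an existing file when it rewrites it, so edit by hand if you want to keep those.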

> Anyhow, I had an idea to trap this type of message that I
> thought I might put out for discussion. [...] Basically
> it would involve some additional functionality to allow OCR
> processing of images that are referenced in emails.

It has been a long time since I've done any OCR - is it really fast and
accurate enough to be useful in situations like this?  We'd also (probably)
need to use an open-source OCR library (rather than write our own), which
adds packaging complications.

It's possible it would help, but I suspect that it would be very expensive
for little gain.  Feel free to add it to the wiki (http://entrian.com/sbwiki),
where there are other ideas to try out.

I get hardly any false negatives/unsures that are mostly images.  If I ever
do (and so have a testing corpus), then I think it would be interesting to
try a simpler scheme, where tokens are generated from simple features
(perhaps Haar-like features) of the image, and the classifier uses them as
it sees fit.  The theory would be that it could pick up some features
common to good/bad images without looking for things like text.
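
To make the idea concrete, here is a rough sketch of what such a token generator might look like. It is not SpamBayes code and has not been tried on a real corpus. It assumes the image has already been decoded into a grayscale grid (a list of rows of 0-255 values), and the grid size, threshold, and token format are all made up for illustration. It computes two-rectangle Haar-like features (left half minus right half, top half minus bottom half) over a coarse grid using a summed-area table, and turns each into a coarse string token.

```python
def integral_image(pixels):
    """Return a summed-area table with an extra leading row/column of zeros."""
    height, width = len(pixels), len(pixels[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += pixels[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def region_sum(table, top, left, bottom, right):
    """Sum of pixels in rows [top, bottom) and columns [left, right)."""
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


def bucket(value, threshold):
    """Collapse a feature value into one of three coarse labels."""
    if value > threshold:
        return "pos"
    if value < -threshold:
        return "neg"
    return "flat"


def image_tokens(pixels, grid=2, threshold=16):
    """Yield string tokens from Haar-like features on a grid x grid layout.

    The token format ("image:haar:...") is hypothetical.
    """
    height, width = len(pixels), len(pixels[0])
    table = integral_image(pixels)
    for gy in range(grid):
        for gx in range(grid):
            top, bottom = gy * height // grid, (gy + 1) * height // grid
            left, right = gx * width // grid, (gx + 1) * width // grid
            mid_x, mid_y = (left + right) // 2, (top + bottom) // 2
            # Normalise by half the cell area so values are on a 0-255 scale.
            half_area = max(1, (bottom - top) * (right - left) // 2)
            # Left half minus right half: responds to vertical edges.
            left_right = (region_sum(table, top, left, bottom, mid_x)
                          - region_sum(table, top, mid_x, bottom, right))
            # Top half minus bottom half: responds to horizontal edges.
            top_bottom = (region_sum(table, top, left, mid_y, right)
                          - region_sum(table, mid_y, left, bottom, right))
            for name, value in (("lr", left_right), ("tb", top_bottom)):
                yield "image:haar:%d,%d:%s:%s" % (
                    gx, gy, name, bucket(value / half_area, threshold))
```

The tokens would then be fed to the classifier alongside the usual text tokens, so training decides which (if any) of them carry evidence. Bucketing into a few labels keeps the token vocabulary small enough that the same token can recur across messages.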

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.