[Spambayes] Analyzing text in image spam (was: Spam in Images)

Luigi Pugnetti pl at symbolic.it
Fri Nov 3 17:32:13 CET 2006


On Fri, 2006-11-03 at 09:56 -0600, skip at pobox.com wrote:
>     >> Once you're ready to go, add the following to your SpamBayes options:
>     >> 
>     >> x-lookup_ip: True
>     >> lookup_ip_cache: ~/.dnscache
>     >> 
> 
>     Luigi> Is anyone using this option?  It seems to me that this option
>     Luigi> alone does nothing. You have to enable both x-lookup_ip and
>     Luigi> x-pick_apart_urls.  Is that right, or am I missing something?
> 
> Perhaps.  I can't recall.  Do you have PyDNS installed?
Yes, I have PyDNS installed. I used tcpdump to monitor DNS requests, and
there are none when x-pick_apart_urls is disabled. Looking at the code,
it seems that the check for x-lookup_ip sits inside an
"if x-pick_apart_urls is enabled" block.
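
To make the point concrete, here is a minimal sketch of the control flow
as I read it. This is not the actual SpamBayes tokenizer code: the
function name url_tokens, the plain dict of options, the token prefixes
and the resolve callback are all made up for illustration. It only shows
why turning on x-lookup_ip by itself would produce no DNS traffic.

```python
# Hypothetical sketch, NOT the real SpamBayes tokenizer.
def url_tokens(url_host, options, resolve):
    """Return tokens for one URL host, mimicking the nesting described above.

    options is a plain dict standing in for the SpamBayes options;
    resolve is any callable mapping a hostname to an IP string.
    """
    tokens = []
    if options.get("x-pick_apart_urls"):
        # The host is split into pieces only when this option is on...
        tokens.extend("url:" + part for part in url_host.split("."))
        # ...and the IP lookup check lives inside the same branch,
        # so it is never reached when x-pick_apart_urls is off.
        if options.get("x-lookup_ip"):
            tokens.append("url-ip:" + resolve(url_host))
    return tokens
```

With only x-lookup_ip set, resolve is never called and the result is
empty, which matches what tcpdump shows.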

> 
>     Luigi> Once both are enabled it seems to work but the mail processing is
>     Luigi> very very slow.
> 
> First time through, yes.  After that, it should (in theory) rely on its
> cache of IP address information.  I may have some pending checkins for that
> though (*).  Note also that a fairly small training database works for me (fewer
> than 100 hams, 250-300 spams).  If you have a massive training database,
> then, yes, this will slow things down dramatically.  The IP lookup and image
> OCR stuff changes the properties of your database enough that I think it's
> worth retraining from scratch.

I tried it on a sample of 5000 emails, but I stopped it because it had
not finished after more than half an hour. From tcpdump I could see a
DNS request every second or two (or something like that). Even allowing
for the fact that not every mail contains a URL, that is very slow.
For the record, I ran it on Windows XP with OCR scanning enabled as
well, but OCR alone was much faster.
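
The "slow the first time, faster afterwards" behaviour is what one would
expect from a persistent lookup cache. Below is a rough sketch of that
idea, assuming a simple pickle file on disk. The class CachedResolver,
its misses counter and the file format are my own assumptions and are
not how SpamBayes or PyDNS actually store the lookup_ip_cache; the
lookup function is injectable so the sketch can be tried without
touching the network.

```python
import pickle
import socket


class CachedResolver:
    """Hypothetical hostname -> IP cache backed by a pickle file."""

    def __init__(self, path=None, lookup=socket.gethostbyname):
        self.path = path        # where the cache is persisted, if anywhere
        self.lookup = lookup    # callable doing the real (slow) lookup
        self.misses = 0         # number of lookups that bypassed the cache
        self.cache = {}
        if path is not None:
            try:
                with open(path, "rb") as f:
                    self.cache = pickle.load(f)
            except (OSError, EOFError, pickle.UnpicklingError):
                # Missing or unreadable cache file: start empty.
                self.cache = {}

    def resolve(self, host):
        """Return the cached IP for host, doing a real lookup only on a miss."""
        if host not in self.cache:
            self.misses += 1
            try:
                self.cache[host] = self.lookup(host)
            except OSError:
                # Remember failures too, so dead hosts are not retried.
                self.cache[host] = None
        return self.cache[host]

    def save(self):
        """Write the cache back to disk so the next run starts warm."""
        if self.path is not None:
            with open(self.path, "wb") as f:
                pickle.dump(self.cache, f)
```

On a first pass over 5000 mails every distinct host is a miss, so the
run is bounded by DNS latency; a second pass over the same mails should
only hit the dictionary.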

> 
> Skip
> 
> (*) Alas, I didn't get around to checking stuff in last night.  Maybe over
> the weekend.
> 
> S
-- 
Luigi Pugnetti

Symbolic S.p.A.
V.le Mentana, 29
I-43100 Parma
Italy

Tel: +39 0521 708811
Fax: +39 0521 776190



