[Spambayes] Analyzing text in image spam (was: Spam in Images)
pl at symbolic.it
Fri Nov 3 17:32:13 CET 2006
On Fri, 2006-11-03 at 09:56 -0600, skip at pobox.com wrote:
> >> Once you're ready to go, add the following to your SpamBayes options:
> >> x-lookup_ip: True
> >> lookup_ip_cache: ~/.dnscache
> Luigi> Is anyone using this option? It seems to me that this option alone
> Luigi> does nothing. You have to enable both x-lookup_ip and
> Luigi> x-pick_apart_urls. Is that right, or am I missing something?
> Perhaps. I can't recall. Do you have PyDNS installed?
Yes, I have PyDNS installed. I used tcpdump to monitor DNS requests, and
there are no requests if x-pick_apart_urls is disabled. Looking at the
code, it seems that the check for x-lookup_ip is inside an if(pick_url
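To illustrate the nesting described above, here is a minimal sketch. It is not the actual SpamBayes tokenizer code: the function `url_tokens`, the plain `options` dict, the `resolve` callback, and the token prefixes are all hypothetical names chosen for the example. It only shows why a lookup that lives inside the URL-parsing branch never fires when x-pick_apart_urls is off.

```python
# Illustrative sketch only; NOT the real SpamBayes tokenizer.
# `options` is a plain dict standing in for the SpamBayes options object,
# and `resolve` is any callable mapping a host name to an IP string.

def url_tokens(url_host, options, resolve):
    """Return the extra tokens generated for one URL host."""
    tokens = []
    if options.get("x-pick_apart_urls"):
        # URL decomposition runs only when this option is on ...
        tokens.extend("url:" + part for part in url_host.split("."))
        if options.get("x-lookup_ip"):
            # ... and the DNS lookup sits inside the same branch, so it
            # is never reached when x-pick_apart_urls is disabled.
            tokens.append("url-ip:" + resolve(url_host))
    return tokens
```

With only x-lookup_ip set, the function returns no tokens and never calls `resolve`, which would match seeing no DNS traffic in tcpdump.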
> Luigi> Once both are enabled it seems to work but the mail processing is
> Luigi> very very slow.
> First time through, yes. After that, it should (in theory) rely on its
> cache of IP address information. I may have some pending checkins for that
> though (*). Note also that a fairly small training database works for me (fewer
> than 100 hams, 250-300 spams). If you have a massive training database,
> then, yes, this will slow things down dramatically. The IP lookup and image
> OCR stuff changes the properties of your database enough that I think it's
> worth retraining from scratch.
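The caching behaviour described above (slow on the first pass, fast afterwards) can be sketched roughly as follows. This is an assumption-laden illustration, not the SpamBayes cache implementation: the class name `CachedResolver` and its methods are hypothetical, and the on-disk format (a pickled dict) is chosen only for brevity.

```python
import pickle
import socket


class CachedResolver:
    """Hypothetical host -> IP cache; not the SpamBayes implementation."""

    def __init__(self, resolve=socket.gethostbyname):
        self.resolve = resolve   # function that performs the real lookup
        self.cache = {}          # host name -> IP address string

    def lookup(self, host):
        # Only hit the network the first time a host is seen.
        if host not in self.cache:
            self.cache[host] = self.resolve(host)
        return self.cache[host]

    def save(self, path):
        # Persist the cache so a later run can skip repeated lookups.
        with open(path, "wb") as f:
            pickle.dump(self.cache, f)

    def load(self, path):
        with open(path, "rb") as f:
            self.cache = pickle.load(f)
```

In a sketch like this, a second pass over the same mail costs one dictionary lookup per host instead of one DNS round trip, so the first run dominates the total time.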
I tried it on a sample of 5000 emails, but I stopped it because it had not
finished after more than half an hour. From tcpdump I could see a
request every 1,2 seconds (or something like that). Even considering
that not every mail contains a URL, it was very slow.
As a note, I tried it on Windows XP with OCR scanning enabled as well;
OCR alone was much faster.
> (*) Alas, I didn't get around to checking stuff in last night. Maybe over
> the weekend.