[Spambayes] Analyzing text in image spam (was: Spam in Images)

Sun Nov 5 20:24:05 CET 2006

    Luigi> Once both are enabled it seems to work but the mail processing is
    Luigi> very very slow.

    >> First time through, yes.  After that, it should (in theory) rely on
    >> its cache of IP address information.  I may have some pending
    >> checkins for that though (*).  Note also that a fairly small training
    >> database works for me (fewer than 100 hams, 250-300 spams).  If you
    >> have a massive training database, then, yes, this will slow things
    >> down dramatically.  The IP lookup and image OCR stuff changes the
    >> properties of your database enough that I think it's worth retraining
    >> from scratch.
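
To make the caching point above concrete, here is a minimal sketch of the
idea.  It is not the actual SpamBayes code: the class name, the ttl
parameter, and the injectable resolver are all hypothetical, and this
version only keeps its cache in memory, so it would still be slow on the
first pass of every run.

```python
import socket
import time


class LookupCache:
    """Hypothetical hostname -> IP cache; NOT the SpamBayes implementation."""

    def __init__(self, resolver=socket.gethostbyname, ttl=3600.0,
                 clock=time.monotonic):
        self._resolver = resolver   # function doing the slow network lookup
        self._ttl = ttl             # seconds a cached answer stays valid
        self._clock = clock
        self._entries = {}          # host -> (expiry time, result)

    def lookup(self, host):
        now = self._clock()
        hit = self._entries.get(host)
        if hit is not None and hit[0] > now:
            return hit[1]           # served from cache, no network traffic
        try:
            result = self._resolver(host)
        except OSError:
            # Cache failures as None so dead hosts are not retried
            # for every message that mentions them.
            result = None
        self._entries[host] = (now + self._ttl, result)
        return result
```

With something like this, only the first message that mentions a given
host pays for the lookup; repeats of the same host are answered locally.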

    Luigi> I have tried on a sample of 5000 emails but I stopped it because
    Luigi> after more than half an hour it didn't finish. From tcpdump I
    Luigi> could see a request every 1,2 seconds (or something like that)
    Luigi> now even considering that not every mail contains an url it was
    Luigi> very slow.  As a note I tried it on windows XP with ocr scanning
    Luigi> enabled but ocr alone was much faster.

I can't imagine a scenario where I would need 5000 emails to get decent
results with SpamBayes.  If that were the common case, everyone would give
up on it long before it was of any use.  I still suggest you try starting from


