[Spambayes] Spam Prefiltering

David Abrahams dave at boost-consulting.com
Sat May 22 22:03:02 EDT 2004


"Tony Meyer" <tameyer at ihug.co.nz> writes:

>> So now the question is, what to do?  Clearly I don't want to route
>> these into my spam training folder automatically unless I have a way
>> to balance them with Ham.  Will I get better results from SpamBayes if
>> I give it a chance to learn from these messages (i.e. send it to my
>> INBOX and let SpamBayes filter them), or should I just discard them?
>
> What sort of results are you getting?  

Pretty great, though I get 8-20 messages classified as "unsure" each
day.

> If you are happy with them, then I'd say there was no reason to
> change.

Well, I'm pretty sure that the creeping Ham/Spam imbalance is hurting
performance.  Sometimes it's not so creeping: the blacklists are
catching *lots* of messages, and my Spam training folder grew by over
3000 messages in one week.
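
To make "balance them with Ham" concrete, here is a minimal sketch of
one way to cap spam training so the ratio can't run away.  The function
name, the max_ratio parameter and its default of 1.0 are hypothetical
illustrations, not SpamBayes settings.

```python
def spam_training_budget(n_ham, n_spam, max_ratio=1.0):
    """Return how many more spam messages can be trained on before the
    spam:ham ratio exceeds max_ratio (an assumed, illustrative limit)."""
    # Largest spam count allowed for the current amount of trained ham.
    allowed = int(n_ham * max_ratio)
    # Never negative: if we are already over the limit, train on none.
    return max(0, allowed - n_spam)
```

With 1000 ham and 800 spam already trained, spam_training_budget(1000,
800) leaves room for 200 more spam; anything beyond that would wait
until more ham has been trained.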

> If you're getting false positives from the blacklisting, then I'd
> recommend dumping it.  

I honestly don't know, as I only just noticed that the blacklisting
was happening.  I think I did have *one* incident where a blacklist
caused a problem.

> However, as long as you don't go with the
> "reject it" option, the mail is still there, so at least you can
> find it (if it's mail you're expecting, for example).
>
> Are there important time factors?  

Maybe.  

> I assume that the blacklisting is running
> on a server, and SpamBayes is running locally.  

Nope, it's all server side.

> Would the extra volume of mail through SpamBayes have a significant
> effect on the time it took to filter incoming mail?

Possibly, since I need to mail it to myself in order to classify it
(a CommuniGate Pro limitation).

> Are there important bandwidth factors?  I also assume that the mail is
> stored remotely

Yes, but given your assumptions...  How would remote storage and local
filtering work?  Wouldn't that just waste lots of space on the server?

> so mail handled by the blacklisting system is never transferred to
> your local system (unless you manually look at it).  If SpamBayes
> has to handle it, this means a lot more traffic.  Does that matter?

I hope I answered that.

> Try classifying some of the blacklisted messages.  If they're classed as
> spam without training on them, then there probably isn't any worthwhile
> information in there anyway.

Good point.
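
A rough sketch of that experiment: score a sample of the blacklisted
messages *without* training on them, and count where they land.  The
score argument is a stand-in for whatever returns a spam probability
between 0 and 1 for a message; it is not a real SpamBayes function, and
the 0.20/0.90 cutoffs are assumptions to be replaced with your own
configured values.

```python
def summarize_untrained(messages, score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Classify messages with an existing scorer, without training.

    `score` is a hypothetical callable mapping a message to a spam
    probability in [0, 1]; the cutoffs are assumed values.
    """
    counts = {"spam": 0, "unsure": 0, "ham": 0}
    for msg in messages:
        p = score(msg)
        if p >= spam_cutoff:
            counts["spam"] += 1
        elif p <= ham_cutoff:
            counts["ham"] += 1
        else:
            counts["unsure"] += 1
    return counts
```

If nearly everything comes back as "spam", the blacklisted mail is
probably adding little new information; a sizable "unsure" or "ham"
count would suggest it is worth letting SpamBayes see those messages.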

> This is just my 2c, of course.  Maybe you could try feeding them all to
> SpamBayes for a while, see how that goes, and decide after that?

Well, I could.  I guess the main question is, which to try first?  

  1. feed them all to SpamBayes,

  2. discard the blacklisted ones, or

  3. send them somewhere that doesn't affect training
  
I'm inclined to try #2 for a while.
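
For comparison, the three options amount to a one-line policy switch.
This is only an illustrative sketch: the Policy names and the
destination strings are made up, and the real routing would live in the
server's own filtering rules.

```python
from enum import Enum


class Policy(Enum):
    FEED_TO_SPAMBAYES = 1   # option 1
    DISCARD = 2             # option 2
    QUARANTINE = 3          # option 3: kept, but outside training


def route(blacklisted, policy):
    """Return a (hypothetical) destination for one incoming message."""
    # Mail the blacklists did not flag always goes through SpamBayes.
    if not blacklisted or policy is Policy.FEED_TO_SPAMBAYES:
        return "spambayes"
    if policy is Policy.DISCARD:
        return "discard"
    return "quarantine"
```

Switching experiments then means changing only the policy value, which
makes it easy to run each option for a while and compare.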

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com



