[Spambayes] train-to-exhaustion questions

David Abrahams dave at boost-consulting.com
Fri Apr 27 19:10:56 CEST 2007


on Fri Apr 27 2007, skip-AT-pobox.com wrote:

>     >> new{ham,spam}.old.cull are the files written by tte itself.  
>
>     Dave> You mean, as a result of passing '-c'?
>
> Yeah, it's been probably a year or so since I made any changes to either
> tte.py or my tte shell script.  I just recall the command line from my bash
> history and blindly hit RET.

Good to know.  I suppose there's no reason not to do this with a cron
job, if you're that confident in it.

>     Dave> Here's another question: is the ratio argument really the best
>     Dave> interface?  Seems to me that if you keep the number of hams and
>     Dave> spams very close to one another, specifying a ratio that uses all
>     Dave> the training data is very difficult (you have to count all the
>     Dave> messages manually).  Wouldn't it be better to have an --unbalanced
>     Dave> argument that automatically counts and causes all the training
>     Dave> data to get used?
>
> It's served me well.  You might be the second user.  Inputs welcome, but see
> below.

OK, seeing...

>     Dave> And another: the purpose of the -R argument wasn't clear to me,
>     Dave> but I started using it on the assumption that when things get
>     Dave> slightly out-of-balance I was likely to miss training on the
>     Dave> newest data if the algorithm started at the beginning of the
>     Dave> mailbox.  Is that the intended use?
>
> I use mbox format.  The newest messages are at the end.  Normally, the more
> recent messages are more valuable as an indication of current spam
> practices.  -R simply reverses the order you process the mbox.

Yeah, the same ordering logic applies to IMAP, at least on my server.
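
To make the ordering point concrete, here is a minimal sketch of
newest-first processing for an mbox file.  It is not tte.py's actual
code; the helper name newest_first is hypothetical, and it assumes
the mbox is append-only so that file order is oldest-to-newest.

```python
import mailbox


def newest_first(path):
    """Yield messages from an mbox file, last-stored message first.

    Hypothetical helper for illustration only; assumes the file is
    append-only, so the final message is the most recent one.
    """
    box = mailbox.mbox(path)
    try:
        # keys() follows the order of messages in the file, so
        # reversing it walks from the end of the mailbox backwards.
        for key in reversed(box.keys()):
            yield box[key]
    finally:
        box.close()
```

If training stops early because one category runs out, the messages
left unvisited by a loop like this are the oldest ones.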

> The fact that I currently have something like 30 more spams than hams isn't
> a big deal to me.  

Me neither, especially when training on about 850 messages ;-)
If I used your culling approach I'd be training on far fewer, I
guess.

> Since I use -R those messages which are ignored are the oldest ones,
> and presumably the least suggestive of current spammers' practice.

Yes, I expected that to be the logic.

> If I feel the need to process them I'll change the RATIO variable to
> 3:2.  Or I'll visit the saved spam mbox in my mail reader and simply
> delete the 30 oldest messages.

OK... but what will happen if the real ratio of spam to ham is more
like 412:379 and I pass a simple ratio of 3:2?  If I understand tte.py
correctly, instead of skipping (412-379)=33 spams it would actually
end up training only 411 spams and 274 hams, thus skipping one spam
and (379-274)=105 hams.
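
The arithmetic can be checked with a short sketch.  This models only
the reading of the ratio described above (take as many whole
spam:ham "units" as both sets allow); it is an assumption about the
behaviour, not code from tte.py, and trained_counts is a hypothetical
name.

```python
def trained_counts(n_spam, n_ham, spam_part, ham_part):
    """Largest (spam, ham) counts in exactly spam_part:ham_part.

    Illustrative model only, not tte.py's implementation.
    """
    # Whole ratio units available on each side; the scarcer side
    # limits how many units can be trained.
    units = min(n_spam // spam_part, n_ham // ham_part)
    return units * spam_part, units * ham_part


spam_used, ham_used = trained_counts(412, 379, 3, 2)
print(spam_used, ham_used)              # 411 274
print(412 - spam_used, 379 - ham_used)  # 1 105
```

With a 1:1 ratio the same function returns 379 of each, which is the
33-spam skip mentioned above.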

I guess I'm saying that the ratio argument is good for training some
specific ratio of hams and spams... but does anyone really want to
train a specific ratio?  What's the use case? If you've supplied the
ratio argument to make it easy for people to train everything in an
unbalanced set, it's not a very good way of getting there.
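
As a rough sketch of what an automatic count could compute: given
the two message counts, the only ratio that uses every message is the
counts themselves, reduced by their greatest common divisor.  The
--unbalanced option is only a proposal here and does not exist in
tte.py as far as this thread shows; exact_ratio is a hypothetical
name.

```python
from math import gcd


def exact_ratio(n_spam, n_ham):
    """Smallest whole-number spam:ham ratio that uses every message.

    Hypothetical helper sketching the proposed --unbalanced idea.
    """
    if n_spam <= 0 or n_ham <= 0:
        raise ValueError("need at least one message of each kind")
    g = gcd(n_spam, n_ham)
    return n_spam // g, n_ham // g


print(exact_ratio(412, 379))  # (412, 379)
print(exact_ratio(900, 600))  # (3, 2)
```

For counts like 412 and 379, which share no common factor, no
simpler ratio trains on everything, which is why typing one by hand
means counting the messages first.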


> I have come to view spam filtering as both a dynamic process and necessarily
> imprecise.  The former means that my ham/spam collections should change over
> time.  I think of my database as a sliding window of messages instead of a
> record of everything I've encountered in the past.  The latter means I don't
> get too bent out of shape about the occasional misclassification.  

I have come to expect that SB won't misclassify my Ham as Spam; it has
had a pretty good track record for me (although since I started using
tte.py I have had a few more of those than I'm used to).  I don't mind
the other kind of misclassification, occasionally.

> Trying too hard to get everything to classify perfectly just isn't worth the
> effort, 

Sure; at some point you want to just delete a few spams here and
there, and not worry about the filtering system.

> and leads to other problems.  For example, I get the occasional
> message from retailers I've bought products from in the past (Barnes & Noble
> comes to mind).  Sometimes their messages classify as ham, sometimes unsure,
> sometimes spam.  Gmail lets so few spams through that I wind up scanning my
> spam mailbox frequently anyway, and I never treat a B&N message
> classification as a "mistake", no matter where it lands.

Yeah, gmail... I've been thinking of routing all incoming mail on my
server through my gmail account somehow, just to take advantage of
their filtering ;-)

Unfortunately, I want to keep my email address and my server, so
unless Google is going to make their spam blocking technology public
it means SB is going to have to take on the whole job.

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Don't Miss BoostCon 2007! ==> http://www.boostcon.com


