[Spambayes] train-to-exhaustion questions

Fri Apr 27 15:12:49 CEST 2007

    >> new{ham,spam}.old.cull are the files written by tte itself.  

    Dave> You mean, as a result of passing '-c'?

Yeah, it's probably been a year or so since I made any changes to either
tte.py or my tte shell script.  I just recall the command line from my bash
history and blindly hit RET.

    >> newham and newspam are the messages I've saved from my mailer since
    >> the last run.

    Dave> Meaning, the new unsure (and, god forbid, misclassified) messages?

Yup.

    Dave> Here's another question: is the ratio argument really the best
    Dave> interface?  Seems to me that if you keep the number of hams and
    Dave> spams very close to one another, specifying a ratio that uses all
    Dave> the training data is very difficult (you have to count all the
    Dave> messages manually).  Wouldn't it be better to have an --unbalanced
    Dave> argument that automatically counts and causes all the training
    Dave> data to get used?

It's served me well.  You might be the second user.  Inputs welcome, but see
below.

    Dave> And another: the purpose of the -R argument wasn't clear to me,
    Dave> but I started using it on the assumption that when things get
    Dave> slightly out-of-balance I was likely to miss training on the
    Dave> newest data if the algorithm started at the beginning of the
    Dave> mailbox.  Is that the intended use?

I use mbox format.  The newest messages are at the end.  Normally, the more
recent messages are more valuable as an indication of current spam
practices.  -R simply reverses the order in which the mbox is processed.
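
As a rough illustration of what walking an mbox newest-first looks like
(this is not tte.py's actual code; the function name is hypothetical and it
uses only the standard library's mailbox module), a minimal sketch:

```python
import mailbox


def messages_newest_first(path):
    """Yield messages from an mbox file, last-appended message first.

    Hypothetical helper for illustration only.  It assumes the mbox is in
    append order, i.e. the newest messages are at the end of the file.
    """
    box = mailbox.mbox(path, create=False)
    try:
        # keys() returns the message keys in file order; reverse them so
        # the most recently appended message comes out first.
        for key in reversed(box.keys()):
            yield box[key]
    finally:
        box.close()
```

Consuming this generator and stopping early means whatever is left
unprocessed is the oldest mail in the file.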

The fact that I currently have something like 30 more spams than hams isn't
a big deal to me.  Since I use -R, the messages that get ignored are the
oldest ones, and presumably the least suggestive of current spammers'
practices.  If I feel the need to process them I'll change the RATIO variable
to 3:2.  Or I'll visit the saved spam mbox in my mail reader and simply
delete the 30 oldest messages.
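
To make the effect concrete, here is a small sketch of the 1:1 case only.
It is an assumption about the behaviour described above, not a copy of what
tte.py does, and balance_newest is a hypothetical name.  Both lists are
taken to be in mbox order, oldest first:

```python
def balance_newest(ham, spam):
    """Keep equal numbers of ham and spam, preferring the newest messages.

    Illustrative sketch only (assumes a 1:1 ratio).  Both inputs are lists
    in mbox order, oldest first.  Returns a tuple
    (kept_ham, kept_spam, ignored_ham, ignored_spam).
    """
    n = min(len(ham), len(spam))
    # Slice from an explicit start index rather than [-n:], because
    # [-0:] would return the whole list instead of an empty one.
    ham_cut = len(ham) - n
    spam_cut = len(spam) - n
    return ham[ham_cut:], spam[spam_cut:], ham[:ham_cut], spam[:spam_cut]


# Example: 100 hams and 130 spams, named by their position in the mbox.
ham = ["ham%d" % i for i in range(100)]
spam = ["spam%d" % i for i in range(130)]
kept_ham, kept_spam, ignored_ham, ignored_spam = balance_newest(ham, spam)
```

In this example all 100 hams and the 100 newest spams are kept, and the 30
spams at the front of the mbox (the oldest) are the ones left out.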

I have come to view spam filtering as both a dynamic process and necessarily
imprecise.  The former means that my ham/spam collections should change over
time.  I think of my database as a sliding window of messages instead of a
record of everything I've encountered in the past.  The latter means I don't
get too bent out of shape about the occasional misclassification.  Trying
too hard to get everything to classify perfectly just isn't worth the
effort, and leads to other problems.  For example, I get the occasional
message from retailers I've bought products from in the past (Barnes & Noble
comes to mind).  Sometimes their messages classify as ham, sometimes unsure,
sometimes spam.  Gmail lets so few spams through that I wind up scanning my
spam mailbox frequently anyway, and I never treat a B&N message
classification as a "mistake", no matter where it lands.

Skip