[spambayes-dev] Re: Idea to re-energize corpus learning

Mon Nov 17 20:22:09 EST 2003

Skip Montanaro wrote:

>>>>>>"Martin" == Martin Stone Davis <m0davis at pacbell.net> writes:
> 
> 
>     Martin> Skip Montanaro wrote:
>     Martin> I recently started this thread on the POPFile forum, but it
>     Martin> applies just as well to SpamBayes.
>     >> 
>     Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099
>     >> 
>     >> See my note from Sunday on spambayes-dev:
>     >> 
>     >> http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html
> 
>     Martin> Wouldn't it be nice if there were some middle ground between
>     Martin> continuing to train the huge immovable database and starting
>     Martin> over fresh?
> 
> Sure, it would, but why propagate mistakes, even if they are smaller in
> magnitude?  I should have continued my previous message instead of leaving
> people to draw their own conclusions.  With a small database, if you have an
> error, it's easier to find, and if you can't find it, starting from scratch
> is not a big problem.  With a large database there's this feeling that,
> "but... but... but...  I'll be throwing away all that *good* data and all my
> (valuable) work!"
> 
>     Martin> Having to train 100% of incoming messages after starting over is
>     Martin> real work, and especially frustrating when you *know* that
>     Martin> 80-90% would have been correctly classified anyway if only you
>     Martin> hadn't started over.
> 
> If you only train on mistakes and unsures (as many of us appear to do now),
> then the effort is lessened.  I don't see any practical benefit to training
> on every Python-related message I receive as ham.  I currently have about 20
> in my training database.  If I was smart, I could probably figure out how to
> reduce that number.  As far as I can tell, nearly every valid Python-related
> message I receive gets a ham score of 0.00 (rounded).  None get scored as
> unsure or spam.  How long should I beat that particular dead horse?
> 
> Since blowing away my gazillion message training database I've started from
> scratch twice.  Considering the volume of mail I get, getting back to a
> 250-message training database is little effort at all for me.  SpamBayes
> seems to start scoring most stuff pretty well after seeing just a few hams
> and spams, so the cost is minimal.  The problem with spam is that it varies
> all over the map (subject wise).  My hams fall into just a few categories
> though, so good messages begin to be correctly classified almost
> immediately.  Spam tends to linger in the unsure category much longer.  My
> current approach to that problem is to try and push my spam_cutoff down
> further.
> 
> If you want to seed a training database, you might try initially adding just
> the most recent message from each of your active ham mailboxes.  I could add
> just ten messages and be almost certain they would all be useful indicators
> of ham.  Once I've added a few spams, I'd probably see pretty good
> classification results.
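
The seeding step described above could be sketched roughly as follows. This
assumes the ham mailboxes are mbox files whose last stored message is the most
recent one; the function names newest_message and seed_hams are made up for
illustration and are not part of SpamBayes.

```python
import mailbox


def newest_message(mbox_path):
    """Return the last message stored in an mbox file, or None if empty."""
    # Assumption: messages are appended in arrival order, so last == newest.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        keys = box.keys()
        return box[keys[-1]] if keys else None
    finally:
        box.close()


def seed_hams(mbox_paths):
    """Collect one seed ham (the newest message) per mailbox."""
    seeds = []
    for path in mbox_paths:
        msg = newest_message(path)
        if msg is not None:
            seeds.append(msg)
    return seeds
```

The returned messages would then be handed to whatever training step you
normally use for ham.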
> 
> Given a 20k-message training database which contains mistakes, I will have a
> hard time finding and correcting those mistakes.  Your approach is to reduce
> the magnitude of the mistakes by reducing the weight of the current training
> database.  I effectively take the same approach, it's just that I've
> actually deleted the mistakes.  I've thrown the baby out with the bath water
> (you just shrink your babies ;-), but I get plenty of babies in my incoming
> mail feed.  If I'm careful, perhaps I'll avoid introducing the same mistakes
> next time.
> 
>     Martin> Doesn't that sound like a good idea?
> 
> I suppose.  Mine doesn't require any new code to be written though.
> 
> I'm really not saying your idea is bad, just that mine ought to be "good
> enough" and requires no extra code to be written.  

I get your point.  But for whatever reason, I am just much less tolerant 
than you of having to futz with the training database.  Even if it isn't 
*perfect*, I feel better about shrinking those babies than throwing them 
out, since I really *hate* having to meet new babies.  Okay, we've 
stretched that analogy far enough!

> You should be able to
> write a little Python script which will march through your database and
> reduce the counts by appropriate amounts.  You will have to be aware of a
> couple corner conditions:
> 
>     * The counts for some words will round to zero.  You have to decide
>       whether to keep them as hapaxes or delete them altogether.
> 
>     * Roundoff error might leave you with some assertion errors like the
>       dreaded 
> 
>         assert hamcount <= nham
>         assert spamcount <= nspam 
> 
>       You'll also have to take care to avoid that case.
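
For anyone following along, a minimal sketch of such a count-shrinking script
might look like the one below. It does not touch the real SpamBayes database
API: the plain dict mapping each word to a (spamcount, hamcount) pair, the
name shrink_counts, and the keep_hapaxes flag are all assumptions made for
illustration.

```python
import math


def shrink_counts(word_counts, nspam, nham, factor, keep_hapaxes=True):
    """Scale all training counts by factor (0 < factor <= 1).

    word_counts maps word -> (spamcount, hamcount); this layout is a
    stand-in for illustration, not the real SpamBayes storage format.
    """
    if not 0 < factor <= 1:
        raise ValueError("factor must be in (0, 1]")

    def scale(n):
        # Round half up so the same rule applies to totals and word counts.
        return math.floor(n * factor + 0.5)

    new_nspam, new_nham = scale(nspam), scale(nham)
    shrunk = {}
    for word, (spamcount, hamcount) in word_counts.items():
        # Clamp defensively so a word count can never exceed its total.
        s = min(scale(spamcount), new_nspam)
        h = min(scale(hamcount), new_nham)
        if s == 0 and h == 0:
            if not keep_hapaxes:
                continue  # delete the word altogether
            # Keep a count of 1 on each side the word was originally seen,
            # but only where the shrunken total leaves room for it.
            if spamcount > 0 and new_nspam > 0:
                s = 1
            if hamcount > 0 and new_nham > 0:
                h = 1
            if s == 0 and h == 0:
                continue
        assert s <= new_nspam and h <= new_nham
        shrunk[word] = (s, h)
    return shrunk, new_nspam, new_nham
```

Whether a zero-rounded word is kept as a hapax or dropped is the judgment
call from the first bullet; the flag just makes that choice explicit, and the
final assert mirrors the two assertions quoted above.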

Ah, but you see: I'm too lazy to learn enough Python to get that to 
work.  But if I ever do try, thanks for the pointers.

> 
> One thing I tried in the past was to whack off the oldest 10%-20% of my
> training database and retrain on the result.  

Hold it right there.  Whack off?  hehe hehehe hheheheheheehehe.

> That's another option to try
> to remove errors.  If you as a trainer get better at your job, over time you
> will also reduce the number of mistakes in your training database.  This
> approach also has the pleasant side effect of deleting old messages, keeping
> your training data more current as the nature of spam shifts.  If you
> initially trained on a large body of saved mail though, you might wind up
> whacking out many/most/all the clues pertaining to a particular subject area
> and have to add some new messages in to compensate.
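
The "drop the oldest 10%-20% and retrain" idea can be sketched as below. The
(timestamp, label, text) tuple layout and the helper names drop_oldest and
lost_tokens are hypothetical, and whitespace splitting is only a stand-in for
a real tokenizer. The second helper is one rough way to spot the side effect
mentioned above, where every clue for some subject area disappears.

```python
def drop_oldest(messages, fraction):
    """Split messages into (dropped, kept), dropping the oldest fraction.

    Each message is a (timestamp, label, text) tuple; that layout is
    hypothetical and chosen only to keep the example small.
    """
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    ordered = sorted(messages, key=lambda m: m[0])  # oldest first
    cut = int(len(ordered) * fraction)
    return ordered[:cut], ordered[cut:]


def lost_tokens(dropped, kept):
    """Return tokens that appear only in the dropped messages."""
    def tokens(msgs):
        # Whitespace splitting stands in for a real tokenizer.
        return {tok for _, _, text in msgs for tok in text.lower().split()}
    return tokens(dropped) - tokens(kept)
```

After the split, you would retrain from scratch on the kept messages and, if
lost_tokens reports a lot of vocabulary for a topic you still care about, add
a few newer messages on that topic to compensate.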

Let's call it the "kill the oldest babies" method.  I actually thought 
about that one first before I came up with the shrinking babies.  I 
figured that I would prefer shrinking them since I wouldn't usually know 
how much I liked those older babies.

Aghhhhhhhhh babies!

Thanks for the input,
-Martin