[spambayes-dev] Re: Idea to re-energize corpus learning
Martin Stone Davis
m0davis at pacbell.net
Mon Nov 17 20:22:09 EST 2003
Skip Montanaro wrote:
>>>>>>"Martin" == Martin Stone Davis <m0davis at pacbell.net> writes:
> Martin> Skip Montanaro wrote:
> Martin> I recently started this thread on the POPFile forum, but it
> Martin> applies just as well to SpamBayes.
> Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099
> >> See my note from Sunday on spambayes-dev:
> >> http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html
> Martin> Wouldn't it be nice if there were some middle ground between
> Martin> continuing to train the huge immovable database and starting
> Martin> over fresh?
> Sure, it would, but why propagate mistakes, even if they are smaller in
> magnitude? I should have continued my previous message instead of leaving
> people to draw their own conclusions. With a small database, if you have an
> error, it's easier to find, and if you can't find it, starting from scratch
> is not a big problem. With a large database there's this feeling that,
> "but... but... but... I'll be throwing away all that *good* data and all my
> (valuable) work!"
> Martin> Having to train 100% of incoming messages after starting over is
> Martin> real work, and especially frustrating when you *know* that
> Martin> 80-90% would have been correctly classified anyway if only you
> Martin> hadn't started over.
> If you only train on mistakes and unsures (as many of us appear to do now),
> then the effort is lessened. I don't see any practical benefit to training
> on every Python-related message I receive as ham. I currently have about 20
> in my training database. If I was smart, I could probably figure out how to
> reduce that number. As far as I can tell, nearly every valid Python-related
> message I receive gets a ham score of 0.00 (rounded). None get scored as
> unsure or spam. How long should I beat that particular dead horse?
> Since blowing away my gazillion message training database I've started from
> scratch twice. Considering the volume of mail I get, getting back to a
> 250-message training database is little effort at all for me. SpamBayes
> seems to start scoring most stuff pretty well after seeing just a few hams
> and spams, so the cost is minimal. The problem with spam is that it varies
> all over the map (subject wise). My hams fall into just a few categories
> though, so good messages begin to be correctly classified almost
> immediately. Spam tends to linger in the unsure category much longer. My
> current approach to that problem is to try to push my spam_cutoff down.
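(For readers new to the cutoffs: SpamBayes buckets each message score into ham/unsure/spam using two thresholds, so pushing spam_cutoff down shrinks the unsure band from above. A minimal sketch of that mapping — the threshold values here are illustrative, not necessarily your configuration's:)

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Bucket a combined spam probability using the two cutoffs.

    Scores below ham_cutoff are ham, scores at or above spam_cutoff
    are spam, and everything in between is unsure.  Lowering
    spam_cutoff therefore reclassifies high-scoring unsures as spam.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

# A 0.85 score is unsure with these cutoffs...
print(classify(0.85))                     # unsure
# ...but pushing spam_cutoff down to 0.80 makes it spam.
print(classify(0.85, spam_cutoff=0.80))   # spam
```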
> If you want to seed a training database, you might try initially adding just
> the most recent message from each of your active ham mailboxes. I could add
> just ten messages and be almost certain they would all be useful indicators
> of ham. Once I've added a few spams, I'd probably see pretty good
> classification results.
> Given a 20k-message training database which contains mistakes, I will have a
> hard time finding and correcting those mistakes. Your approach is to reduce
> the magnitude of the mistakes by reducing the weight of the current training
> database. I effectively take the same approach; it's just that I've
> actually deleted the mistakes. I've thrown the baby out with the bath water
> (you just shrink your babies ;-), but I get plenty of babies in my incoming
> mail feed. If I'm careful, perhaps I'll avoid introducing the same mistakes
> next time.
> Martin> Doesn't that sound like a good idea?
> I suppose. Mine doesn't require any new code to be written though.
> I'm really not saying your idea is bad, just that mine ought to be "good
> enough" and requires no extra code to be written.
I get your point. But for whatever reason, I am just much less tolerant
than you of having to futz with the training database. Even if it isn't
*perfect*, I feel better about shrinking those babies than throwing them
out, since I really *hate* having to meet new babies. Okay, we've
stretched that analogy far enough!
> You should be able to
> write a little Python script which will march through your database and
> reduce the counts by appropriate amounts. You will have to be aware of a
> couple corner conditions:
> * The counts for some words will round to zero. You have to decide
> whether to keep them as hapaxes or delete them altogether.
> * Roundoff error might leave you with some assertion errors like the
> assert hamcount <= nham
> assert spamcount <= nspam
> You'll also have to take care to avoid that case.
Ah, but you see: I'm too lazy to learn enough Python to get that to
work. But if I ever do try, thanks for the pointers.
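(For anyone who does want to try it, here is a rough sketch of the count-scaling idea. It assumes a simplified store mapping each word to a (spamcount, hamcount) pair — the real SpamBayes WordInfo records differ, so treat this as an illustration rather than a drop-in script. It handles both corner conditions Skip mentions: counts that round to zero, and roundoff that would break the spamcount <= nspam / hamcount <= nham assertions.)

```python
def shrink_database(words, nspam, nham, factor=0.5, keep_hapaxes=True):
    """Scale every word count (and the message totals) by `factor`.

    `words` maps word -> (spamcount, hamcount).  Returns the shrunk
    mapping plus the new message totals.
    """
    new_nspam = max(1, round(nspam * factor))
    new_nham = max(1, round(nham * factor))
    shrunk = {}
    for word, (spamcount, hamcount) in words.items():
        sc = round(spamcount * factor)
        hc = round(hamcount * factor)
        if sc == 0 and hc == 0:
            if not keep_hapaxes:
                continue  # drop the word altogether
            # keep it as a hapax on whichever side it leaned
            if spamcount >= hamcount:
                sc = 1
            else:
                hc = 1
        # clamp so the classifier's assertions still hold after roundoff
        sc = min(sc, new_nspam)
        hc = min(hc, new_nham)
        shrunk[word] = (sc, hc)
    return shrunk, new_nspam, new_nham

words = {"viagra": (40, 0), "python": (0, 18), "rare": (1, 0)}
shrunk, nspam, nham = shrink_database(words, nspam=100, nham=50, factor=0.1)
# "rare" rounds to (0, 0) and survives as a spam hapax (1, 0).
```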
> One thing I tried in the past was to whack off the oldest 10%-20% of my
> training database and retrain on the result.
Hold it right there. Whack off? hehe hehehe hheheheheheehehe.
> That's another option to try
> to remove errors. If you as a trainer get better at your job, over time you
> will also reduce the number of mistakes in your training database. This
> approach also has the pleasant side effect of deleting old messages, keeping
> your training data more current as the nature of spam shifts. If you
> initially trained on a large body of saved mail though, you might wind up
> whacking out many/most/all the clues pertaining to a particular subject area
> and have to add some new messages in to compensate.
Let's call it the "kill the oldest babies" method. I actually thought
about that one first before I came up with the shrinking babies. I
figured that I would prefer shrinking them since I wouldn't usually know
how much I liked those older babies.
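(The "kill the oldest babies" method is easy to sketch too. Assuming each trained message carries a timestamp — the actual SpamBayes message stores are organized differently, so this just illustrates the idea:)

```python
def prune_oldest(messages, fraction=0.15):
    """Drop the oldest `fraction` of the training corpus.

    `messages` is a list of (timestamp, message) pairs.  Returns the
    survivors, which would then be retrained from scratch.  This keeps
    the training data current as spam shifts, at the risk of losing
    all the clues for topics covered only by the old messages.
    """
    ordered = sorted(messages, key=lambda pair: pair[0])
    cut = int(len(ordered) * fraction)
    return ordered[cut:]

corpus = [(1, "old spam"), (5, "newer ham"),
          (9, "recent spam"), (12, "recent ham")]
survivors = prune_oldest(corpus, fraction=0.25)  # drops the single oldest
```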
Thanks for the input,