[spambayes-dev] Re: Idea to re-energize corpus learning

Mon Nov 17 13:28:36 EST 2003

>>>>> "Martin" == Martin Stone Davis <m0davis at pacbell.net> writes:

    Martin> Skip Montanaro wrote:
    Martin> I recently started this thread on the POPFile forum, but it
    Martin> applies just as well to SpamBayes.
    >> 
    Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099
    >> 
    >> See my note from Sunday on spambayes-dev:
    >> 
    >> http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html

    Martin> Wouldn't it be nice if there were some middle ground between
    Martin> continuing to train the huge immovable database and starting
    Martin> over fresh?

Sure, it would, but why propagate mistakes, even if they are smaller in
magnitude?  I should have continued my previous message instead of leaving
people to draw their own conclusions.  With a small database, if you have an
error, it's easier to find, and if you can't find it, starting from scratch
is not a big problem.  With a large database there's this feeling that,
"but... but... but...  I'll be throwing away all that *good* data and all my
(valuable) work!"

    Martin> Having to train 100% of incoming messages after starting over is
    Martin> real work, and especially frustrating when you *know* that
    Martin> 80-90% would have been correctly classified anyway if only you
    Martin> hadn't started over.

If you only train on mistakes and unsures (as many of us appear to do now),
then the effort is lessened.  I don't see any practical benefit to training
on every Python-related message I receive as ham.  I currently have about 20
in my training database.  If I were smart, I could probably figure out how to
reduce that number.  As far as I can tell, nearly every valid Python-related
message I receive gets a ham score of 0.00 (rounded).  None get scored as
unsure or spam.  How long should I beat that particular dead horse?

Since blowing away my gazillion message training database I've started from
scratch twice.  Considering the volume of mail I get, getting back to a
250-message training database is little effort at all for me.  SpamBayes
seems to start scoring most stuff pretty well after seeing just a few hams
and spams, so the cost is minimal.  The problem with spam is that it varies
all over the map (subject-wise).  My hams fall into just a few categories
though, so good messages begin to be correctly classified almost
immediately.  Spam tends to linger in the unsure category much longer.  My
current approach to that problem is to try and push my spam_cutoff down
further.

If you want to seed a training database, you might try initially adding just
the most recent message from each of your active ham mailboxes.  I could add
just ten messages and be almost certain they would all be useful indicators
of ham.  Once I've added a few spams, I'd probably see pretty good
classification results.
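
A minimal sketch of that seeding step, assuming each ham mailbox is an mbox
file in which the last message is the most recent one.  The function name
seed_from_mailboxes is made up for illustration, and the sketch only
collects the messages; feeding them to the trainer is left out.

```python
import mailbox


def seed_from_mailboxes(paths):
    """Return the last (assumed newest) message from each mbox file.

    Assumption: messages were appended in arrival order, so the last key
    in each mailbox is the most recent message.  Empty mailboxes are
    skipped.
    """
    seeds = []
    for path in paths:
        box = mailbox.mbox(path, create=False)
        try:
            keys = box.keys()
            if keys:
                seeds.append(box[keys[-1]])
        finally:
            box.close()
    return seeds
```

Each returned item is a regular mailbox message object, so something like
seeds[0]["Subject"] lets you eyeball what you are about to train on.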

Given a 20k-message training database which contains mistakes, I will have a
hard time finding and correcting those mistakes.  Your approach is to reduce
the magnitude of the mistakes by reducing the weight of the current training
database.  I effectively take the same approach; it's just that I've
actually deleted the mistakes.  I've thrown the baby out with the bath water
(you just shrink your babies ;-), but I get plenty of babies in my incoming
mail feed.  If I'm careful, perhaps I'll avoid introducing the same mistakes
next time.

    Martin> Doesn't that sound like a good idea?

I suppose.  Mine doesn't require any new code to be written though.

I'm really not saying your idea is bad, just that mine ought to be "good
enough" and requires no extra code to be written.  You should be able to
write a little Python script which will march through your database and
reduce the counts by appropriate amounts.  You will have to be aware of a
couple of corner cases:

    * The counts for some words will round to zero.  You have to decide
      whether to keep them as hapaxes or delete them altogether.

    * Roundoff error might leave you with some assertion errors like the
      dreaded 

        assert hamcount <= nham
        assert spamcount <= nspam 

      You'll also have to take care to avoid that case.
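
Roughly, such a script might look like the sketch below.  It assumes the
database is just a plain dict mapping each word to a (spamcount, hamcount)
pair plus the two message totals, which is not the real SpamBayes storage
format.  The names scale_counts and keep_hapaxes are made up.  Word counts
are rounded down and the message totals are rounded up, so the two
assertions above cannot trip.  Putting a hapax on the side where the word
was seen more often is my arbitrary choice.

```python
import math


def scale_counts(wordinfo, nspam, nham, factor, keep_hapaxes=True):
    """Scale all counts by factor (0 < factor <= 1).

    wordinfo maps word -> (spamcount, hamcount).  Returns a new
    (wordinfo, nspam, nham) triple; the input is not modified.
    """
    if not 0.0 < factor <= 1.0:
        raise ValueError("factor must be in (0, 1]")

    # Round message totals up so they never drop below a word count.
    new_nspam = math.ceil(nspam * factor)
    new_nham = math.ceil(nham * factor)

    scaled = {}
    for word, (spamcount, hamcount) in wordinfo.items():
        # Round word counts down.
        new_spam = int(spamcount * factor)
        new_ham = int(hamcount * factor)

        if new_spam == 0 and new_ham == 0:
            if not keep_hapaxes:
                continue  # drop the word altogether
            # Keep it as a hapax on the side where it was seen more.
            if spamcount >= hamcount:
                new_spam = 1
            else:
                new_ham = 1

        # Belt and braces: never exceed the message totals.
        new_spam = min(new_spam, new_nspam)
        new_ham = min(new_ham, new_nham)
        if new_spam == 0 and new_ham == 0:
            continue

        assert new_spam <= new_nspam
        assert new_ham <= new_nham
        scaled[word] = (new_spam, new_ham)

    return scaled, new_nspam, new_nham
```

Calling scale_counts(wordinfo, nspam, nham, 0.5) would halve the weight of
everything trained so far.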

One thing I tried in the past was to whack off the oldest 10%-20% of my
training database and retrain on the result.  That's another option to try
to remove errors.  If you as a trainer get better at your job, over time you
will also reduce the number of mistakes in your training database.  This
approach also has the pleasant side effect of deleting old messages, keeping
your training data more current as the nature of spam shifts.  If you
initially trained on a large body of saved mail though, you might wind up
whacking out many/most/all the clues pertaining to a particular subject area
and have to add some new messages in to compensate.
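
In outline, whacking off the oldest chunk and retraining could be sketched
like this.  It assumes you still have the trained messages around as
(received_time, is_spam, tokens) tuples, which is a made-up representation,
and the retrain function is only a toy counter standing in for whatever
your real training step is.

```python
from collections import Counter


def drop_oldest(messages, fraction):
    """Return messages sorted oldest-first with the oldest fraction removed."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    ordered = sorted(messages, key=lambda m: m[0])
    n_drop = int(len(ordered) * fraction)
    return ordered[n_drop:]


def retrain(messages):
    """Rebuild per-word counts from scratch (toy stand-in for training).

    Each word is counted at most once per message.
    """
    spamcounts, hamcounts = Counter(), Counter()
    nspam = nham = 0
    for _received, is_spam, tokens in messages:
        if is_spam:
            nspam += 1
            spamcounts.update(set(tokens))
        else:
            nham += 1
            hamcounts.update(set(tokens))
    return spamcounts, hamcounts, nspam, nham
```

Something like retrain(drop_oldest(messages, 0.2)) would then rebuild the
counts from the newest 80% of the training data.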

Skip