[spambayes-dev] Idea to re-energize corpus learning

Skip Montanaro skip at pobox.com
Mon Nov 17 08:47:30 EST 2003

    Martin> I recently started this thread on the POPFile forum, but it
    Martin> applies just as well to SpamBayes.

    Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099

See my note from Sunday on spambayes-dev:


Having trained on a gazillion spams and hams doesn't mean that starting over
isn't the best course once you've screwed something up.  As I said in
the above message, I think there's a certain psychological barrier you have
to overcome before you throw out a massive training database.  I suspect
POPFile learns about as quickly as SpamBayes, so without proof I assert that
starting over there is often going to be the right course as well.

For example, it's rather easy for me to scan my current training database
for mistakes, either in a semi-automated fashion using sb_filter.py or
manually, because it only contains about 250 messages.  This was extremely
difficult using my previous monster database (15k-20k messages).
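
The semi-automated scan could look roughly like the sketch below.  It
assumes you have already scored every message in the training database
(for instance by running each one through the filter) and collected
(msg_id, label, score) triples.  The function name, the triple layout and
the cutoff values are illustrative assumptions, not part of SpamBayes.

```python
def find_suspects(scored, ham_cutoff=0.2, spam_cutoff=0.9):
    """Return training messages whose score disagrees with their label.

    scored is an iterable of (msg_id, label, score) triples, where label
    is "ham" or "spam" and score is a spam probability in [0, 1].
    The cutoffs are illustrative assumptions; adjust them to taste.
    """
    suspects = []
    for msg_id, label, score in scored:
        if label == "ham" and score > ham_cutoff:
            # Trained as ham, but it does not score like ham.
            suspects.append((msg_id, label, score))
        elif label == "spam" and score < spam_cutoff:
            # Trained as spam, but it does not score like spam.
            suspects.append((msg_id, label, score))

    def disagreement(item):
        # Distance between the score and the ideal score for the label.
        _, label, score = item
        ideal = 1.0 if label == "spam" else 0.0
        return abs(ideal - score)

    # Worst disagreements first, so the likeliest mistakes get reviewed first.
    suspects.sort(key=disagreement, reverse=True)
    return suspects


if __name__ == "__main__":
    # Hypothetical scores for a four-message training database.
    scored = [
        ("a", "ham", 0.01),
        ("b", "ham", 0.95),
        ("c", "spam", 0.99),
        ("d", "spam", 0.30),
    ]
    for msg_id, label, score in find_suspects(scored):
        print(msg_id, label, score)
```

With a couple hundred messages the resulting list is short enough to check
by hand, which is the point of keeping the database small.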


