[spambayes-dev] A spectacular false positive

Skip Montanaro skip at pobox.com
Sun Nov 16 22:13:16 EST 2003


    >> Especially since more & more of us are inclining toward using tiny
    >> databases (compared to what we used to do), making space for a "last
    >> used" timestamp may not be nearly as scary as it used to be.

    Alex> This is something that I don't understand... why do we care if the
    Alex> database is huge?  With 100 gigabyte drives commonplace, why are
    Alex> we quibbling over 20 or 40 megabytes?

It's not an issue of 20-40 megabytes; it's how many messages are represented
by that file.  In my case, I had a training database of around 21MB holding
on the order of 10,000 ham and somewhat fewer spam (maybe 7,000 or so),
depending on how aggressively I'd been training and how recently I'd whacked
off the oldest 10%-20% of my messages.

I think there's a psychological hurdle to overcome before you can simply
throw away 17,000 messages, even if the database isn't working optimally,
because it does represent a substantial time investment.  That hurdle is
much lower when your training database is under 500 messages.  Heck, I can
rebuild one of that size in next to no time.

Here's something I think would be interesting.  At the moment I have about
40 unsures awaiting a decision from me (train or discard).  I'm consciously
trying to be conservative.  What I'd like to know is which message, if
added to my training database, would have the greatest effect on the scores
of the other unsure messages.  That would help me decide which ones yield
the most benefit.  OTOH, maybe I'd do just as well to train on every fourth
unsure or select unsures to train on with a probability of 0.25 (1/4 picked
purely out of thin air, so don't ask where I got it :-).
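
To make the idea concrete, here is a rough sketch of how such a ranking
might be computed.  It is not SpamBayes code: ToyClassifier is a made-up
stand-in that averages smoothed per-token spam ratios, and
rank_by_influence and sample_unsures are hypothetical helper names.  It
also assumes you supply a guessed ham/spam label for each unsure, since a
message can't be trained on without one.

```python
import random
from collections import Counter


class ToyClassifier:
    """Stand-in scorer for illustration only; NOT the SpamBayes classifier."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spam messages containing it
        self.ham = Counter()   # token -> number of ham messages containing it

    def copy(self):
        clone = ToyClassifier()
        clone.spam = self.spam.copy()
        clone.ham = self.ham.copy()
        return clone

    def train(self, tokens, is_spam):
        # Count each token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Mean of smoothed per-token spam ratios; 0.5 means "no evidence".
        tokens = set(tokens)
        if not tokens:
            return 0.5
        probs = [(self.spam[t] + 1) / (self.spam[t] + self.ham[t] + 2)
                 for t in tokens]
        return sum(probs) / len(probs)


def rank_by_influence(classifier, unsures, labels):
    """Return (total_shift, index) pairs, largest shift first.

    unsures: list of token lists; labels: guessed is_spam flag per message.
    total_shift is the summed absolute score change of the *other* unsures
    after training a copy of the classifier on that one message.
    """
    baseline = [classifier.score(msg) for msg in unsures]
    ranking = []
    for i, (msg, is_spam) in enumerate(zip(unsures, labels)):
        trial = classifier.copy()  # leave the real database untouched
        trial.train(msg, is_spam)
        shift = sum(abs(trial.score(other) - baseline[j])
                    for j, other in enumerate(unsures) if j != i)
        ranking.append((shift, i))
    ranking.sort(reverse=True)
    return ranking


def sample_unsures(unsures, p=0.25, rng=random):
    """The cheap alternative: keep each unsure with probability p."""
    return [msg for msg in unsures if rng.random() < p]
```

The first entry of rank_by_influence(...) is the unsure whose training
moves the other scores the most under this toy scorer.  Training on every
fourth unsure is just unsures[::4].  Note the ranking costs one classifier
copy per candidate, which is only reasonable for a small database.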

Skip



