[Spambayes] RE: Need more training messages

Skip Montanaro skip at pobox.com
Tue Sep 30 10:22:17 EDT 2003

    Bob> Well, either...

    Bob> - There are users of Spambayes in orthographically diverse
    Bob>   languages, in which case those users should be able to contribute
    Bob>   some ham samples, as well as their experience regarding the
    Bob>   accuracy of Spambayes's classification in their languages, or

    Bob> - There aren't, in which case Spambayes's performance on ham
    Bob>   written in such languages is (at least for the time being)
    Bob>   irrelevant.

I agree those are the two cases (<wink>), but don't agree with your
conclusions.  If we distribute SpamBayes with a default db that performs
miserably on Asian ham, we're not likely to win a lot of support.  I would
prefer that the default database do a reasonable job on the mail current
users normally encounter.  I don't expect people will be able to avoid
training altogether.  I just don't want the first couple of batches to all
score 0.5.
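
To illustrate why mail scored against an empty or mismatched database
tends to come out at 0.5, here is a minimal sketch of a Robinson-style
smoothed word probability.  The function name, the parameter names, and the
defaults s=0.45 and x=0.5 are assumptions chosen for illustration, not
taken from this message, and the step that combines word probabilities into
a message score is left out.

```python
def adjusted_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed spam probability for one token (illustrative sketch).

    spam_count, ham_count: trained spams/hams containing the token
    nspam, nham:           total trained spams/hams
    s: weight given to the prior (assumed value)
    x: probability assumed for a never-seen token (assumed value)
    """
    n = spam_count + ham_count
    if n == 0:
        # Never-seen token: only the neutral prior is available.
        return x
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    # Raw estimate from the training counts alone.
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Blend the raw estimate with the prior, weighted by the evidence n.
    return (s * x + n * p) / (s + n)


# A token the database has never seen carries no evidence either way.
unseen = adjusted_spam_prob(0, 0, nspam=18, nham=22)

# A token seen in 3 of 18 spams and no hams leans strongly toward spam.
spammy = adjusted_spam_prob(3, 0, nspam=18, nham=22)
```

If every token in a message is unseen, as could happen with ham in a
language the database was never trained on, each one sits at the neutral
prior and nothing pushes the message away from 0.5.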

    Bob> But having said that, I have a broader confusion. I had thought
    Bob> that a few weeks ago the Spambayes development community had become
    Bob> convinced that it was just as effective to start with zero messages
    Bob> and let the program build its database from scratch. The theory (I
    Bob> thought) was that the tedium of dealing initially with the
    Bob> classification of all messages as ambiguous would be balanced by
    Bob> the fact that other users' idiosyncratic definitions of what is or
    Bob> isn't spam would be removed from the equation. Indeed, wasn't the
    Bob> ability to start from scratch specifically added to the most recent
    Bob> Spambayes versions to accommodate this thinking?

    Bob> Am I remembering wrong? Or have the developers changed their minds?
    Bob> If not, why the renewed emphasis on the starter database?

There's no "emphasis" on a starter database.  It's just worth taking a look
at.  Think of it as yet another test.  I now have a database trained on 18
spams and 22 hams.  It's a 280 kbyte pickle (89 kbytes zipped).  I'll try to
run some tests today to see how it does.  If you'd like to try it out, it's at


