[Spambayes] RE: Need more training messages

Skip Montanaro skip at pobox.com
Tue Sep 30 10:22:17 EDT 2003

    Bob> Well, either...

    Bob> - There are users of Spambayes in orthographically diverse
    Bob>   languages, in which case those users should be able to contribute
    Bob>   some ham samples, as well as their experience regarding the
    Bob>   accuracy of Spambayes's classification in their languages, or

    Bob> - There aren't, in which case Spambayes's performance on ham
    Bob>   written in such languages is (at least for the time being)
    Bob>   irrelevant.

I agree those are the two cases (<wink>), but don't agree with your
conclusions.  If we distribute SpamBayes with a default db that performs
miserably on Asian ham, we're not likely to win a lot of support.  I would
prefer that the default database do a reasonable job on the mail current
users normally encounter.  I don't expect people will be able to avoid
training altogether.  I just don't want the first couple of batches to all
score 0.5.
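
To make the "score 0.5" worry concrete, here is a rough sketch of why an
empty database can't say anything useful about a token.  It assumes
Robinson-style smoothing of the per-word probability; the function name
and the default values for s and x are illustrative assumptions, not
copied from the SpamBayes source.

```python
def word_spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed spam probability for a single token (illustrative sketch).

    spam_count / ham_count: trained messages of each kind containing the token.
    nspam / nham: total trained spam / ham messages.
    s: assumed strength given to the prior; x: assumed prior for unseen tokens.
    """
    n = spam_count + ham_count
    if n == 0:
        # Token never seen in training: nothing to go on, so return the prior.
        return x
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Blend the observed estimate with the prior, weighted by evidence count.
    return (s * x + n * p) / (s + n)


# With no training at all, every token comes back at the prior.
print(word_spamprob(0, 0, 0, 0))
# With a small database (say 18 spams, 22 hams), a token seen only in spam
# already moves well away from the prior.
print(word_spamprob(5, 0, 18, 22))
```

If every token in a message sits at the prior, the combined score has
nothing to pull it off the middle, which is the behavior a small starter
database is meant to avoid.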

    Bob> But having said that, I have a broader confusion. I had thought
    Bob> that a few weeks ago the Spambayes development community had become
    Bob> convinced that it was just as effective to start with zero messages
    Bob> and let the program build its database from scratch. The theory (I
    Bob> thought) was that the tedium of dealing initially with the
    Bob> classification of all messages as ambiguous would be balanced by
    Bob> the fact that other users' idiosyncratic definitions of what is or
    Bob> isn't spam would be removed from the equation. Indeed, wasn't the
    Bob> ability to start from scratch specifically added to the most recent
    Bob> Spambayes versions to accommodate this thinking?

    Bob> Am I remembering wrong? Or have the developers changed their minds?
    Bob> If not, why the renewed emphasis on the starter database?

There's no "emphasis" on a starter database.  It's just worth taking a look
at.  Think of it as yet another test.  I now have a database with 18 spams
and 22 hams.  It's a 280 kbyte pickle (89 kbytes zipped).  I'll try to run
some tests today to see how it does.  If you'd like to try it out, it's at



