[Spambayes] RE: Need more training messages
skip at pobox.com
Tue Sep 30 10:22:17 EDT 2003
Bob> Well, either...
Bob> - There are users of Spambayes in orthographically diverse
Bob> languages, in which case those users should be able to contribute
Bob> some ham samples, as well as their experience regarding the
Bob> accuracy of Spambayes's classification in their languages, or
Bob> - There aren't, in which case Spambayes's performance on ham
Bob> written in such languages is (at least for the time being)

I agree those are the two cases (<wink>), but don't agree with your
conclusions. If we distribute SpamBayes with a default db that performs
miserably on Asian ham, we're not likely to win a lot of support. I would
prefer that the default database process what current users normally
encounter in a reasonable way. I don't expect people will be able to avoid
training altogether. I just don't want the first couple of batches to all

Bob> But having said that, I have a broader confusion. I had thought
Bob> that a few weeks ago the Spambayes development community had become
Bob> convinced that it was just as effective to start with zero messages
Bob> and let the program build its database from scratch. The theory (I
Bob> thought) was that the tedium of dealing initially with the
Bob> classification of all messages as ambiguous would be balanced by
Bob> the fact that other users' idiosyncratic definitions of what is or
Bob> isn't spam would be removed from the equation. Indeed, wasn't the
Bob> ability to start from scratch specifically added to the most recent
Bob> Spambayes versions to accommodate this thinking?
Bob> Am I remembering wrong? Or have the developers changed their minds?
Bob> If not, why the renewed emphasis on the starter database?

There's no "emphasis" on a starter database. It's just worth taking a look
at. Think of it as yet another test. I now have a database with 18 spams
and 22 hams. It's a 280 KB pickle (89 KB zipped). I'll try to run
some tests today to see how it does. If you'd like to try it out, it's at