>     Skip> I doubt a few non-English hams and spam would hurt.  Let's
>     Skip> limit it to Western European languages (no Hebrew or Japanese,
>     Skip> for example).
>     Bob> I don't see the point of the limitation to Western European
>     Bob> spams. I'm firmly in the English-speaking world (no wisecracks
>     Bob> from the British Empire, please!), but a high percentage of my
>     Bob> spam is in Russian, Chinese, Japanese, etc.
> We know very little about how well SpamBayes works on *ham* which is
> written in non-Western European character sets.  The idea is that we
> provide an initial training database which allows SpamBayes to do a
> reasonable job scoring mail at the start.  I wouldn't want to include
> Asian spam and no Asian ham.  If a Japanese user installs SB and uses
> the starter database, they would likely be disappointed.
Well, either...

- There are SpamBayes users who correspond in languages with non-Western orthographies, in which case those users should be able to contribute some ham samples, as well as their experience of how accurately SpamBayes classifies mail in their languages, or

- There aren't, in which case SpamBayes's performance on ham written in such languages is (at least for the time being) irrelevant.

Having said that, I am confused about a broader point. I had thought that a few weeks ago the SpamBayes development community had become convinced that it was just as effective to start with zero messages and let the program build its database from scratch. The theory (I thought) was that the tedium of initially having every message classified as unsure would be balanced by removing other users' idiosyncratic definitions of what is or isn't spam from the equation. Indeed, wasn't the ability to start from scratch added to the most recent SpamBayes versions specifically to accommodate this thinking?

Am I misremembering? Or have the developers changed their minds? If neither, why the renewed emphasis on the starter database?


