[Spambayes] RE: Need more training messages

Coe, Bob rcoe at CambridgeMA.GOV
Tue Sep 30 12:32:21 EDT 2003

> From: Skip Montanaro [mailto:skip at pobox.com]
> Sent: Tuesday, September 30, 2003 10:22 AM
> To: Coe, Bob
> Cc: spambayes at python.org
> Subject: RE: [Spambayes] RE: Need more training messages
>     Bob> Well, either...
>     Bob> - There are users of Spambayes in orthographically diverse
>     Bob>   languages, in which case those users should be able to
>     Bob>   contribute some ham samples, as well as their experience 
>     Bob>   regarding the accuracy of Spambayes's classification in
>     Bob>   their languages, or
>     Bob> - There aren't, in which case Spambayes's performance on ham
>     Bob>   written in such languages is (at least for the time being)
>     Bob>   irrelevant.
> I agree those are the two cases (<wink>), but don't agree with your
> conclusions.  If we distribute SpamBayes with a default db that
> performs miserably on Asian ham, we're not likely to win a lot of
> support.  I would prefer that the default database process what
> current users normally encounter in a reasonable way. ...

I think I detect an a priori confidence that the same version of the Spambayes classifier, if properly trained, can work effectively on both European and Asian languages. I wonder if that confidence isn't unduly optimistic. For example, ...

I presume that the Spambayes classifier tokenizes the incoming character stream according to an algorithm that depends heavily on clearly defined word markers (spaces and punctuation marks) that are largely absent, or at least less prominent, in Chinese. But if you try to tokenize the individual characters of written Chinese, you'll find that they're much more context-sensitive than the words of an English sentence are. To put it another way, many Chinese "words" consist of two or three characters, and the information-theoretic redundancy of a given character is rather low (not as low as in spoken Chinese, but low enough to be a problem). That means, I suspect, that the tokenizer, if it's to be effective on Chinese text, will require somewhat more lookahead capability than it probably has now. The problem isn't insurmountable, of course, but I think it casts doubt on the "one size fits all" approach.
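To make the point concrete, here is a small sketch (my own illustration, not SpamBayes code) contrasting whitespace splitting, which works fine for English, with the overlapping character-bigram approach often used as a crude substitute for real word segmentation in unsegmented scripts:

```python
# Illustrative sketch only -- not the actual SpamBayes tokenizer.

def whitespace_tokenize(text):
    """Split on whitespace, as a naive English-oriented tokenizer would."""
    return text.split()

def bigram_tokenize(text):
    """Emit overlapping two-character tokens, a rough nod to the fact
    that many Chinese 'words' are two or three characters long."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

english = "cheap mortgage rates today"
chinese = "\u4fbf\u5b9c\u8d37\u6b3e"   # four characters, no spaces

print(whitespace_tokenize(english))    # four usable tokens
print(whitespace_tokenize(chinese))    # one opaque blob -- a useless feature
print(bigram_tokenize(chinese))        # overlapping pairs the classifier can count
```

Whitespace splitting reduces an entire Chinese sentence to a single token, so the classifier gets essentially no features from it; bigrams at least recover countable units, though at the cost of lumping together character pairs that straddle true word boundaries.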

Such problems as this probably don't affect the classifier's ability to differentiate Chinese text from English ham. And if you're a user for whom all Chinese text is spam, that should be good enough. But differentiating Chinese spam from Chinese ham may be beyond the capability of the current classifier.
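The first half of that claim is easy to see: even with no segmentation at all, CJK text occupies Unicode code-point ranges that essentially never appear in English ham, so a trivial test separates the two. This sketch (again mine, not anything in SpamBayes) shows how little machinery that takes:

```python
# Sketch: CJK vs. English is trivially separable by code-point range,
# which is why "all Chinese mail is spam to me" is an easy case.

def looks_cjk(text, threshold=0.5):
    """Return True if at least `threshold` of the characters fall in the
    CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    if not text:
        return False
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    return cjk / len(text) >= threshold

print(looks_cjk("cheap mortgage rates"))       # English: False
print(looks_cjk("\u4fbf\u5b9c\u8d37\u6b3e"))   # Chinese: True
```

The hard case is the one the paragraph above identifies: Chinese spam and Chinese ham draw on the same character ranges, so this kind of shortcut tells you nothing, and the classifier is thrown back on the quality of its tokenization.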

(By way of disclaimer, I don't speak Chinese and I haven't seen the internals of the Spambayes classifier. But I've written compilers and other text processing software, so I'm not a total novice either.)


MIS Department, City of Cambridge
831 Massachusetts Ave, Cambridge MA 02139  ·  617-349-4217  ·  fax 617-349-6165
