From: Skip Montanaro [mailto:skip@pobox.com]
Sent: Tuesday, September 30, 2003 10:22 AM
To: Coe, Bob
Cc: spambayes@python.org
Subject: RE: [Spambayes] RE: Need more training messages
Bob> Well, either...
Bob> - There are users of Spambayes in orthographically diverse
Bob>   languages, in which case those users should be able to
Bob>   contribute some ham samples, as well as their experience
Bob>   regarding the accuracy of Spambayes's classification in
Bob>   their languages, or
Bob>
Bob> - There aren't, in which case Spambayes's performance on ham
Bob>   written in such languages is (at least for the time being)
Bob>   irrelevant.
I agree those are the two cases (<wink>), but don't agree with your conclusions. If we distribute SpamBayes with a default db that performs miserably on Asian ham, we're not likely to win a lot of support. I would prefer that the default database process what current users normally encounter in a reasonable way. ...
I think I detect an a priori confidence that the same version of the Spambayes classifier, if properly trained, can work effectively on both European and Asian languages. I wonder if that confidence isn't unduly optimistic. For example, ...

I presume that the Spambayes classifier tokenizes the incoming character stream according to an algorithm that depends heavily on clearly defined word markers (spaces and punctuation marks) that are largely absent, or at least less prominent, in Chinese. But if you try to tokenize the individual characters of written Chinese, you'll find that they're much more context-sensitive than the words of an English sentence are. To put it another way, many Chinese "words" consist of two or three characters, and the information-theoretic redundancy of a given character is rather low (not as low as in spoken Chinese, but low enough to be a problem). Which means, I suspect, that the tokenizer, if it's to be effective on Chinese text, will require somewhat more lookahead capability than it probably has now. The problem isn't insurmountable, of course, but I think it casts doubt on the "one size fits all" approach.

Such problems as this probably don't affect the classifier's ability to differentiate Chinese text from English ham. And if you're a user for whom all Chinese text is spam, that should be good enough. But differentiating Chinese spam from Chinese ham may be beyond the capability of the current classifier.

(By way of disclaimer, I don't speak Chinese and I haven't seen the internals of the Spambayes classifier. But I've written compilers and other text processing software, so I'm not a total novice either.)

Bob

MIS Department, City of Cambridge
831 Massachusetts Ave, Cambridge MA 02139 · 617-349-4217 · fax 617-349-6165
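[To make the segmentation worry concrete, here is a toy illustration (my own sketch, not anything from SpamBayes; the Chinese sample and its translation are illustrative): whitespace splitting works for English, but returns an unspaced Chinese sentence as a single opaque token.]

```python
# Illustrative only: whitespace splitting on English vs. Chinese text.
# Written Chinese normally has no spaces between words, so a naive
# whitespace tokenizer sees the whole sentence as one "word".

def naive_tokenize(text):
    """Split on whitespace, the way a simple Western tokenizer might."""
    return text.split()

english = "buy cheap watches now"
chinese = "现在就买便宜的手表"  # roughly "buy cheap watches now"

print(naive_tokenize(english))  # ['buy', 'cheap', 'watches', 'now']
print(naive_tokenize(chinese))  # ['现在就买便宜的手表'] -- one opaque token
```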
>> I agree those are the two cases (<wink>), but don't agree with your
>> conclusions. If we distribute SpamBayes with a default db that
>> performs miserably on Asian ham, we're not likely to win a lot of
>> support. I would prefer that the default database process what
>> current users normally encounter in a reasonable way. ...

Bob> I think I detect an a priori confidence that the same version of
Bob> the Spambayes classifier, if properly trained, can work effectively
Bob> on both European and Asian languages. I wonder if that confidence
Bob> isn't unduly optimistic. For example, ...

You're reading too much into my hen scratches. But why (possibly) needlessly prejudice a future segment of our population?

Bob> I presume that the Spambayes classifier tokenizes the incoming
Bob> character stream according to an algorithm that depends heavily on
Bob> clearly defined word markers (spaces and punctuation marks) that
Bob> are largely absent, or at least less prominent, in Chinese.

Correct.

Bob> insurmountable, of course, but I think it casts doubt on the "One
Bob> size fits all" approach.

I don't believe I suggested that. More likely than J. Random Yoshi in Tokyo picking up SpamBayes is a dual-language person (a student or green-card holder) in the US or Europe trying it out. That person is likely to get ham and spam in both European and Asian character sets. I'd rather their Asian ham not suddenly all wind up in their spam folder because you and I can't read Chinese.

It's quite possible that SpamBayes will fall flat on its face distinguishing Asian ham from Asian spam anyway. I'd prefer that it all start out "unsure" and let the user try training the different classes of mail.

There are a couple of native Chinese speakers in my group here at Northwestern. I should ask them if they get any email written using Chinese character sets and would like to try out SB.

Skip
[Bob Coe]
I think I detect an a priori confidence that the same version of the Spambayes classifier, if properly trained, can work effectively on both European and Asian languages. I wonder if that confidence isn't unduly optimistic. For example, ...
I expect SpamBayes in its current form to be useless for users whose email is primarily in Asian languages, unless all their ham is Asian and all their spam is non-Asian.
I presume that the Spambayes classifier tokenizes the incoming character stream according to an algorithm that depends heavily on clearly defined word markers (spaces and punctuation marks)
In the message body, we split on whitespace, and that's all. Punctuation is a non-SpamBayes concept in the body. Header tokenization makes many more assumptions, but the legal characters in email headers are constrained by standards in Anglo-centric ways.
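[Tim's "split on whitespace, and that's all" can be illustrated with a sketch (mine, not the actual SpamBayes code): since punctuation is never stripped, "free!" and "free," are distinct tokens from bare "free".]

```python
# A minimal sketch of whitespace-only body tokenization, as described
# above. Illustrative only -- not the real SpamBayes tokenizer.
def body_tokens(body):
    # Split on whitespace and nothing else: punctuation stays attached
    # to the word it touches.
    return body.split()

print(body_tokens("Get it free! Totally free, no catch."))
# ['Get', 'it', 'free!', 'Totally', 'free,', 'no', 'catch.']
```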
that are largely absent, or at least less prominent, in Chinese. But if you try to tokenize the individual characters of written Chinese, you'll find that they're much more context sensitive than the words of an English sentence are. [etc]
The tokenizer pays no attention to any of that. The most common output when tokenizing an Asian language is a long string of synthesized "skip" tokens (splitting on whitespace yields long strings then, and strings of length greater than 12 get replaced by a synthesized skip token).

In addition, the Outlook client has

    [Tokenizer]
    replace_nonascii_chars: True

enabled, which replaces "control" and "high-bit" characters each with a question mark before tokenization. So in the Outlook client, the primary output from parsing the body of an Asian-language message is a bunch of synthesized "skip: ? N" tokens.

That's great for weeding out Asian spam for European and American users, and sucks for everyone else. If everyone else wants something better, everyone else can volunteer to do the extensive research it takes to do something better <wink>.
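[The two mechanisms Tim describes can be sketched as follows. This is my own hedged reconstruction, not the actual SpamBayes code: I assume the skip token records the word's first character and its length rounded down to a multiple of 10, and the exact "skip" format is a guess. With the non-ASCII replacement on, an unspaced CJK sentence becomes one long run of "?" and collapses to a single skip token.]

```python
import re

MAX_WORD_SIZE = 12  # per the description above: words longer than 12 get "skipped"

def replace_nonascii(text):
    # Map control and high-bit characters to '?', per the Outlook
    # client's replace_nonascii_chars: True default (whitespace is kept).
    return re.sub(r"[^\x20-\x7e\s]", "?", text)

def tokenize_body(body, replace_nonascii_chars=True):
    if replace_nonascii_chars:
        body = replace_nonascii(body)
    for word in body.split():
        if len(word) > MAX_WORD_SIZE:
            # Synthesized skip token: first char + length rounded to 10s
            # (assumed format).
            yield "skip:%c %d" % (word[0], len(word) // 10 * 10)
        else:
            yield word

# A 20-character CJK sentence with no spaces becomes one skip token.
print(list(tokenize_body("这是一封用中文写成的电子邮件正文没有空格")))
# ['skip:? 20']
print(list(tokenize_body("hello world")))  # ['hello', 'world']
```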