[Spambayes] RE: Need more training messages

Skip Montanaro skip at pobox.com
Tue Sep 30 13:10:53 EDT 2003

    >> I agree those are the two cases (<wink>), but don't agree with your
    >> conclusions.  If we distribute SpamBayes with a default db that
    >> performs miserably on Asian ham, we're not likely to win a lot of
    >> support.  I would prefer that the default database process what
    >> current users normally encounter in a reasonable way. ...

    Bob> I think I detect an a priori confidence that the same version of
    Bob> the Spambayes classifier, if properly trained, can work effectively
    Bob> on both European and Asian languages. I wonder if that confidence
    Bob> isn't unduly optimistic. For example, ...

You're reading too much into my hen scratches.  But why (possibly)
needlessly prejudice a future segment of our population?

    Bob> I presume that the Spambayes classifier tokenizes the incoming
    Bob> character stream according to an algorithm that depends heavily on
    Bob> clearly defined word markers (spaces and punctuation marks) that
    Bob> are largely absent, or at least less prominent, in Chinese. 


    Bob> insurmountable, of course, but I think it casts doubt on the "One
    Bob> size fits all" approach.

I don't believe I suggested that.
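
Bob's tokenization concern is easy to see in a toy example.  What follows is
a minimal sketch, not the actual SpamBayes tokenizer: both function names
and the sample strings are made up for illustration.  It splits text on
whitespace and punctuation, then contrasts that with overlapping character
bigrams, one commonly used alternative for text that has no word delimiters.

```python
import re

def split_on_word_markers(text):
    # Hypothetical helper: split on runs of whitespace and a few
    # ASCII and CJK punctuation marks, dropping empty pieces.
    pieces = re.split(r"[\s.,;:!?\u3002\uff0c\uff01\uff1f]+", text)
    return [tok for tok in pieces if tok]

def char_bigrams(text):
    # Hypothetical helper: overlapping two-character tokens,
    # ignoring whitespace.
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

english = "cheap pills online now"
chinese = "我们明天下午开会"  # roughly: "We have a meeting tomorrow afternoon"

english_tokens = split_on_word_markers(english)   # four separate words
chinese_tokens = split_on_word_markers(chinese)   # one undivided token
chinese_bigrams = char_bigrams(chinese)           # seven overlapping pairs
```

The English sample yields one token per word, while the Chinese sentence
comes back as a single token that is unlikely to ever recur in another
message.  The bigram version at least produces tokens that can repeat
across messages.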

More likely than J. Random Yoshi in Tokyo picking up SpamBayes is a
dual-language person (a student or green-card holder) in the US or Europe
trying it out.  That person is likely to get ham and spam in both European
and Asian character sets.  I'd like their Asian ham not to suddenly all
wind up in their spam folder because you and I can't read Chinese.  It's
quite possible that SpamBayes will fall flat on its face distinguishing
Asian ham and spam anyway.  I'd prefer that it all be "unsure" to start with
and let the user try training the different classes of mail.
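
As a rough sketch of what "unsure to start with" means: a score is mapped
to one of three classes by two cutoffs.  The cutoff values and the function
below are assumptions for illustration, not taken from the SpamBayes source,
and the sketch further assumes that a message made up entirely of tokens the
database has never seen scores near 0.5.

```python
HAM_CUTOFF = 0.20    # assumed value, for illustration only
SPAM_CUTOFF = 0.90   # assumed value, for illustration only

def classify(score):
    # Hypothetical helper: map a spam score in [0.0, 1.0] to a class.
    if score < HAM_CUTOFF:
        return "ham"
    if score > SPAM_CUTOFF:
        return "spam"
    return "unsure"

# A message whose tokens are all unknown is assumed to score about 0.5,
# which lands between the two cutoffs.
untrained_result = classify(0.5)
```

Under those assumptions, mail in a language the default database knows
nothing about stays in the middle band until the user trains on it, rather
than being pushed into the spam folder.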

There are a couple native Chinese speakers in my group here at Northwestern.
I should ask them if they get any email written using Chinese character sets and
would like to try out SB.


