[Spambayes] Central limit

Anthony Baxter anthony@interlink.com.au
Mon, 30 Sep 2002 23:58:54 +1000


>>> Rob Hooft wrote
>   - It should somehow be possible to classify messages into any number of
>     distinct groups using this trick. A new message can get scored a
>     Z-score to describe the likelihood that it is part of any of the
>     groups; if all of these numbers are large, the test message does not
>     belong to any class. I guess, e.g. that it should not be too
>     difficult for the bayesian algorithms used here to judge whether
>     E-mail I receive is for "work", "private" or "spam". What would take
>     this really to the next generation would be an algorithm that can
>     make the classification "ab initio" as a sort of clustering
>     algorithm: e.g. something that would start with two of the most
>     different messages in a single corpus, and add single messages to
>     either of the two groups until it finds a message that has 2 large
>     Z-scores. Then it starts a third group.

Interestingly, I was thinking about doing something like this to
guess the language of a piece of text - I was planning to try
letter trigrams. I will need something like this in the
future - text-to-speech engines, even if they can handle non-English
text, require that you tell them which language they are reading (you
should hear what the American English dictionary does to German - it's
not at all pretty).
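
A rough sketch of the letter-trigram idea, to show the shape of it. This
is not the Spambayes classifier and not Rob's Z-score scheme: it simply
compares trigram counts by cosine similarity, which is a stand-in chosen
for brevity. The names (trigrams, cosine, guess_language) and the two
tiny sample texts are made up for illustration; a real profile would
need far more training text per language.

```python
from collections import Counter
import math


def trigrams(text):
    """Count overlapping letter trigrams (lowercased, whitespace collapsed)."""
    cleaned = " " + " ".join(text.lower().split()) + " "
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical, tiny training samples; illustration only.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the dog chases the fox through the field",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und dann jagt der hund den fuchs durch das feld",
}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def guess_language(text, profiles=PROFILES):
    """Return the language whose trigram profile is most similar to text."""
    grams = trigrams(text)
    return max(profiles, key=lambda lang: cosine(grams, profiles[lang]))
```

Calling guess_language() on a short sentence returns the key of the
closest profile. With samples this small the guess is fragile, so treat
it as a starting point only.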

Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.