[Spambayes] Central limit
Anthony Baxter
anthony@interlink.com.au
Mon, 30 Sep 2002 23:58:54 +1000
>>> Rob Hooft wrote
> - It should somehow be possible to classify messages into any number of
> distinct groups using this trick. A new message can get scored a
> Z-score to describe the likelihood that it is part of any of the
> groups; if all of these numbers are large, the test message does not
> belong to any class. I guess, e.g. that it should not be too
> difficult for the bayesian algorithms used here to judge whether
> E-mail I receive is for "work", "private" or "spam". What would take
> this really to the next generation would be an algorithm that can
> make the classification "ab initio" as a sort of clustering
> algorithm: e.g. something that would start with two of the most
> different messages in a single corpus, and add single messages to
> either of the two groups until it finds a message that has 2 large
> Z-scores. Then it starts a third group.
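A rough sketch of the assignment step described above, under some loud
assumptions: each message has already been reduced to a single number
(a hypothetical per-message score, not anything Spambayes produces as
such), each group is just a list of those numbers, and "large" means an
absolute Z-score above an arbitrary threshold of 3. The names
`z_scores`, `assign` and `threshold` are made up for illustration.

```python
import statistics


def z_scores(value, groups):
    """Return {group name: |z|} for `value` against each group's scores."""
    scores = {}
    for name, members in groups.items():
        mean = statistics.mean(members)
        stdev = statistics.stdev(members)  # needs at least two members
        if stdev == 0:
            # Degenerate group: every member identical.
            scores[name] = 0.0 if value == mean else float("inf")
        else:
            scores[name] = abs(value - mean) / stdev
    return scores


def assign(value, groups, threshold=3.0):
    """Return the best-fitting group name, or None if every |z| is large."""
    scores = z_scores(value, groups)
    best = min(scores, key=scores.get)
    if scores[best] > threshold:
        return None  # belongs to no existing group; caller may start a new one
    return best


# Hypothetical per-message scores for two existing groups.
groups = {
    "work": [0.10, 0.12, 0.11, 0.09],
    "spam": [0.90, 0.92, 0.88, 0.91],
}
```

A `None` result is the "all Z-scores are large" case, which is where the
clustering variant would open a new group.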
Interestingly, I was thinking about doing something like this for
attempting to guess the language of text - I was planning on trying
to use letter trigrams. I have a need for something like this in the
future - text-to-speech engines, even if they can do non-English,
require that you tell them the language they're reading (you should
hear what the American English dictionary does to German - it's not at
all pretty).
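
A minimal sketch of the letter-trigram idea, not a worked-out design:
it assumes one trigram count profile per language, built from sample
text, and picks the language whose profile has the highest cosine
similarity to the input. Cosine similarity is my stand-in here; a
Z-score or Bayesian scoring could be swapped in. The function names and
the tiny sample texts are invented for illustration, and real profiles
would need far more training text.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping letter trigrams, after normalising whitespace and case."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_language(text, profiles):
    """Return the profile name most similar to `text`."""
    sample = trigrams(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Toy training text, far too small for real use.
profiles = {
    "english": trigrams(
        "the quick brown fox jumps over the lazy dog and the cat "
        "sat on the mat with the other animals"
    ),
    "german": trigrams(
        "der schnelle braune fuchs springt auf den faulen hund und "
        "die katze sitzt auf der matte mit den anderen tieren"
    ),
}

print(guess_language("the dog and the cat", profiles))
print(guess_language("der hund und die katze", profiles))
```

The guess could then be handed to the speech engine as its language
setting.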
Anthony
--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.