[Spambayes] Central limit

Matt Sergeant msergeant@startechgroup.co.uk
Mon, 30 Sep 2002 18:29:26 +0100


Josiah Carlson wrote:
> Then Rob Hooft sent the email with this:
> 
> 
>>  - It should somehow be possible to classify messages into any number of
>>    distinct groups using this trick. A new message can get scored a
>>    Z-score to describe the likelihood that it is part of any of the
>>    groups; if all of these numbers are large, the test message does not
>>    belong to any class. I guess, e.g. that it should not be too
>>    difficult for the bayesian algorithms used here to judge whether
>>    E-mail I receive is for "work", "private" or "spam". What would take
>>    this really to the next generation would be an algorithm that can
>>    make the classification "ab initio" as a sort of clustering
>>    algorithm: e.g. something that would start with two of the most
>>    different messages in a single corpus, and add single messages to
>>    either of the two groups until it finds a message that has 2 large
>>    Z-scores. Then it starts a third group.
> 
> 
> I've heard that popfile does something like that, where you can have as
> many categories as you want.  I do not know how they do it
> however...maybe Matt Sergeant (the new perl guy) can check it out and
> tell us what's going on.

It's just an extension of what you're already doing. Imagine the 
equations extended to multiple categories. At the moment you are asking 
it two questions: Is this email spam, and is this email ham [and then 
some fancy equations kick in to decide that given the two probabilities 
which is most likely the truth]. Just extend that to more questions. My 
original implementation of Bayesian probability just returned every 
category that the email happened to fall in, although it was only 
trained on two categories.
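
To make "just extend that to more questions" concrete, here is a minimal 
sketch of a plain multinomial naive Bayes scorer that asks one question per 
category. It is an illustration only, not the spambayes combining scheme 
and not how popfile does it (I haven't checked). The class name, the 
whitespace tokenizer, the add-one smoothing and the training strings are 
all assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict


class MultiCategoryBayes:
    """Toy multinomial naive Bayes over any number of categories (hypothetical)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # category -> token -> count
        self.message_counts = Counter()           # category -> messages trained

    def train(self, category, text):
        # Assumption: naive lowercase whitespace tokenizer, for illustration only.
        self.token_counts[category].update(text.lower().split())
        self.message_counts[category] += 1

    def scores(self, text):
        """Return {category: posterior probability} for one message."""
        tokens = text.lower().split()
        vocab = set()
        for counts in self.token_counts.values():
            vocab.update(counts)
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for category, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Start from the prior: the fraction of training mail in this category.
            log_p = math.log(self.message_counts[category] / total_messages)
            for token in tokens:
                # Add-one smoothing so an unseen token cannot zero out a category.
                log_p += math.log((counts[token] + 1) / (total_tokens + len(vocab)))
            log_scores[category] = log_p
        # Normalise in log space to turn the scores into probabilities.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}


classifier = MultiCategoryBayes()
classifier.train("spam", "cheap pills buy now limited offer")
classifier.train("work", "meeting agenda for the project review")
classifier.train("private", "dinner with mum on sunday")

result = classifier.scores("project meeting on monday")
best = max(result, key=result.get)
```

Going from two categories to fifty changes nothing in the code; only the 
number of entries in the returned dictionary grows.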

> It is entirely possible that by categorizing our email into only two
> distinct categories, spam and ham, we are asking the statistics to
> generalize too much.  Imagine if instead of spam and ham, you had spam
> and spambayes.  Wouldn't you think it would do much better in
> classifying incoming email into one of those two?  Or even a mother and
> girlfriend (assuming all email is either from your mother or girlfriend,
> or spam and spambayes for the earlier comparison). Considering the
> variety of email I receive (and everyone else no doubt), I classify it
> into multiple folders (each person I email with gets their own folder,
> each mailing list gets another folder, etc.).  In fact, I have over 50
> folders for email.
> 
> I think that asking this of a piece of software, regardless of how much
> potential it has mathematically, is foolish.  "Hey software, tell me if
> this email should go into ANY ONE of my 50 folders, or if it should be
> thrown away as spam."  That's a big generalization.

Yes and no. Asking it to get more fine-grained than spam vs ham simply 
takes information away from each category: you get fewer tokens per 
category, which makes it harder for the system to decide. You'll possibly 
get fewer FNs, but I think you'll get more FPs.
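
A rough sketch of the "fewer tokens in each category" point, using a 
handful of made-up messages and folder names (all hypothetical). It only 
counts how much training evidence each category ends up with when the same 
mail is filed finely versus lumped into ham; it does not measure FP or FN 
rates.

```python
from collections import Counter

# Hypothetical training messages, labelled with the folder they were filed in.
messages = [
    ("mother", "see you at dinner on sunday"),
    ("girlfriend", "dinner on friday then a film"),
    ("spambayes", "the central limit scores look good"),
    ("spam", "buy cheap pills now"),
    ("spam", "cheap offer buy now"),
]


def tokens_per_category(labelled, relabel=lambda folder: folder):
    """Count the training tokens available to each category after relabelling."""
    totals = Counter()
    for folder, text in labelled:
        totals[relabel(folder)] += len(text.split())
    return totals


# One category per folder: the ham evidence is split three ways.
fine = tokens_per_category(messages)

# Everything that is not spam lumped together as ham.
coarse = tokens_per_category(
    messages, relabel=lambda folder: "spam" if folder == "spam" else "ham"
)
```

The spam category sees the same tokens either way, while each ham folder 
is left with only a fraction of what the single ham category had.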

Matt.