[Spambayes] Central limit
Matt Sergeant
msergeant@startechgroup.co.uk
Mon, 30 Sep 2002 18:29:26 +0100
Josiah Carlson wrote:
> Then Rob Hooft sent the email with this:
>
>
>> - It should somehow be possible to classify messages into any number of
>> distinct groups using this trick. A new message can get scored a
>> Z-score to describe the likelihood that it is part of any of the
>> groups; if all of these numbers are large, the test message does not
>> belong to any class. I guess, e.g. that it should not be too
>> difficult for the Bayesian algorithms used here to judge whether
>> E-mail I receive is for "work", "private" or "spam". What would take
>> this really to the next generation would be an algorithm that can
>> make the classification "ab initio" as a sort of clustering
>> algorithm: e.g. something that would start with two of the most
>> different messages in a single corpus, and add single messages to
>> either of the two groups until it finds a message that has 2 large
>> Z-scores. Then it starts a third group.
>
>
> I've heard that popfile does something like that, where you can have as
> many categories as you want. I do not know how they do it
> however...maybe Matt Sergeant (the new perl guy) can check it out and
> tell us what's going on.
It's just an extension of what you're already doing. Imagine the
equations extended to multiple categories. At the moment you are asking
it two questions: Is this email spam, and is this email ham [and then
some fancy equations kick in to decide that given the two probabilities
which is most likely the truth]. Just extend that to more questions. My
original implementation of Bayesian probability just returned multiple
categories that the email happened to be in, although it only trained on
two categories.
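
To make "just extend that to more questions" concrete, here is a minimal sketch of a naive Bayes scorer that keeps one token table per category and asks the same question of each. It is not the Spambayes or POPFile code: the class name MultiCategoryBayes, the add-one smoothing, and the sample categories and tokens are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class MultiCategoryBayes:
    """Toy naive Bayes scorer over any number of categories (hypothetical)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # category -> token -> count
        self.message_counts = Counter()           # category -> messages trained

    def train(self, category, tokens):
        self.token_counts[category].update(tokens)
        self.message_counts[category] += 1

    def scores(self, tokens):
        """Return {category: posterior probability} for a tokenized message."""
        vocabulary = set()
        for counts in self.token_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for category, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Log prior: the share of training messages in this category.
            log_p = math.log(self.message_counts[category] / total_messages)
            for token in tokens:
                # Add-one smoothing so an unseen token does not zero the product.
                log_p += math.log(
                    (counts[token] + 1) / (total_tokens + len(vocabulary))
                )
            log_scores[category] = log_p
        # Normalize in log space so the returned scores sum to 1.
        peak = max(log_scores.values())
        weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}


# Made-up training data: three categories instead of two.
clf = MultiCategoryBayes()
clf.train("work", "meeting agenda project deadline".split())
clf.train("private", "dinner weekend family photos".split())
clf.train("spam", "free money offer click".split())

result = clf.scores("project meeting tomorrow".split())
best = max(result, key=result.get)
```

Adding a fourth or fiftieth category is just another call to train() with a new label. The scoring loop does not change, but each category's token table is built from fewer messages.
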
> It is entirely possible that by categorizing our email into only two
> distinct categories, spam and ham, we are asking the statistics to
> generalize too much. Imagine if instead of spam and ham, you had spam
> and spambayes. Wouldn't you think it would do much better in
> classifying incoming email into one of those two? Or even a mother and
> girlfriend (assuming all email is either from your mother or girlfriend,
> or spam and spambayes for the earlier comparison). Considering the
> variety of email I receive (and everyone else no doubt), I classify it
> into multiple folders (each person I email with gets their own folder,
> each mailing list gets another folder, etc.). In fact, I have over 50
> folders for email.
>
> I think that asking this of a piece of software, regardless of how much
> potential it has mathematically, is foolish: "Hey software, tell me if
> this email should go into ANY ONE of my 50 folders, or if it should be
> thrown away as spam." That's a big generalization.
Yes and no. Asking it to get more fine-grained than spam vs ham simply
takes away information from the system - you get fewer tokens in each
category and it makes it harder for the system to decide. You possibly
get fewer FNs, but I think you'll get more FPs.
Matt.