[Spambayes] Central limit

Tim Peters tim.one@comcast.net
Mon, 30 Sep 2002 14:23:10 -0400


[Josiah Carlson]
> ...
> What if we were to split it up into financial spam, porn spam, etc.
> I would think that would even the playing field a bit.

ifile users in fact report that ifile does better at finding spam if they
set up multiple categories for various kinds of spam.  But despite what some
people may tell you here, the approach this project is taking, and Paul
Graham's approach, have very little in common with Bayesian classifiers.
The clearest connection is that Paul put the word "Bayesian" on his web
page, and then linked to an article doing a non-Bayesian probability
calculation (although the article didn't claim it was doing a Bayesian
calculation; it didn't mention Bayes at all).

> Then one could have two lists, the spam list (categories of spam) and
> the ham list (categories of ham).  We really only are concerned with
> ham or spam, but by doing a bit more work on our side, it could make
> the computer's job easier.

You can test this.  The error rates on my corpus are too low now to measure
an improvement reliably if one were to be made.  Other people here aren't
faring *that* well with this system, and I expect it's because their ham is
more varied than mine (which is composed of comp.lang.python traffic; that
newsgroup has historically had a very generous definition of "on topic", but
the majority of messages still seem to mention Python at least once, or at
least profane Guido's name <wink>).
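
A minimal sketch of how such a test might be set up, for anyone who wants
to try it.  Train on fine-grained categories, then collapse the winning
category back to plain ham or spam, since that's all we care about in the
end.  Everything here is an assumption made for illustration: the category
names, the TOP_LEVEL mapping, and the CategoryCounts class are hypothetical,
and the scorer is a toy naive-Bayes-style word counter with add-one
smoothing, picked only because it copes with many categories naturally.  It
is not this project's scoring code and not ifile's interface.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mapping from fine-grained categories to the two labels
# we actually care about.
TOP_LEVEL = {
    "spam-financial": "spam",
    "spam-porn": "spam",
    "ham-python": "ham",
    "ham-personal": "ham",
}


class CategoryCounts:
    """Toy multi-category scorer (illustrative stand-in, not spambayes)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.msg_counts = Counter()              # category -> messages seen
        self.vocab = set()

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.msg_counts[category] += 1
        self.vocab.update(words)

    def log_score(self, category, text):
        counts = self.word_counts[category]
        total = sum(counts.values())
        vocab_size = len(self.vocab)
        # Log prior from message counts, then add-one smoothed word terms.
        score = math.log(self.msg_counts[category] / sum(self.msg_counts.values()))
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total + vocab_size))
        return score

    def classify(self, text):
        # Pick the best fine-grained category, then collapse it to ham/spam.
        best = max(self.msg_counts, key=lambda c: self.log_score(c, text))
        return best, TOP_LEVEL[best]


clf = CategoryCounts()
clf.train("spam-financial", "refinance your mortgage low rates")
clf.train("spam-porn", "hot adult pictures click")
clf.train("ham-python", "python list comprehension question")
clf.train("ham-personal", "lunch on friday with family")

print(clf.classify("low mortgage rates today"))
print(clf.classify("question about python list"))
```

To compare against the two-category setup, train a second instance with
every message filed under just "spam" or "ham" (and a matching two-entry
mapping), run both over the same held-out messages, and count the ham/spam
errors each makes.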