[Python-Dev] GBayes design

Tim Peters spambayes@python.org
Thu, 05 Sep 2002 14:30:03 -0400


[Followups directed to spambayes@python.org
 http://mail.python.org/mailman-21/listinfo/spambayes
]

[Raymond Hettinger]
> Is it too late to challenge a core design decision?

Never too late, but somebody has to do real work to prove that a change is
justified.  Plausible ideas are cheaper than dirt, alas.

> Instead of multiplying probabilities, use fuzzy logic methods.
> Classify the indicators into damning, strong, weak, neutral, ...

Think about how that differs from 0.99, 0.80, 0.20 and 0.50.  Does it?

> After counting the number of indicators in each class, make
> a spam/ham decision that can be easily tweaked.  This would
> make it easy to implement variations of Tim's recent clear
> win, where additional indicators are gathered until the
> balance shifts sharply to one side.
>
> Some other advantages are:
> -- easily interpreted score vectors (6 damning, 7 strong, 4 weak, ... )

I've watched people meet the current prob("TV") = 0.99 style cold and pick it
up at once.  With character n-grams I think it's frustrating, but word-like
tokenization gives easily recognized clues.

> -- avoids mathematical issues with indicators not being independent

How do you know this?

> -- allows the addition of non-token based indicators.  for instance,
>     a preponderance of caps would be a weak indicator.  the presence
>     of caps separated by spaces would be a strong indicator.

As far as the current classifier is concerned, "a token" is any Python
object usable as a dict key.  There are already several ways in which the
current tokenization scheme in timtest.py uses strings to *represent*
non-textual indicators.  For example, if the headers lack an Organization
line, a 'bool:noorg' "token" is generated.  For large blobs of text that get
skipped, a token is generated that records both the first character in that
blob and the number of bytes skipped (chopped to the nearest multiple of
10).  And so on -- you can inject anything you like into the scheme,
including stuff like

    "number of caps separated by spaces: more than 10"

(BTW, I happen to know that this particular "clue" acts to block relevant
conference announcements, not just spam.)
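
To make the "any dict key is a token" point concrete, here is a minimal
sketch, not the actual timtest.py tokenizer.  The tokenize() signature, the
dict-of-headers shape, and the tuple-valued caps token are all hypothetical;
only 'bool:noorg' comes from the description above.

```python
from collections import Counter

def tokenize(headers, body):
    """Yield word tokens plus synthetic, non-textual indicator tokens.

    Assumption: `headers` is a dict of header name -> value and `body` is a
    str.  Both shapes are made up to keep the sketch small.
    """
    words = body.split()
    for word in words:
        yield word.lower()
    # A string that *represents* a boolean fact about the message.
    if "Organization" not in headers:
        yield "bool:noorg"
    # Any hashable object can be a dict key, so a tuple works as well as a
    # string.  This particular token is hypothetical.
    spaced_caps = sum(1 for w in words if len(w) == 1 and w.isupper())
    if spaced_caps > 10:
        yield ("spaced-caps", ">10")

# The counting side never looks inside a token; it only hashes it.
counts = Counter(tokenize({"From": "a@example.com"},
                          "B U Y  N O W  T O D A Y  O K buy now"))
```

Nothing downstream has to change when a new kind of indicator is added,
because the counter treats every token as an opaque key.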

I got some interesting results by injecting a crude characters/word
statistic:

    yield "cpw:%.1g" % (float(len(text)) / len(text.split()))

There are certain values of that statistic that turned out to be
killer-strong spam indicators, but there's a potential problem I've
mentioned before:  if you have an unbounded number of free parameters you
can fiddle, you can train a system to fit any given dataset exactly.  That's
in part why replication of results by others is necessary to make schemes
like this superb (I can only make one merely excellent on my own <wink>).
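
For anyone who wants to try the characters/word statistic outside the
tokenizer, here is a self-contained sketch of the one-liner above.  The
cpw_tokens() name and the empty-text guard are my additions for
illustration; the format string is the same.

```python
def cpw_tokens(text):
    """Yield one 'cpw:<value>' token: characters per word, 1 significant digit."""
    words = text.split()
    # Guard added for this sketch: text with no words would otherwise
    # divide by zero.
    if words:
        yield "cpw:%.1g" % (len(text) / len(words))

tokens = list(cpw_tokens("hello world foo"))   # 15 chars / 3 words
```

Rounding to one significant digit folds the ratio into a small set of
distinct tokens, which at least bounds how many free parameters this one
statistic can add.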

> -- the decision logic would be more intuitive
> -- avoids the issue of having equal amounts of spam and ham in
>     the sample

It's not clear that this matters; some results of preliminary experiments
are written up in the code comments.  The way Graham computes P(Spam | Word)
is via ratios, *as if* there were an equal number of each; and that's
consistent with the other bogus <wink> equality assumption in the scorer.  I
haven't yet changed all these guys at the same time to take P(Spam) and
P(Ham) into account.
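
A simplified sketch of the "via ratios" idea, to show what the equal-priors
assumption does.  This is not the classifier's code and it leaves out the
other adjustments in Graham's scheme; spam_prob() and its parameters are
hypothetical names, and the p_spam argument is just one way priors could be
folded in, not something that has been tried here.

```python
def spam_prob(spam_count, ham_count, nspam, nham, p_spam=0.5):
    """Estimate P(Spam | Word) from per-corpus frequencies.

    spam_count / ham_count: messages of each kind containing the word
    nspam / nham:           total messages of each kind in the training data
    p_spam:                 assumed prior P(Spam); 0.5 acts "as if" the two
                            corpora were the same size

    Assumption: the word was seen at least once, so the denominator is not 0.
    """
    spam_like = (spam_count / nspam) * p_spam
    ham_like = (ham_count / nham) * (1.0 - p_spam)
    return spam_like / (spam_like + ham_like)

# A word seen in 10% of each corpus, with a 4000/2750 ham/spam split.
equal_priors = spam_prob(275, 400, nspam=2750, nham=4000)
corpus_priors = spam_prob(275, 400, nspam=2750, nham=4000,
                          p_spam=2750 / (2750 + 4000))
```

With the default prior the word comes out neutral (0.5).  Plugging in the
corpus proportions instead collapses the estimate to the raw count ratio
275/(275+400), about 0.41, so the word leans ham merely because more ham
was trained on.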

BTW, note that all the results I've reported had a ham/spam training ratio
of 4000/2750.  I left that non-unity on purpose.

> The core concept would stay the same -- it's really just a shift from
> continuous to discrete.

Let us know how it turns out <wink>.