[Paul Graham]
I see, if you count the punctuation as part of the token, you end up with undersized-corpus effects. Esp if you are case-sensitive too. If I were you I'd map your input down into a narrower set of tokens, or you'll get too many errors. --pg
Possibly, but that's for experiment to decide (along with many other variations). The initial tokenization method was chosen merely for speed. Still, I looked at every false positive across 80,000 presumed non-spam test inputs, and posted the results earlier: it's hard to imagine that ignoring punctuation and/or case would have stopped any of them except for this one (which is darned hard to care about <wink>):

"""
HEY DUDEZ ! I WANT TO GET INTO THIS AUTOCODING THING.
ANYONE KNOW WHERE I CAN GET SOME IBM 1401 WAREZ ? -- MULTICS-MAN
"""

prob = 0.999982095931
prob('AUTOCODING') = 0.2
prob('THING.') = 0.2
prob('DUDEZ') = 0.2
prob('ANYONE') = 0.884211
prob('GET') = 0.847334
prob('GET') = 0.847334
prob('HEY') = 0.2
prob('--') = 0.0974729
prob('KNOW') = 0.969697
prob('THIS') = 0.953191
prob('?') = 0.0490886
prob('WANT') = 0.99
prob('TO') = 0.988829
prob('CAN') = 0.884211
prob('WAREZ') = 0.2

I also noted earlier that FREE (all caps) is now one of the 15 words that most often makes it into the scorer's best-15 list, and cutting the legs off a clue like that is unattractive on the face of it. So I'm loath to fold case unless experiment proves it's an improvement, and it just doesn't look likely to. For smaller corpora some other conclusion may well be justified, but experimenting on smaller corpora isn't on my near-term agenda, so that will have to wait (we've got a specific application in mind right now for which the corpus size I'm using is actually tiny -- python.org hosts some very high-volume mailing lists).
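
For anyone following along, here's a minimal sketch of the kind of scorer being discussed, under assumptions taken from the listing above: tokens are whatever whitespace-splitting produces (case and punctuation intact), words never seen in training get 0.2, and the 15 token probabilities farthest from 0.5 are combined with Graham's prod(p) / (prod(p) + prod(1 - p)) rule. The function names and the word_probs table are made up for illustration -- this isn't the project's actual code.

    # Illustrative sketch only, not the real implementation.
    UNKNOWN_PROB = 0.2   # value the listing above shows for never-seen words
    MAX_BEST = 15        # size of the "best-15" clue list

    def tokenize(text):
        """Whitespace split, case-sensitive, punctuation left attached."""
        return text.split()

    def score(text, word_probs):
        """Combined spam probability of `text`.

        `word_probs` maps token -> estimated P(spam | token); tokens not in
        the table get UNKNOWN_PROB.  The MAX_BEST probabilities farthest from
        the neutral 0.5 are combined via prod(p) / (prod(p) + prod(1 - p)).
        """
        probs = [word_probs.get(tok, UNKNOWN_PROB) for tok in tokenize(text)]
        best = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:MAX_BEST]
        prod_p = prod_not_p = 1.0
        for p in best:
            prod_p *= p
            prod_not_p *= 1.0 - p
        return prod_p / (prod_p + prod_not_p)

    if __name__ == "__main__":
        # Toy probability table; real values come from counting tokens in
        # the spam and ham training corpora.
        table = {"FREE": 0.99, "WANT": 0.99, "TO": 0.988829, "?": 0.0490886}
        print(score("I WANT TO GET SOME FREE WAREZ ?", table))

Note how folding case would merge FREE into free here and dilute exactly the kind of clue described above.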