[Python-Dev] The first trustworthy <wink> GBayes results

Paul Graham pg@archub.org
28 Aug 2002 21:04:46 -0000


I see: if you count the punctuation as part of the
token, you end up with undersized-corpus effects,
especially if you are case-sensitive too.  If I were you
I'd map your input down into a narrower set of tokens,
or you'll get too many errors.  --pg
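
A rough sketch of what mapping the input down to a narrower set of
tokens could look like, using the three "water" sentences from the
quoted message below.  The function names and the exact normalization
rule (lowercase, then keep runs of letters, digits and apostrophes) are
assumptions for illustration only, not the tokenizer either of us
actually uses.

```python
import re

def split_on_whitespace(text):
    # Whitespace-only splitting: punctuation stays attached, case is kept.
    return text.split()

def narrow_tokens(text):
    # Assumed normalization: fold case, then keep only runs of
    # lowercase letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

lines = [
    "The new approach blows it out of the water:",
    "This is very deep water;",
    "Then you'll take to Python like a duck takes to water!",
]

# Collect every distinct token that contains "water" under each scheme.
raw = {tok for line in lines
       for tok in split_on_whitespace(line) if "water" in tok.lower()}
narrow = {tok for line in lines
          for tok in narrow_tokens(line) if "water" in tok}

print(sorted(raw))     # ['water!', 'water:', 'water;']
print(sorted(narrow))  # ['water']
```

With whitespace-only splitting the three sentences contribute three
different tokens, none of which is plain "water"; after normalization
they all count toward the same token.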

--Tim Peters wrote:
> [Paul Graham]
> > Don't count words multiple times, and you'll probably
> > get fewer false positives.  That's the main reason I
> > don't do it-- because it magnifies the effect of some
> > random word like water happening to have a big spam
> > probability.
> 
> Yes, that makes sense, but I'm trained not to think <wink>.  Experiment will
> decide it (although I *expect* it's a good change, and counting multiple
> occurrences was obviously a factor in several of the rare false positives).
> If spam really is different, it should be different in several distinct
> ways.
> 
> > (Incidentally, why so high?  In my db it's  only 0.3930784.)  --pg
> 
> I expect it's because this tokenizer *only* split on whitespace.
> Punctuation was left intact.  So, e.g., on the Python discussion list stuff
> like
> 
>     The new approach blows it out of the water:
> and
>     This is very deep water;
> and
>     Then you'll take to Python like a duck takes to water!
> 
> are counted as "water:" and "water;" and "water!", not as "water".
> 
> The spam corpus is chock full o' "water", though:
> 
> + Porn sites advertising water sports.
> + Assorted bottled water pitches.
> + Assorted "oxygenated water" pitches.
> + Claims of environmental friendliness explicated via stuff like
>   "no harmful chlorine to pollute the water or air!".
> + Pitches for weight-loss gimmicks emphasizing that you'll really
>   lose fat, not just reduce water retention.
> + Pitches for weight-loss gimmicks emphasizing that you'll reduce
>   water retention as well as lose fat.
> + One repeated bizarre analogy for how a breast enlargement cream
>   works in the way "a sponge absorbs water".
> + This revolutionary new flat garden hose will really cut your water
>   bills.
> + Ditto this miracle new laundry tablet lets you use a fraction of
>   the water needed by old-fashioned detergents.
> + Survivalist pitches often mention water in the same sentence as
>   air and medical care.
> 
> I got tired then <wink>.
>
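
For the point at the top of the quoted exchange about not counting
words multiple times, here is a minimal sketch of the two counting
schemes.  It covers only the counting step, not the probability
calculation; the function names and the toy messages are made up for
illustration and are not taken from either implementation.

```python
from collections import Counter

def count_every_occurrence(messages):
    # A word repeated within one message is counted each time it appears.
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)
    return counts

def count_once_per_message(messages):
    # A word is counted at most once per message, however often it repeats.
    counts = Counter()
    for tokens in messages:
        counts.update(set(tokens))
    return counts

# Hypothetical, already-tokenized spam messages.
spam = [
    ["water", "water", "water", "hose"],
    ["cheap", "hose"],
]

every = count_every_occurrence(spam)
once = count_once_per_message(spam)

print(every["water"], once["water"])  # 3 1
print(every["hose"], once["hose"])    # 2 2
```

In the first scheme a single message that repeats "water" three times
triples that word's spam count; in the second it contributes one count,
so one chatty message cannot inflate a word's apparent spamminess.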