[Python-Dev] The first trustworthy <wink> GBayes results
Tim Peters
tim.one@comcast.net
Wed, 28 Aug 2002 16:59:39 -0400
[Paul Graham]
> Don't count words multiple times, and you'll probably
> get fewer false positives. That's the main reason I
> don't do it-- because it magnifies the effect of some
> random word like water happening to have a big spam
> probability.
Yes, that makes sense, but I'm trained not to think <wink>. Experiment will
decide it (although I *expect* it's a good change, and counting multiple
occurrences was obviously a factor in several of the rare false positives).
If spam really is different, it should be different in several distinct
ways.
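A minimal sketch of the two counting schemes under discussion, for illustration only. The function name `token_counts` and the sample message are made up here and are not taken from the GBayes code; it assumes plain whitespace tokenization.

```python
from collections import Counter

def token_counts(text, count_repeats):
    """Count tokens in one message (hypothetical helper, not GBayes code).

    With count_repeats=False each distinct token is counted once per
    message, however often it occurs, as Paul Graham suggests.
    """
    tokens = text.split()
    if not count_repeats:
        tokens = set(tokens)  # collapse repeats within this message
    return Counter(tokens)

# Made-up sample message, just to show the difference.
msg = "water water everywhere and not a drop of water"
with_repeats = token_counts(msg, count_repeats=True)
once_only = token_counts(msg, count_repeats=False)

print(with_repeats["water"])  # 3
print(once_only["water"])     # 1
```

With repeats counted, a single message that happens to say "water" three times pushes the word's count up three times as hard, which is the magnifying effect described in the quote.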
> (Incidentally, why so high? In my db it's only 0.3930784.) --pg
I expect it's because this tokenizer *only* split on whitespace.
Punctuation was left intact. So, e.g., on the Python discussion list stuff
like
The new approach blows it out of the water:
and
This is very deep water;
and
Then you'll take to Python like a duck takes to water!
are counted as "water:" and "water;" and "water!", not as "water".
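The effect can be reproduced with a rough sketch like the one below. Both helper functions are hypothetical illustrations; the second one (stripping leading and trailing punctuation) is just one possible alternative and is not what the tokenizer described above did.

```python
import string

lines = [
    "The new approach blows it out of the water:",
    "This is very deep water;",
    "Then you'll take to Python like a duck takes to water!",
]

def split_on_whitespace(text):
    # Whitespace-only splitting: punctuation stays glued to the word.
    return text.split()

def split_and_strip_punct(text):
    # Assumed alternative: also trim punctuation from both ends of
    # each token, dropping tokens that were punctuation only.
    stripped = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in stripped if tok]

raw = [split_on_whitespace(line)[-1] for line in lines]
trimmed = [split_and_strip_punct(line)[-1] for line in lines]

print(raw)      # ['water:', 'water;', 'water!']
print(trimmed)  # ['water', 'water', 'water']
```

Under whitespace-only splitting, none of the three lines contributes to the count for the bare token "water".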
The spam corpus is chock full o' "water", though:
+ Porn sites advertising water sports.
+ Assorted bottled water pitches.
+ Assorted "oxygenated water" pitches.
+ Claims of environmental friendliness explicated via stuff like
"no harmful chlorine to pollute the water or air!".
+ Pitches for weight-loss gimmicks emphasizing that you'll really
lose fat, not just reduce water retention.
+ Pitches for weight-loss gimmicks emphasizing that you'll reduce
water retention as well as lose fat.
+ One repeated bizarre analogy for how a breast enlargement cream
works in the way "a sponge absorbs water".
+ This revolutionary new flat garden hose will really cut your water
bills.
+ Ditto for this miracle new laundry tablet, which lets you use a
fraction of the water needed by old-fashioned detergents.
+ Survivalist pitches often mention water in the same sentence as
air and medical care.
I got tired then <wink>.