Don't count words multiple times, and you'll probably get fewer false positives. That's the main reason I don't do it: it magnifies the effect of some random word like "water" happening to have a big spam probability.
Yes, that makes sense, but I'm trained not to think <wink>. Experiment will decide it (although I *expect* it's a good change, and counting multiple occurrences was obviously a factor in several of the rare false positives). If spam really is different, it should be different in several distinct ways.
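To make the magnifying effect concrete, here is a minimal sketch, not the code either of us actually runs. It assumes the product-style combining rule from pg's "A Plan for Spam"; the function name, the `count_duplicates` flag, and the per-word probabilities are all made up for illustration.

```python
from math import prod

def combined_spam_probability(tokens, spamprob, count_duplicates):
    """Combine per-word spam probabilities into one score for a message.

    Assumes the naive product rule: prod(p) / (prod(p) + prod(1 - p)).
    """
    if not count_duplicates:
        # Each distinct word contributes exactly once.
        tokens = set(tokens)
    probs = [spamprob[t] for t in tokens if t in spamprob]
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Hypothetical probabilities, NOT taken from any real database.
spamprob = {"water": 0.9, "python": 0.1, "duck": 0.4}
tokens = ["water", "water", "water", "python", "duck"]

once = combined_spam_probability(tokens, spamprob, count_duplicates=False)
many = combined_spam_probability(tokens, spamprob, count_duplicates=True)
print(f"counting each word once:   {once:.3f}")
print(f"counting every occurrence: {many:.3f}")
```

With these invented numbers, three occurrences of one spammy-looking word are enough to swamp the two hammy words when every occurrence is counted, while counting each word once keeps the score below one half.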
(Incidentally, why so high? In my db it's only 0.3930784.) --pg
I expect it's because this tokenizer *only* split on whitespace. Punctuation was left intact. So, e.g., on the Python discussion list stuff like
    The new approach blows it out of the water:

and

    This is very deep water;

and

    Then you'll take to Python like a duck takes to water!

are counted as "water:" and "water;" and "water!", not as "water".
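A small sketch of the difference, assuming nothing about the real tokenizer beyond "split on whitespace only". The helper names are hypothetical, and stripping leading and trailing punctuation is just one possible alternative, not a claim about what either filter does.

```python
import string

def split_on_whitespace(text):
    # Punctuation stays glued to the word: "water:" is its own token.
    return text.split()

def split_and_strip_punctuation(text):
    # Strip punctuation from both ends of each token; interior characters
    # (the apostrophe in "you'll") are left alone. Drop empty leftovers.
    stripped = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in stripped if tok]

lines = [
    "The new approach blows it out of the water:",
    "This is very deep water;",
    "Then you'll take to Python like a duck takes to water!",
]

raw = [tok for line in lines for tok in split_on_whitespace(line)]
clean = [tok for line in lines for tok in split_and_strip_punctuation(line)]

print(raw.count("water"), clean.count("water"))
```

Under whitespace-only splitting the bare token "water" never shows up in these three lines, so none of them adds to its ham count.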
The spam corpus is chock full o' "water", though:
+ Porn sites advertising water sports.
+ Assorted bottled water pitches.
+ Assorted "oxygenated water" pitches.
+ Claims of environmental friendliness explicated via stuff like "no harmful chlorine to pollute the water or air!".
+ Pitches for weight-loss gimmicks emphasizing that you'll really lose fat, not just reduce water retention.
+ Pitches for weight-loss gimmicks emphasizing that you'll reduce water retention as well as lose fat.
+ One repeated bizarre analogy for how a breast enlargement cream works in the way "a sponge absorbs water".
+ This revolutionary new flat garden hose will really cut your water bills.
+ Ditto this miracle new laundry tablet, which lets you use a fraction of the water needed by old-fashioned detergents.
+ Survivalist pitches often mention water in the same sentence as air and medical care.
I got tired then <wink>.