Re: [Python-Dev] The first trustworthy <wink> GBayes results
[Paul Graham]
Don't count words multiple times, and you'll probably get fewer false positives. That's the main reason I don't do it -- because it magnifies the effect of some random word like water happening to have a big spam probability.
Yes, that makes sense, but I'm trained not to think <wink>. Experiment will decide it (although I *expect* it's a good change, and counting multiple occurrences was obviously a factor in several of the rare false positives). If spam really is different, it should be different in several distinct ways.
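For concreteness, the change under discussion amounts to scoring each distinct token once per message rather than once per occurrence -- in Python terms, something like this (a sketch of the idea, not the actual GBayes code):

    >>> tokens = "no harmful chlorine to pollute the water or the air".split()
    >>> tokens.count('the')    # 'the' would contribute two clues today
    2
    >>> sorted(set(tokens))    # under the change, each token counts once
    ['air', 'chlorine', 'harmful', 'no', 'or', 'pollute', 'the', 'to', 'water']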
(Incidentally, why so high? In my db it's only 0.3930784.) --pg
I expect it's because this tokenizer *only* split on whitespace. Punctuation was left intact. So, e.g., on the Python discussion list stuff like
    The new approach blows it out of the water:

and

    This is very deep water;

and

    Then you'll take to Python like a duck takes to water!
are counted as "water:" and "water;" and "water!", not as "water".
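To illustrate, here's what a whitespace-only split does (just Python's str.split; a sketch of the effect, not the actual tokenizer):

    >>> "This is very deep water;".split()
    ['This', 'is', 'very', 'deep', 'water;']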
The spam corpus is chock full o' "water", though:
+ Porn sites advertising water sports.
+ Assorted bottled water pitches.
+ Assorted "oxygenated water" pitches.
+ Claims of environmental friendliness explicated via stuff like "no harmful chlorine to pollute the water or air!".
+ Pitches for weight-loss gimmicks emphasizing that you'll really lose fat, not just reduce water retention.
+ Pitches for weight-loss gimmicks emphasizing that you'll reduce water retention as well as lose fat.
+ One repeated bizarre analogy for how a breast enlargement cream works in the way "a sponge absorbs water".
+ This revolutionary new flat garden hose will really cut your water bills.
+ Ditto this miracle new laundry tablet that lets you use a fraction of the water needed by old-fashioned detergents.
+ Survivalist pitches often mention water in the same sentence as air and medical care.
I got tired then <wink>.
[Paul Graham]
I see, if you count the punctuation as part of the token, you end up with undersized-corpus effects. Esp if you are case-sensitive too. If I were you I'd map your input down into a narrower set of tokens, or you'll get too many errors. --pg
Possibly, but that's for experiment to decide (along with many other variations). The initial tokenization method was chosen merely for speed. Still, I looked at every false positive across 80,000 presumed non-spam test inputs, and posted the results earlier: it's hard to imagine that ignoring punctuation and/or case would have stopped any of them except for this one (which is darned hard to care about <wink>):

"""
HEY DUDEZ !
I WANT TO GET INTO THIS AUTOCODING THING. ANYONE KNOW WHERE I CAN GET SOME IBM 1401 WAREZ ?
-- MULTICS-MAN
"""

prob = 0.999982095931
prob('AUTOCODING') = 0.2
prob('THING.') = 0.2
prob('DUDEZ') = 0.2
prob('ANYONE') = 0.884211
prob('GET') = 0.847334
prob('GET') = 0.847334
prob('HEY') = 0.2
prob('--') = 0.0974729
prob('KNOW') = 0.969697
prob('THIS') = 0.953191
prob('?') = 0.0490886
prob('WANT') = 0.99
prob('TO') = 0.988829
prob('CAN') = 0.884211
prob('WAREZ') = 0.2

I also noted earlier that FREE (all caps) is now one of the 15 words that most often makes it into the scorer's best-15 list, and cutting the legs off a clue like that is unattractive on the face of it. So I'm loath to fold case unless experiment proves that's an improvement, and it just doesn't look likely to do so.

For smaller corpora, some other conclusion may well be justified; but experimenting on smaller corpora isn't on my near-term agenda, so that will have to wait (we've got a specific application in mind right now for which the corpus size I'm using is actually tiny -- python.org hosts some very high-volume mailing lists).
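As an aside, the final prob above is what the Graham-style combining rule produces: multiply the clue probabilities, and normalize against the product of their complements. A minimal sketch, fed the 15 clues from the false positive above (combine is an illustrative name, not the actual scorer):

    def combine(probs):
        # P = prod(p) / (prod(p) + prod(1-p)) over the best clues
        p = q = 1.0
        for x in probs:
            p *= x
            q *= 1.0 - x
        return p / (p + q)

    clues = [0.2, 0.2, 0.2, 0.884211, 0.847334, 0.847334, 0.2, 0.0974729,
             0.969697, 0.953191, 0.0490886, 0.99, 0.988829, 0.884211, 0.2]
    print(combine(clues))   # -> 0.999982..., matching the prob above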
[Tim, to Paul Graham]
... I also noted earlier that FREE (all caps) is now one of the 15 words that most often makes it into the scorer's best-15 list, and cutting the legs off a clue like that is unattractive on the face of it. So I'm loath to fold case unless experiment proves that's an improvement, and it just doesn't look likely to do so.
Those experiments have been run now. Folding case gave a slight but significant improvement in the false negative rate. It had no effect on the false positive rate, but did change the *set* of messages flagged as false positives: conference announcements are no longer flagged (for their VISIT OUR WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING), but some highly off-topic messages now are (e.g., talking about money is now indistinguishable from screaming about MONEY). So, overall, I'm leaving case-folding in. It does (of course) reduce the database size, and reduce the amount of training data needed. I have no idea what this does for corpora in languages other than English (for that matter, I don't even know what "fold case" *means* in other languages <wink>).

Experiment also showed that boosting the "unknown word" probability from 0.2 to 0.5 was a pure win: it had no significant effect on the false positive rate, but cut the false negative rate by a third. The only change I've seen that had a bigger effect on reducing false negatives was adding special parsing and tagging for embedded URLs.
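Pulling the winning pieces together, here's a minimal sketch of what that front end amounts to -- fold case, split on whitespace, tag embedded URLs, and give unseen words a neutral 0.5. The names and the url: tagging scheme are my illustrative assumptions, not the actual GBayes code:

    import re

    UNKNOWN_WORD_PROB = 0.5   # boosted from 0.2, per the experiments above

    def tokenize(text):
        # Fold case, then split on whitespace (punctuation left attached).
        text = text.lower()
        tokens = text.split()
        # Illustrative stand-in for the special URL parsing/tagging:
        tokens.extend('url:' + u for u in re.findall(r'http\S+', text))
        return tokens

    def spamprob(word, db):
        # db maps token -> learned probability; unseen words get the default
        return db.get(word, UNKNOWN_WORD_PROB)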