[Python-Dev] The first trustworthy <wink> GBayes results
Sat, 31 Aug 2002 17:43:39 -0400
[Tim, to Paul Graham]
> I also noted earlier that FREE (all caps) is now one of the 15 words that
> most often makes it into the scorer's best-15 list, and cutting
> the legs off a clue like that is unattractive on the face of it. So I'm
> loath to fold case unless experiment proves that's an improvement, and it
> just doesn't look likely to do so.
Those experiments have been run now. Folding case gave a slight but
significant improvement in the false negative rate. It had no effect on the
false positive rate, but did change the *set* of messages flagged as false
positives: conference announcements are no longer flagged (for their VISIT
OUR WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING), but some
highly off-topic messages now are (e.g., talking about money is now
indistinguishable from screaming about MONEY). So, overall, I'm leaving
case-folding in. It does (of course) reduce the database size, and reduce
the amount of training data needed. I have no idea what this does for
corpora in languages other than English (for that matter, I don't even know
what "fold case" *means* in other languages <wink>).
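
To make the database-size and training-data point concrete, here is a minimal sketch of what folding case does to token counts. It assumes a plain whitespace tokenizer; `tokenize` and `count_tokens` are hypothetical names for illustration, not the actual GBayes code.

```python
def tokenize(text, fold_case=True):
    """Hypothetical tokenizer: split on whitespace, optionally folding case."""
    if fold_case:
        text = text.lower()
    return text.split()


def count_tokens(messages, fold_case=True):
    """Count, for each token, the number of messages it appears in."""
    counts = {}
    for msg in messages:
        # set() so a token repeated within one message is counted once.
        for tok in set(tokenize(msg, fold_case)):
            counts[tok] = counts.get(tok, 0) + 1
    return counts


msgs = ["Get FREE money now", "free MONEY for you", "talking about money"]
unfolded = count_tokens(msgs, fold_case=False)
folded = count_tokens(msgs, fold_case=True)

# Folding merges FREE/free and MONEY/money, so there are fewer distinct
# entries, and each surviving entry has more evidence behind it.
print(len(unfolded), len(folded))
print(unfolded["money"], unfolded["MONEY"], folded["money"])
```

The same merge is what loses the SCREAMING clue: after folding, "money" and "MONEY" share one entry and can no longer be scored differently.
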
Experiment also showed that boosting the "unknown word" probability from 0.2
to 0.5 was a pure win: it had no significant effect on the false positive
rate, but cut the false negative rate by a third. The only change I've seen
that had a bigger effect on reducing false negatives was adding special
parsing and tagging for embedded URLs.
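
One plausible reading of why the "unknown word" change matters, sketched under stated assumptions: a Graham-style scorer that keeps the clues farthest from 0.5 and combines them with the product rule. The `score` function, its parameters, and the example probabilities are hypothetical, not the actual GBayes code. With an unknown-word probability of 0.2, never-seen words count as mild evidence of non-spam and can outvote a real spam clue; at 0.5 they are neutral.

```python
def score(tokens, probs, unknown_prob=0.5, max_clues=15):
    """Combine per-token spam probabilities, Graham style (a sketch).

    probs maps token -> spam probability, assumed strictly between 0 and 1.
    Tokens missing from probs get unknown_prob.
    """
    clues = [probs.get(tok, unknown_prob) for tok in set(tokens)]
    # Keep only the clues farthest from the neutral value 0.5.
    clues.sort(key=lambda p: abs(p - 0.5), reverse=True)
    clues = clues[:max_clues]
    prod_spam = 1.0
    prod_ham = 1.0
    for p in clues:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


# Made-up training data: one known spammy word, three never-seen words.
probs = {"viagra": 0.9}
tokens = ["viagra", "zxq", "qwv", "plugh"]

low = score(tokens, probs, unknown_prob=0.2)
neutral = score(tokens, probs, unknown_prob=0.5)
print(low < 0.5 < neutral)
```

In this toy case the three unknown words at 0.2 drag the message below 0.5, a false negative, while at 0.5 they drop out of the product and the known clue decides.
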