[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Sat, 31 Aug 2002 17:43:39 -0400


[Tim, to Paul Graham]
> ...
> I also noted earlier that FREE (all caps) is now one of the 15 words that
> most often makes it into the scorer's best-15 list, and cutting
> the legs off a clue like that is unattractive on the face of it.  So I'm
> loath to fold case unless experiment proves that's an improvement, and it
> just doesn't look likely to do so.

Those experiments have been run now.  Folding case gave a slight but
significant improvement in the false negative rate.  It had no effect on the
false positive rate, but did change the *set* of messages flagged as false
positives:  conference announcements are no longer flagged (for their VISIT
OUR WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING), but some
highly off-topic messages are (e.g., talking about money is now
indistinguishable from screaming about MONEY).  So, overall, I'm leaving
case-folding in.  It does (of course) reduce the database size, and reduce
the amount of training data needed.  I have no idea what this does for
corpora in languages other than English (for that matter, I don't even know
what "fold case" *means* in other languages <wink>).
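
To make the case-folding tradeoff concrete, here is a minimal sketch.  The
tokenize() helper and the two sample messages are made up for illustration;
this is not the tokenizer actually used in these experiments, just a
whitespace split with an optional lower().

```python
def tokenize(text, fold_case=True):
    """Hypothetical tokenizer: whitespace split, optionally folding case."""
    words = text.split()
    return [w.lower() for w in words] if fold_case else words


# Two invented messages: one screaming, one merely talking.
msgs = [
    "Visit our website for FREE money",
    "talking about money is free",
]

vocab_raw = {w for m in msgs for w in tokenize(m, fold_case=False)}
vocab_folded = {w for m in msgs for w in tokenize(m, fold_case=True)}

# Folding merges FREE and free into a single clue, so the vocabulary
# shrinks, but the scorer can no longer tell screaming from talking.
print(len(vocab_raw), len(vocab_folded))
```

The folded vocabulary is smaller because FREE and free become the same
token, which is where both the database savings and the lost SCREAMING clue
come from.  For text outside ASCII, Python 3's str.casefold() folds more
aggressively than str.lower(); whether either helps on a non-English corpus
is as untested here as it is above.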

Experiment also showed that boosting the "unknown word" probability from 0.2
to 0.5 was a pure win:  it had no significant effect on the false positive
rate, but cut the false negative rate by a third.  The only change I've seen
that had a bigger effect on reducing false negatives was adding special
parsing and tagging for embedded URLs.
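
Why the unknown-word probability can matter so much is easier to see in a
sketch.  The code below assumes a Graham-style scorer: take the (up to) 15
word probabilities farthest from 0.5 and combine them as
prod(p) / (prod(p) + prod(1-p)).  The score() function, the toy probability
table, and the sample tokens are all invented for illustration, and the
combining rule is an assumption rather than a copy of the code that
produced the results above.

```python
def score(tokens, spam_prob, unknown_prob=0.5, max_clues=15):
    """Hypothetical Graham-style combiner (a sketch, not the real scorer).

    spam_prob maps a word to its estimated spam probability; words missing
    from the table get unknown_prob instead.
    """
    probs = [spam_prob.get(t, unknown_prob) for t in sorted(set(tokens))]
    # Keep the clues farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs[:max_clues]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


# Toy table and message: one strong spam clue plus two never-seen words.
table = {"viagra": 0.99, "python": 0.01}
tokens = ["viagra", "xyzzy", "plugh"]

low = score(tokens, table, unknown_prob=0.2)
high = score(tokens, table, unknown_prob=0.5)
print(low, high)
```

With 0.2, every never-seen word counts as mild evidence of ham, so the two
unknown words drag this toy message down to roughly 0.86.  With 0.5 they are
neutral and the score stays at about 0.99.  That is consistent with spam full
of novel words slipping through less often, though a toy table like this
proves nothing about the real false negative rate.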