RE: RE: RE: [Python-Dev] The first trustworthy <wink> GBayes results

1 Sep 2002

      [Tim, to Paul Graham]
...
...
I also noted earlier that FREE (all caps) is now one of the 15 words that
most often makes it into the scorer's best-15 list, and cutting
the legs off a clue like that is unattractive on the face of it.  So I'm
loath to fold case unless experiment proves that's an improvement, and it
just doesn't look likely to do so.

Those experiments have been run now.  Folding case gave a slight but
significant improvement in the false negative rate.  It had no effect on the
false positive rate, but did change the *set* of messages flagged as false
positives:  conference announcements are no longer flagged (for their VISIT
OUR WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING), but some
highly off-topic messages are (e.g., talking about money is now
indistinguishable from screaming about MONEY).  So, overall, I'm leaving
case-folding in.  It does (of course) reduce the database size, and reduce
the amount of training data needed.  I have no idea what this does for
corpora in languages other than English (for that matter, I don't even know
what "fold case" *means* in other languages <wink>).
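
A minimal sketch of what folding case does to the token stream, assuming a
plain whitespace tokenizer.  This is not the actual GBayes tokenizer; the
names tokenize and distinct_tokens are made up for illustration.

```python
def tokenize(text, fold_case=True):
    """Split text on whitespace, optionally folding case first."""
    if fold_case:
        # str.lower() is the simple English-centric choice.  In Python 3,
        # str.casefold() is more aggressive (e.g. it maps "ß" to "ss").
        text = text.lower()
    return text.split()


def distinct_tokens(messages, fold_case=True):
    """Collect the set of distinct tokens seen across all messages."""
    vocab = set()
    for msg in messages:
        vocab.update(tokenize(msg, fold_case))
    return vocab


msgs = ["FREE money", "Free MONEY", "free Money"]
print(len(distinct_tokens(msgs, fold_case=False)))  # six distinct spellings
print(len(distinct_tokens(msgs, fold_case=True)))   # two distinct words
```

With folding on, FREE and free collapse into one database entry, which is
both why the database shrinks and why the SCREAMING clue disappears.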

Experiment also showed that boosting the "unknown word" probability from 0.2
to 0.5 was a pure win:  it had no significant effect on the false positive
rate, but cut the false negative rate by a third.  The only change I've seen
that had a bigger effect on reducing false negatives was adding special
parsing and tagging for embedded URLs.
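
To see why the "unknown word" probability matters, here is a rough sketch of
a Graham-style scorer.  It assumes the usual scheme of keeping the 15 clues
farthest from 0.5 and combining them as prod(p) / (prod(p) + prod(1-p)); the
function name, the word_probs dict, and the sample numbers are all
hypothetical, and the real code may differ in detail.

```python
def score(tokens, word_probs, unknown_prob=0.5, max_clues=15):
    """Combine per-word spam probabilities into one message score.

    word_probs maps a word to its estimated spam probability, assumed to be
    strictly between 0 and 1.  Words not in word_probs get unknown_prob.
    """
    probs = [word_probs.get(tok, unknown_prob) for tok in sorted(set(tokens))]
    # Keep only the strongest clues: those farthest from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam = ham = 1.0
    for p in probs[:max_clues]:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


word_probs = {"free": 0.99}          # made-up training result
msg = ["free", "xyzzy", "plugh"]     # two words never seen in training
print(score(msg, word_probs, unknown_prob=0.2))  # unknown words pull toward ham
print(score(msg, word_probs, unknown_prob=0.5))  # unknown words are neutral
```

At 0.5 an unknown word contributes equally to both products, so it cannot
drag a spam's score toward ham; at 0.2 each unknown word counts as mild
evidence of ham, which is consistent with the drop in false negatives.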

Tim Peters