[Tim, to Paul Graham]
... I also noted earlier that FREE (all caps) is now one of the 15 words that most often make it into the scorer's best-15 list, and cutting the legs off a clue like that is unattractive on the face of it. So I'm loath to fold case unless experiment proves that's an improvement, and it just doesn't look likely to do so.
Those experiments have been run now. Folding case gave a slight but significant improvement in the false negative rate. It had no effect on the false positive rate, but did change the *set* of messages flagged as false positives: conference announcements are no longer flagged (for their VISIT OUR WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING), but some highly off-topic messages now are (e.g., talking about money is now indistinguishable from screaming about MONEY). So, overall, I'm leaving case-folding in. It does (of course) reduce the database size, and reduce the amount of training data needed. I have no idea what this does for corpora in languages other than English (for that matter, I don't even know what "fold case" *means* in other languages <wink>).

Experiment also showed that boosting the "unknown word" probability from 0.2 to 0.5 was a pure win: it had no significant effect on the false positive rate, but cut the false negative rate by a third. The only change I've seen that had a bigger effect on reducing false negatives was adding special parsing and tagging for embedded URLs.
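
For readers who want to see the two knobs side by side, here is a minimal sketch, not the actual scorer. It assumes whitespace tokenization, a plain dict of per-word spam probabilities, and that the "best-15 list" picks the words whose probabilities lie farthest from 0.5. The names `tokenize`, `word_prob`, `best_clues`, and `UNKNOWN_WORD_PROB` are made up for illustration.

```python
# Hypothetical sketch: names and the whitespace tokenizer are assumptions,
# not the real scorer's code.
UNKNOWN_WORD_PROB = 0.5  # the value reported above as a win over 0.2


def tokenize(text, fold_case=True):
    """Split on whitespace, optionally folding case first."""
    if fold_case:
        text = text.lower()
    return text.split()


def word_prob(word, spam_probs, unknown=UNKNOWN_WORD_PROB):
    """Look up a word's spam probability; never-seen words get `unknown`."""
    return spam_probs.get(word, unknown)


def best_clues(tokens, spam_probs, n=15, unknown=UNKNOWN_WORD_PROB):
    """Return up to n distinct words whose probabilities are farthest from 0.5."""
    scored = {tok: word_prob(tok, spam_probs, unknown) for tok in set(tokens)}
    # Sort by distance from 0.5 (descending), then by word for stable ties.
    ranked = sorted(scored.items(), key=lambda kv: (-abs(kv[1] - 0.5), kv[0]))
    return ranked[:n]
```

With case folding on, `tokenize("FREE MONEY now")` and `tokenize("free money now")` produce the same tokens, which is exactly the loss of the FREE clue discussed above. Under the distance-from-0.5 assumption, an unknown word scored at 0.5 sits at distance zero and so ranks behind every word with real evidence, whereas at 0.2 it would count as a mild non-spam clue.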