[Paul Graham]
I see, if you count the punctuation as part of the token, you end up with undersized-corpus effects. Esp if you are case-sensitive too. If I were you I'd map your input down into a narrower set of tokens, or you'll get too many errors. --pg
Possibly, but that's for experiment to decide (along with many other variations). The initial tokenization method was chosen merely for speed. Still, I looked at every false positive across 80,000 presumed non-spam test inputs, and posted the results earlier: it's hard to imagine that ignoring punctuation and/or case would have stopped any of them except for this one (which is darned hard to care about <wink>):

"""
HEY DUDEZ ! I WANT TO GET INTO THIS AUTOCODING THING.
ANYONE KNOW WHERE I CAN GET SOME IBM 1401 WAREZ ? -- MULTICS-MAN
"""

prob = 0.999982095931
prob('AUTOCODING') = 0.2
prob('THING.') = 0.2
prob('DUDEZ') = 0.2
prob('ANYONE') = 0.884211
prob('GET') = 0.847334
prob('GET') = 0.847334
prob('HEY') = 0.2
prob('--') = 0.0974729
prob('KNOW') = 0.969697
prob('THIS') = 0.953191
prob('?') = 0.0490886
prob('WANT') = 0.99
prob('TO') = 0.988829
prob('CAN') = 0.884211
prob('WAREZ') = 0.2

I also noted earlier that FREE (all caps) is now one of the 15 words that most often makes it into the scorer's best-15 list, and cutting the legs off a clue like that is unattractive on the face of it. So I'm loath to fold case unless experiment proves it's an improvement, and it just doesn't look likely to. For smaller corpora some other conclusion may well be justified, but experimenting on smaller corpora isn't on my near-term agenda, so that will have to wait (we've got a specific application in mind right now for which the corpus size I'm using is actually tiny -- python.org hosts some very high-volume mailing lists).
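
For anyone following along, here's a minimal sketch of the kind of scorer being discussed, under assumptions taken from the listing above: tokens are whatever whitespace-splitting produces (case and punctuation intact), words never seen in training get 0.2, and the 15 token probabilities farthest from 0.5 are combined with Graham's prod(p) / (prod(p) + prod(1 - p)) rule. The function names and the word_probs table are made up for illustration -- this isn't the project's actual code.

    # Illustrative sketch only, not the real implementation.
    UNKNOWN_PROB = 0.2   # value the listing above shows for never-seen words
    MAX_BEST = 15        # size of the "best-15" clue list

    def tokenize(text):
        """Whitespace split, case-sensitive, punctuation left attached."""
        return text.split()

    def score(text, word_probs):
        """Combined spam probability of `text`.

        `word_probs` maps token -> estimated P(spam | token); tokens not in
        the table get UNKNOWN_PROB.  The MAX_BEST probabilities farthest from
        the neutral 0.5 are combined via prod(p) / (prod(p) + prod(1 - p)).
        """
        probs = [word_probs.get(tok, UNKNOWN_PROB) for tok in tokenize(text)]
        best = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:MAX_BEST]
        prod_p = prod_not_p = 1.0
        for p in best:
            prod_p *= p
            prod_not_p *= 1.0 - p
        return prod_p / (prod_p + prod_not_p)

    if __name__ == "__main__":
        # Toy probability table; real values come from counting tokens in
        # the spam and ham training corpora.
        table = {"FREE": 0.99, "WANT": 0.99, "TO": 0.988829, "?": 0.0490886}
        print(score("I WANT TO GET SOME FREE WAREZ ?", table))

Note how folding case would merge FREE into free here and dilute exactly the kind of clue described above.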