FYI, a follow-up on whether multiple instances of a word should be counted multiple times, or only once, when scoring. Changing it to count words only once did fix the specific false positive examples I mentioned. However, across 20 test runs (training on one of five pairs of corpora, and then for each such training pair running predictions across the remaining four pairs), it was a mixed bag. On some runs it appeared to be a real improvement, on others a real regression. Overall, the results didn't support concluding that it made a significant difference to the false positive rate, but they weakly supported concluding that it increased the false negative rate.
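To make the two counting rules concrete, here's a toy sketch. The `spamprob` table and the naive product combiner are illustrative assumptions, not the scorer actually used; the point is only the `unique` switch, which collapses repeated tokens to one occurrence before scoring:

```python
from math import prod

# Hypothetical per-token spam probabilities (assumed values for illustration;
# not taken from any real training run).
spamprob = {"free": 0.90, "meeting": 0.01}

def score(tokens, unique=False):
    """Combine per-token probabilities with a naive product rule.

    With unique=True each distinct word contributes once, so a word
    repeated many times can't dominate the score on its own.
    """
    if unique:
        tokens = set(tokens)
    probs = [spamprob.get(t, 0.5) for t in tokens]  # 0.5 for unknown words
    s = prod(probs)                  # evidence for spam
    h = prod(1.0 - p for p in probs) # evidence for ham
    return s / (s + h)

msg = ["free", "free", "free", "meeting"]
repeated = score(msg)           # "free" counted three times
once = score(msg, unique=True)  # "free" counted once
```

With these toy numbers the repeated-count score is pulled much further toward spam than the count-once score, which is exactly the behavior that fixed the false positives above while (apparently) costing some false negatives.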
That's very tentative -- I didn't stare at the actual misclassifications, I just ran it while sleeping off a flu, then woke up and crunched the numbers. This ignorant-of-MIME tokenization scheme is ridiculously bad for the false negative rate anyway (an entire line of base64 or obfuscated quoted-printable looks like a ham-favoring single "unknown word" to it), so there are bigger fish to fry first.
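For anyone who hasn't seen the failure mode: a sketch of why MIME-blind splitting hurts. The tokenizer below is a stand-in assumption (a plain whitespace split, resembling the behavior described, not the project's real splitter):

```python
import re

def tokenize(text):
    """Toy MIME-ignorant tokenizer: split on whitespace, nothing else."""
    return re.findall(r"\S+", text)

# A typical base64 body line has no internal whitespace, so the entire
# line survives as a single token the classifier has never seen before,
# scoring as one neutral "unknown word" instead of as spammy content.
base64_line = "UG90ZW50aWFsbHkgc3BhbW15IHBheWxvYWQgaGlkZGVuIGhlcmU="
tokens = tokenize(base64_line)
```

Every distinct base64 line hashes to a different never-seen token, so an encoded spam body contributes essentially no spam evidence at all, which is why the false negative rate dwarfs the counting question.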