[Python-Dev] The first trustworthy <wink> GBayes results

Thu, 29 Aug 2002 00:18:04 -0400

FYI, on whether to count multiple instances of a word multiple times, or
only once, when scoring:  changing it to count words only once did fix the
specific false positive examples I mentioned.  However, across 20 test runs
(training on one of five pairs of corpora, and then for each such training
pair running predictions across the remaining four pairs), it was a mixed
bag.  On some runs it appeared to be a real improvement, on others a real
regression.  Overall, the results didn't support concluding it made a
significant difference to the false positive rate, but weakly supported
concluding that it increased the false negative rate.
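
To make the two counting policies concrete, here is a minimal sketch.  It
assumes a plain naive-Bayes product as the combining rule, which this
message doesn't spell out, and the per-word probabilities, the UNKNOWN
default, and the score() helper are all made up for illustration; none of
them are taken from the actual GBayes code or test data.

```python
from math import prod

# Hypothetical per-word spam probabilities, invented for illustration only.
SPAMPROB = {"python": 0.05, "list": 0.20, "free": 0.90}
UNKNOWN = 0.4  # assumed ham-leaning default for words never seen in training


def score(words, count_once):
    """Combine per-word spam probabilities into one message score.

    With count_once=True each distinct word contributes a single time;
    otherwise every occurrence contributes.
    """
    if count_once:
        # Drop repeats while keeping first-occurrence order.
        words = list(dict.fromkeys(words))
    probs = [SPAMPROB.get(w, UNKNOWN) for w in words]
    p_spam = prod(probs)
    p_ham = prod(1.0 - p for p in probs)
    return p_spam / (p_spam + p_ham)


# A ham-like message that happens to repeat one spammy word.
ham_words = "python list free free free".split()
score_multi = score(ham_words, count_once=False)  # about 0.91
score_once = score(ham_words, count_once=True)    # about 0.11
```

With these invented numbers the repeated "free" alone pushes the message
over 0.5 when every occurrence counts, and counting it once pulls the score
back down.  The same effect works against spam that repeats its strongest
clue, which is consistent with the false negative rate going up.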

That's very tentative -- I didn't stare at the actual misclassifications, I
just ran it while sleeping off a flu, then woke up and crunched the numbers.
This ignorant-of-MIME tokenization scheme is ridiculously bad for the false
negative rate anyway (an entire line of base64 or obfuscated
quoted-printable looks like a ham-favoring single "unknown word" to it), so
there are bigger fish to fry first.
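
A small sketch of why a MIME-ignorant tokenizer is blind to encoded
bodies.  Whitespace splitting stands in for the real tokenizer here, which
is an assumption, and the payload text is invented for the example.

```python
import base64

# Invented spam-like payload, base64-encoded the way a MIME body part
# might be.
payload = "Make money fast, click here now"
encoded_line = base64.b64encode(payload.encode("ascii")).decode("ascii")

# MIME-ignorant view: base64 output contains no whitespace, so the whole
# line is one opaque token that will almost never appear in training data.
naive_tokens = encoded_line.split()

# Decoding first exposes the words a classifier could actually learn from.
decoded_tokens = base64.b64decode(encoded_line).decode("ascii").split()
```

Here naive_tokens holds a single token while decoded_tokens holds six.  If
unknown words get a ham-leaning default score, that single token nudges the
message toward ham instead of contributing any spam evidence.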