I spent an enormous amount of time this weekend running tests against various changes -- a "1% inspiration, 99% perspiration" kind of thing. There are lots of words about the changes (both good and bad) in the comment blocks and checkin msgs. The biggest "conceptual" change is that I'm now using (but only using) the Subject and From lines from the headers (my earlier belief that the ham corpora Subject lines were too corrupted by Mailman decorations turned out to be wrong). Adding Subject lines gave a remarkably small improvement, btw. Most changes I tried either didn't matter, or hurt. Approximately 70 more blatant spams in the ham corpora were identified and replaced with (randomly selected) legitimate msgs.
The f-p rate is too low now to measure changes with confidence. Best guess I can make from the evidence is that it's below 0.05% now. The false negative rate has improved more, and there's still plenty of those (so it's still easy to be confident about whether changes do or don't help that).
Across all 20 runs (each training on 4000 ham + about 2750 spam, then predicting against a different set with the same number of each), these are the false positive and negative rates now (percentages; note that 0.025% is a single message in the f-p column; a single msg in the f-n column is about 0.036%):
f-p     f-n
0.000   1.236
0.000   1.164
0.050   1.454
0.000   1.599
0.025   1.527
0.025   1.236
0.050   1.163
0.025   1.309
0.025   1.891
0.000   1.418
0.075   1.745
0.050   1.708
0.025   1.491
0.000   0.836
0.050   1.091
0.025   1.309
0.025   1.491
0.000   1.127
0.025   1.309
0.050   1.636
The aggregate number of unique f-p across all runs is down to 8. The aggregate number of unique f-n across all runs is 336.
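As a rough cross-check of the numbers above, here is a small sketch that recomputes the per-message granularity and the mean rates from the 20 runs. The variable names are made up for illustration and are not from the checked-in code, and the spam count is only the approximate 2750 figure quoted above, so the f-n granularity is approximate too.

```python
# Illustrative only: names are hypothetical, not from the project code.
HAM_PER_RUN = 4000    # ham messages predicted against in each run
SPAM_PER_RUN = 2750   # "about 2750" spam per run; approximate

# (f-p %, f-n %) for each of the 20 runs, copied from the table above.
RATES = """
0.000 1.236
0.000 1.164
0.050 1.454
0.000 1.599
0.025 1.527
0.025 1.236
0.050 1.163
0.025 1.309
0.025 1.891
0.000 1.418
0.075 1.745
0.050 1.708
0.025 1.491
0.000 0.836
0.050 1.091
0.025 1.309
0.025 1.491
0.000 1.127
0.025 1.309
0.050 1.636
"""

rows = [tuple(float(x) for x in line.split())
        for line in RATES.strip().splitlines()]

# Percentage represented by one misclassified message in each column.
one_ham_pct = 100.0 / HAM_PER_RUN
one_spam_pct = 100.0 / SPAM_PER_RUN

# Convert each run's f-p percentage back to a whole number of messages.
fp_counts = [round(fp * HAM_PER_RUN / 100.0) for fp, _ in rows]
total_fp = sum(fp_counts)

# Mean rates across runs.
mean_fp = sum(fp for fp, _ in rows) / len(rows)
mean_fn = sum(fn for _, fn in rows) / len(rows)

print(f"one ham msg  = {one_ham_pct:.3f}%")
print(f"one spam msg = {one_spam_pct:.3f}% (approx.)")
print(f"f-p per run  = {fp_counts}")
print(f"total f-p    = {total_fp} over {len(rows)} runs")
print(f"mean f-p     = {mean_fp:.4f}%")
print(f"mean f-n     = {mean_fn:.3f}%")
```

Note that the per-run f-p total counts the same message again each time a different run trips over it, which is why it is larger than the 8 unique f-p.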
The 8 ham messages that at least one run claimed were spam are attached. Note that I finally removed the "If AOL were a car" spam from the good corpus; while it may or may not be amusing, it *was* automated bulk email, even to the extent of including large blocks of random characters at the end. The message consisting almost entirely of quoting a Nigerian scam message looks like it would be a "false positive" under any scheme worth using, but I left it in the good corpus (so it's still an f-p here), because it wasn't bulk email (the original msg was, but the reply was not).