
Guido addressed most points, so I'll just cover a few:

[Brad Clements]
> ... I'd like to replicate Tim's test rig so I can compare my results
> with existing ones. My spam isn't in mbox format, but I can convert it.

Mine isn't either <wink>. Barry gave me mboxes, but the spam corpus I got off the web had one spam per file, and it only took two days of extreme pain to realize that one msg per file is enormously easier to work with when testing: you want to split these at random into random collections, you may need to replace some at random when testing reveals spam mistakenly called ham (and vice versa), etc -- even pasting examples into email is much easier when it's one msg per file (and the test driver makes it easy to print a msg's file path).

My test driver and tokenizer are checked in (timtest.py), and also a little utility or two. The directory structure under my spambayes directory looks like so:

    Data/
        Spam/
            Set1/ (contains 2750 spam .txt files)
            Set2/            ""
            Set3/            ""
            Set4/            ""
            Set5/            ""
        Ham/
            Set1/ (contains 4000 ham .txt files)
            Set2/            ""
            Set3/            ""
            Set4/            ""
            Set5/            ""
            reservoir/ (contains "backup ham")

If you use the same names and structure, huge mounds of the tedious testing code will work as-is. The more Set directories the merrier, although you'll hit a point of diminishing returns if you exceed 10.

The "reservoir" directory contains a few thousand other random hams. When a ham is found that's really spam, I delete it, and then the rebal.py utility moves in a message at random from the reservoir to replace it. If I had it to do over again, I think I'd move such spam into a Spam set (chosen at random), instead of deleting it.

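For anyone setting up the same layout from a pile of one-msg-per-file .txt files, here is a minimal sketch of the two chores described above: dealing files at random into Set1..SetN, and topping a set back up from the reservoir. This is not rebal.py or anything checked in; the function names `split_into_sets` and `refill_from_reservoir`, their parameters, and the round-robin dealing are all assumptions made for illustration.

```python
import random
import shutil
from pathlib import Path


def split_into_sets(src_dir, dest_root, nsets=5, seed=None):
    """Deal the .txt files in src_dir at random into dest_root/Set1..SetN.

    Hypothetical helper: shuffles the files, then moves them round-robin
    so the sets end up (nearly) equal in size.
    """
    rng = random.Random(seed)
    files = sorted(Path(src_dir).glob("*.txt"))
    rng.shuffle(files)
    for i, path in enumerate(files):
        set_dir = Path(dest_root) / f"Set{i % nsets + 1}"
        set_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(set_dir / path.name))


def refill_from_reservoir(set_dir, reservoir_dir, seed=None):
    """Move one randomly chosen message from the reservoir into set_dir.

    Hypothetical stand-in for the replacement step; returns the new path.
    """
    rng = random.Random(seed)
    candidates = sorted(Path(reservoir_dir).glob("*.txt"))
    if not candidates:
        raise ValueError("reservoir is empty")
    chosen = rng.choice(candidates)
    dest = Path(set_dir) / chosen.name
    shutil.move(str(chosen), str(dest))
    return dest
```

Typical use would be something like `split_into_sets("incoming_ham", "Data/Ham")` once, and then `refill_from_reservoir("Data/Ham/Set3", "Data/Ham/reservoir")` each time a misfiled message is removed from Set3 (the "incoming_ham" directory name is made up here).
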
> I'm particularly interested in how to allow html only messages (reduce
> false positives). I'm getting a lot of personal mail in that format,
> unfortunately.

It will learn about that -- not a problem. It's a problem in *my* tests because HTML mail is so strongly hated on tech lists, but newbies use it there anyway, and it would be horrid to block newbies just because they're normal people who enjoy creating visually attractive messages <0.9 wink>.

Read the "What about HTML?" section in timtest.py. You may also wish to remove the guard from

    if part.get_content_type() == "text/plain":
        text = html_re.sub(' ', text)

in tokenize(). Once you have a good test setup, you can try it both ways, and the data will tell you which way works best for your normal mix. Details of runs both ways on my c.l.py corpora are given in the "What about HTML?" section mentioned before, and even there stripping HTML decorations out of HTML-only messages had an insignificant effect on the f-p rate. It increased the f-n rate, though, and precisely because HTML messages are so very rare on c.l.py that they're *almost* certainly spam.
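
To make "try it both ways" concrete, here is a rough sketch of what running with and without the guard might look like. It is not the tokenize() in timtest.py: the `extract_text` function, its `strip_all_html` flag, the latin-1 decoding, and the crude tag pattern assigned to `html_re` are all assumptions made for illustration (the real html_re may well differ).

```python
import email
import re

# Assumption: a crude "anything between angle brackets" pattern.
# The html_re actually used by the tokenizer may be different.
html_re = re.compile(r"<[^>]*>")


def extract_text(msg, strip_all_html=False):
    """Return the text of all text/* parts of msg, joined by newlines.

    With strip_all_html=False, tags are stripped only from text/plain
    parts (the guarded behavior quoted above), so HTML-only messages
    keep their tags. With strip_all_html=True the guard is effectively
    removed and tags are stripped from every text part.
    """
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        text = payload.decode("latin-1", errors="replace")
        if strip_all_html or part.get_content_type() == "text/plain":
            text = html_re.sub(" ", text)
        chunks.append(text)
    return "\n".join(chunks)


html_only = email.message_from_string(
    "Content-Type: text/html\n\n<b>hi</b> there")
print(repr(extract_text(html_only)))                       # tags kept
print(repr(extract_text(html_only, strip_all_html=True)))  # tags removed
```

Feeding both variants through the same train/test split, and comparing the f-p and f-n rates, is the experiment the data is supposed to settle.
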