[Python-Dev] Getting started with GBayes testing
Tim Peters
tim.one@comcast.net
Wed, 04 Sep 2002 21:34:21 -0400
Guido addressed most points, so I'll just cover a few:
[Brad Clements]
> ...
> I'd like to replicate Tim's test rig so I can compare my results
> with existing ones. My spam isn't in mbox format, but I can convert it.
Mine isn't either <wink>. Barry gave me mboxes, but the spam corpus I got
off the web had one spam per file, and it only took two days of extreme pain
to realize that one msg per file is enormously easier to work with when
testing: you want to split these at random into collections, you may
need to replace some at random when testing reveals spam mistakenly called
ham (and vice versa), etc -- even pasting examples into email is much easier
when it's one msg per file (and the test driver makes it easy to print a
msg's file path).
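
For anyone starting from mboxes, here is a minimal sketch of the conversion to one msg per file, using the standard library's mailbox module. The function name split_mbox and the zero-padded file naming are illustrative assumptions, not something taken from the checked-in utilities.

```python
import mailbox
import os

def split_mbox(mbox_path, out_dir):
    """Write each message of an mbox to its own numbered .txt file.

    Hypothetical helper: the name and the NNNNN.txt scheme are
    assumptions made for this sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    box = mailbox.mbox(mbox_path, create=False)
    count = 0
    try:
        for count, key in enumerate(box.iterkeys(), start=1):
            path = os.path.join(out_dir, "%05d.txt" % count)
            with open(path, "wb") as f:
                # get_bytes() returns the raw message without the
                # mbox "From " separator line.
                f.write(box.get_bytes(key))
    finally:
        box.close()
    return count
```

Writing raw bytes sidesteps any guessing about character sets, which matters for spam.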
My test driver and tokenizer are checked in (timtest.py), and also a little
utility or two. The directory structure under my spambayes directory looks
like so:
Data/
    Spam/
        Set1/ (contains 2750 spam .txt files)
        Set2/            ""
        Set3/            ""
        Set4/            ""
        Set5/            ""
    Ham/
        Set1/ (contains 4000 ham .txt files)
        Set2/            ""
        Set3/            ""
        Set4/            ""
        Set5/            ""
        reservoir/ (contains "backup ham")
If you use the same names and structure, huge mounds of the tedious testing
code will work as-is. The more Set directories the merrier, although you'll
hit a point of diminishing returns if you exceed 10. The "reservoir"
directory contains a few thousand other random hams. When a ham is found
that's really spam, I delete it, and then the rebal.py utility moves in a
message at random from the reservoir to replace it. If I had it to do over
again, I think I'd move such spam into a Spam set (chosen at random),
instead of deleting it.
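
The replacement step can be sketched as follows. This is not rebal.py's actual logic, only an illustration of the idea under stated assumptions: the function name refill_from_reservoir and its target_count parameter are hypothetical, and file names are assumed to be unique across the set and the reservoir.

```python
import os
import random
import shutil

def refill_from_reservoir(set_dir, reservoir_dir, target_count, rng=random):
    """Move randomly chosen files from reservoir_dir into set_dir
    until set_dir holds target_count files (or the reservoir runs dry).

    Hypothetical helper for illustration; returns the names moved.
    """
    shortfall = max(target_count - len(os.listdir(set_dir)), 0)
    candidates = sorted(os.listdir(reservoir_dir))
    moved = []
    for name in rng.sample(candidates, min(shortfall, len(candidates))):
        shutil.move(os.path.join(reservoir_dir, name),
                    os.path.join(set_dir, name))
        moved.append(name)
    return moved
```

Passing a seeded random.Random instance as rng makes a rebalancing run repeatable, which helps when comparing results across test runs.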
> I'm particularly interested in how to allow html only messages
> (reduce false positives). I'm getting a lot of personal mail in that
> format, unfortunately.
It will learn about that -- not a problem. It's a problem in *my* tests
because HTML mail is so strongly hated on tech lists, but newbies use it
there anyway, and it would be horrid to block newbies just because they're
normal people who enjoy creating visually attractive messages <0.9 wink>.
Read the "What about HTML?" section in timtest.py.
You may also wish to remove the guard from

    if part.get_content_type() == "text/plain":
        text = html_re.sub(' ', text)

in tokenize(). Once you have a good test setup, you can try it both ways,
and the data will tell you which way works best for your normal mix.
Details of runs both ways on my c.l.py corpora are given in the "What about
HTML?" section mentioned before, and even there stripping HTML decorations
out of HTML-only messages had an insignificant effect on the f-p rate. It
did increase the f-n rate, though, precisely because HTML messages are so
very rare on c.l.py that they're *almost* certainly spam.
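
To try it both ways, something along these lines would do. It is a self-contained sketch, not the tokenizer in timtest.py: the html_re pattern here is a crude stand-in, and the function visible_text with its guard flag is hypothetical.

```python
import email
import re

# Assumed stand-in pattern; the real html_re in timtest.py may differ.
html_re = re.compile(r"<[^>]*>")

def visible_text(msg, guard=True):
    """Collect the text parts of an email.message.Message.

    With guard=True, HTML decorations are stripped only from
    text/plain parts; with guard=False they are stripped from
    every text part.
    """
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # latin-1 decodes any byte sequence, so this never raises.
        text = part.get_payload(decode=True).decode("latin-1")
        if not guard or part.get_content_type() == "text/plain":
            text = html_re.sub(' ', text)
        chunks.append(text)
    return "\n".join(chunks)

msg = email.message_from_string("Content-Type: text/html\n\n<b>hi</b>\n")
with_guard = visible_text(msg)                  # tags survive
without_guard = visible_text(msg, guard=False)  # tags replaced by spaces
```

Running the same corpus through both settings and comparing the f-p and f-n rates is the experiment described above.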