[Spambayes] test sets?
Tim Peters
tim.one@comcast.net
Mon, 09 Sep 2002 10:53:39 -0400
I see that most people here aren't on the spambayes-checkins list. You
should be aware of this change. If anyone thinks one of the headers in the
safe_headers set is prone to systematic bias in their test data, let me know
and I'll take it out.
-----Original Message-----
From: spambayes-checkins-bounces@python.org [mailto:spambayes-checkins-bounces@python.org] On Behalf Of Tim Peters
Sent: Monday, September 09, 2002 12:56 AM
To: spambayes-checkins@python.org
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.9,1.10
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29522
Modified Files:
tokenizer.py
Log Message:
Pure win, from enabling Anthony's "count the mere # of various header
lines, case-sensitively" on a small subset of header lines. This
avoids all the header lines the union of Greg and Barry told me *might*
be artifacts of Mailman and/or BruceG's (the spam collector's)
email setup. It's an open question how much this may merely be
discriminating newsgroup traffic from non-newsgroup mail, but I also
left out what I thought were obvious newsgroupy headers (like References:).
The presence of X-Complaints-To happens to be a very strong discriminator
in my data, and accounts for redeeming 6 of the 14 previous false
positives.
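As a rough illustration of "count the mere # of various header lines, case-sensitively", here is a minimal sketch. It is not the checked-in tokenizer code (the diff below is elided after the set definition). The function name count_safe_headers and the "header:NAME:COUNT" token format are assumptions made for the example, and it uses a frozenset and the modern email package rather than the Set class in the diff. Membership is tested on the lowercased name, but counts are kept per original spelling, so "Subject" and "SUBJECT" would produce different tokens.

```python
from collections import Counter
from email import message_from_string

# Same names as the safe_headers set in the diff; frozenset stands in for Set.
SAFE_HEADERS = frozenset("""
    abuse-reports-to date errors-to from importance in-reply-to
    message-id mime-version organization received reply-to return-path
    subject to user-agent x-abuse-info x-complaints-to x-face
""".split())


def count_safe_headers(raw_message):
    """Yield one token per distinct safe header spelling, with its count.

    Hypothetical helper: the token format is an assumption, not taken
    from tokenizer.py.
    """
    msg = message_from_string(raw_message)
    # msg.keys() preserves the original case and repeats duplicated headers.
    counts = Counter(
        name for name in msg.keys() if name.lower() in SAFE_HEADERS
    )
    for name, n in sorted(counts.items()):
        yield "header:%s:%d" % (name, n)


if __name__ == "__main__":
    sample = (
        "Received: from a.example\n"
        "Received: from b.example\n"
        "SUBJECT: hello\n"
        "X-Mailer: not in the safe set\n"
        "\n"
        "body\n"
    )
    for token in count_safe_headers(sample):
        print(token)
```

Only the presence and multiplicity of a header line matter here; the header values are never looked at, which is what keeps values specific to one mail setup out of the token stream.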
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.100 0.050 won -50.00%
0.025 0.000 won -100.00%
0.050 0.025 won -50.00%
0.000 0.000 tied
0.075 0.075 tied
0.050 0.025 won -50.00%
0.025 0.025 tied
0.025 0.000 won -100.00%
0.050 0.050 tied
0.050 0.000 won -100.00%
0.050 0.025 won -50.00%
0.000 0.000 tied
0.000 0.000 tied
0.075 0.050 won -33.33%
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.100 0.050 won -50.00%
won 9 times
tied 11 times
lost 0 times
total unique fp went from 14 to 8 won -42.86%
false negative percentages
0.291 0.255 won -12.37%
0.364 0.364 tied
0.254 0.254 tied
0.582 0.509 won -12.54%
0.545 0.436 won -20.00%
0.218 0.218 tied
0.218 0.182 won -16.51%
0.654 0.582 won -11.01%
0.364 0.327 won -10.16%
0.255 0.255 tied
0.400 0.254 won -36.50%
0.654 0.582 won -11.01%
0.618 0.545 won -11.81%
0.291 0.255 won -12.37%
0.291 0.291 tied
0.436 0.400 won -8.26%
0.436 0.291 won -33.26%
0.218 0.218 tied
0.255 0.218 won -14.51%
0.182 0.145 won -20.33%
won 14 times
tied 6 times
lost 0 times
total unique fn went from 101 to 89 won -11.88%
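For anyone checking the tables by hand: the third column appears to be the relative change from the first (before) column to the second (after) column, with lower being better. A minimal sketch under that assumption follows. The helper name percent_change and the handling of a zero "before" value are made up for the example and are not taken from the comparison script that produced the output.

```python
def percent_change(before, after):
    """Return a 'won'/'lost'/'tied' verdict for an error rate, lower is better.

    Hypothetical helper: mirrors the shape of the output above, not the
    actual comparison script.
    """
    if before == after:
        return "tied"
    verdict = "won" if after < before else "lost"
    if before == 0:
        # Relative change is undefined when the old rate was zero.
        return verdict
    delta = (after - before) / before * 100.0
    return "%s %+.2f%%" % (verdict, delta)


if __name__ == "__main__":
    print(percent_change(0.100, 0.050))  # a false positive row
    print(percent_change(14, 8))         # total unique fp
    print(percent_change(101, 89))       # total unique fn
```

With the totals above, 14 to 8 works out to (8 - 14) / 14 = -42.86% and 101 to 89 to (89 - 101) / 101 = -11.88%, matching the summary lines.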
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** tokenizer.py 8 Sep 2002 23:48:50 -0000 1.9
--- tokenizer.py 9 Sep 2002 04:56:12 -0000 1.10
***************
*** 745,748 ****
--- 745,770 ----
yield '.'.join(parts[:i])
+ # We're merely going to count the number of these, and case-sensitively.
+ safe_headers = Set("""
+ abuse-reports-to
+ date
+ errors-to
+ from
+ importance
+ in-reply-to
+ message-id
+ mime-version
+ organization
+ received
+ reply-to
+ return-path
+ subject
+ to
+ user-agent
+ x-abuse-info
+ x-complaints-to
+ x-face
+ """.split())
+
...