[Spambayes] test sets?

Mon, 09 Sep 2002 10:53:39 -0400

I see that most people here aren't on the spambayes-checkins list.  You
should be aware of this change.  If anyone thinks one of the headers in the
safe_headers set is prone to systematic bias in their test data, let me know
and I'll take it out.

-----Original Message-----
From: spambayes-checkins-bounces@python.org
[mailto:spambayes-checkins-bounces@python.org]On Behalf Of Tim Peters
Sent: Monday, September 09, 2002 12:56 AM
To: spambayes-checkins@python.org
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.9,1.10

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29522

Modified Files:
	tokenizer.py
Log Message:
Pure win, from enabling Anthony's "count the mere # of various header
lines, case-sensitively" on a small subset of header lines.  This
avoids all the header lines the union of Greg and Barry told me *might*
be artifacts of Mailman and/or BruceG's (the spam collector's)
email setup.  It's an open question how much this may merely be
discriminating newsgroup traffic from non-newsgroup mail, but I also
left out what I thought were obvious newsgroupy headers (like References:).
The presence of X-Complaints-To happens to be a very strong discriminator
in my data, and accounts for redeeming 6 of the 14 previous false
positives.

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.100  0.050  won    -50.00%
    0.025  0.000  won   -100.00%
    0.050  0.025  won    -50.00%
    0.000  0.000  tied
    0.075  0.075  tied
    0.050  0.025  won    -50.00%
    0.025  0.025  tied
    0.025  0.000  won   -100.00%
    0.050  0.050  tied
    0.050  0.000  won   -100.00%
    0.050  0.025  won    -50.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.075  0.050  won    -33.33%
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.100  0.050  won    -50.00%

won   9 times
tied 11 times
lost  0 times

total unique fp went from 14 to 8 won    -42.86%

false negative percentages
    0.291  0.255  won    -12.37%
    0.364  0.364  tied
    0.254  0.254  tied
    0.582  0.509  won    -12.54%
    0.545  0.436  won    -20.00%
    0.218  0.218  tied
    0.218  0.182  won    -16.51%
    0.654  0.582  won    -11.01%
    0.364  0.327  won    -10.16%
    0.255  0.255  tied
    0.400  0.254  won    -36.50%
    0.654  0.582  won    -11.01%
    0.618  0.545  won    -11.81%
    0.291  0.255  won    -12.37%
    0.291  0.291  tied
    0.436  0.400  won     -8.26%
    0.436  0.291  won    -33.26%
    0.218  0.218  tied
    0.255  0.218  won    -14.51%
    0.182  0.145  won    -20.33%

won  14 times
tied  6 times
lost  0 times

total unique fn went from 101 to 89 won    -11.88%
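
The percentage column in both tables appears to be the relative change from the first (before) column to the second (after) column. Below is a minimal sketch of that arithmetic, for checking the figures by hand. The function name relative_change is hypothetical and is not taken from the spambayes test drivers.

```python
def relative_change(before, after):
    """Return (verdict, percent change) for one before/after pair.

    Lower is better for error rates, so a decrease counts as "won".
    """
    if after == before:
        return ("tied", 0.0)
    verdict = "won" if after < before else "lost"
    if before == 0:
        # No meaningful percentage when starting from zero (assumed convention).
        return (verdict, float("inf"))
    return (verdict, (after - before) / before * 100.0)


# Totals quoted in the tables: unique fp 14 -> 8, unique fn 101 -> 89.
fp_verdict, fp_pct = relative_change(14, 8)
fn_verdict, fn_pct = relative_change(101, 89)
print("fp: %s %.2f%%" % (fp_verdict, fp_pct))
print("fn: %s %.2f%%" % (fn_verdict, fn_pct))
```

Rounded to two places, this gives -42.86% for the false positive total and -11.88% for the false negative total, matching the summary lines.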

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** tokenizer.py	8 Sep 2002 23:48:50 -0000	1.9
--- tokenizer.py	9 Sep 2002 04:56:12 -0000	1.10
***************
*** 745,748 ****
--- 745,770 ----
          yield '.'.join(parts[:i])

+ # We're merely going to count the number of these, and case-sensitively.
+ safe_headers = Set("""
+     abuse-reports-to
+     date
+     errors-to
+     from
+     importance
+     in-reply-to
+     message-id
+     mime-version
+     organization
+     received
+     reply-to
+     return-path
+     subject
+     to
+     user-agent
+     x-abuse-info
+     x-complaints-to
+     x-face
+ """.split())
+

...
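
For readers without the rest of the diff, here is a rough sketch of one way to read "count the mere # of various header lines, case-sensitively". It is not the code from tokenizer.py 1.10, which is elided here. The function name count_safe_headers, the "header:name:count" token format, and the sample message are assumptions made for illustration, and a built-in set stands in for the Set class used in 2002.

```python
from collections import Counter
from email import message_from_string

# Header names copied from the safe_headers set in the diff.
safe_headers = set("""
    abuse-reports-to
    date
    errors-to
    from
    importance
    in-reply-to
    message-id
    mime-version
    organization
    received
    reply-to
    return-path
    subject
    to
    user-agent
    x-abuse-info
    x-complaints-to
    x-face
""".split())


def count_safe_headers(msg):
    """Yield one token per safe header spelling, tagged with its line count.

    Membership is tested on the lowercased name, but the token keeps the
    original spelling, so "MIME-Version" and "Mime-Version" are counted
    separately (the assumed meaning of "case-sensitively").
    """
    counts = Counter(name for name in msg.keys()
                     if name.lower() in safe_headers)
    for name, n in counts.items():
        yield "header:%s:%d" % (name, n)


# Hypothetical sample message; X-Mailer is not in safe_headers and is ignored.
raw = (
    "Received: from a.example by b.example\n"
    "Received: from c.example by a.example\n"
    "MIME-Version: 1.0\n"
    "X-Mailer: hypothetical\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
tokens = sorted(count_safe_headers(message_from_string(raw)))
print(tokens)
```

Only the presence, spelling, and number of each header line feed into the tokens. The header values are not examined, which is why headers suspected of being artifacts of how the test data was collected were left out of the set.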