[Spambayes] Something to test

Tim Peters tim.one@comcast.net
Sun Nov 3 08:49:04 2002

This little patch arranges to create "noheader:HEADERNAME" tokens for
headers in options.safe_headers that *don't* appear in a msg's headers.  On
my fat c.l.py test it's a small theoretical improvement:  best-cost falls
from $26.80 to $22.00, by knocking down the score of the second-worst
hopeless FP just enough so that redeeming it *could* be traded away for an
increase in the Unsure rate.  That's not realistic, though (the spam_cutoff
value needed to redeem that FP is no longer insane, but is still
*unreasonably* high).

I'm keener on it because it eliminated a few difficult FPs without changing
cutoffs, in three smaller tests on different test data.  I haven't yet run a
test where it hurt, and it has helped several times.

This captures the useful (in my data) part of what Anthony's tokenization of
Reply-To accomplished, without needing to tokenize the Reply-To content.  The
thing that helped me there was that tokenizing Reply-To inadvertently
generated a token for the *absence* of a Reply-To header, and that's a ham
clue in my data, provided that the classifier can see it.  One effect of the
patch is to generate a "noheader:reply-to" token when no Reply-To is found
in the headers.  Another is that the lack of an Organization header becomes
a spam clue in my data.  Sometimes more than one of these cooperate to help
push a difficult case to "the right side" of a cutoff.
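
To make the idea concrete, here is a minimal standalone sketch of it, not the
actual tokenizer.py code.  The function name header_presence_tokens, the
lowercasing of every header name, and the sorted output order are assumptions
made for illustration, and it uses the built-in set type where the patch uses
Set.

```python
def header_presence_tokens(header_names, safe_headers):
    """Yield count tokens for headers seen, then "noheader:" tokens
    for each safe header that never appeared.

    Hypothetical helper for illustration only; the real tokenizer
    works on an email Message and reads options.safe_headers.
    """
    # Count how many times each header name occurs (case-insensitively).
    counts = {}
    for name in header_names:
        key = name.lower()
        counts[key] = counts.get(key, 0) + 1

    # One token per header that is present, carrying its count.
    for key, n in sorted(counts.items()):
        yield "header:%s:%d" % (key, n)

    # One token per safe header that is absent from the message.
    wanted = set(h.lower() for h in safe_headers)
    for key in sorted(wanted - set(counts)):
        yield "noheader:" + key


if __name__ == "__main__":
    seen = ["From", "Received", "Received", "Subject"]
    safe = ["reply-to", "organization", "subject"]
    for tok in header_presence_tokens(seen, safe):
        print(tok)
```

With the sample input above, the last two tokens printed are
"noheader:organization" and "noheader:reply-to", which is the kind of
absence clue the classifier could not see before.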

Index: tokenizer.py
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.60
diff -c -r1.60 tokenizer.py
*** tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
--- tokenizer.py        3 Nov 2002 08:31:44 -0000
*** 1178,1183 ****
--- 1178,1185 ----
                      x2n[x] = x2n.get(x, 0) + 1
          for x in x2n.items():
              yield "header:%s:%d" % x
+         for x in options.safe_headers - Set([k.lower() for k in x2n]):
+             yield "noheader:" + x

      def tokenize_body(self, msg, maxword=options.skip_max_word_size):
          """Generate a stream of tokens from an email Message.