[Spambayes-checkins] spambayes Options.py,1.5,1.6 bayes.ini,1.2,1.3 tokenizer.py,1.13,1.14

Tim Peters tim_one@users.sourceforge.net
Tue, 10 Sep 2002 09:02:45 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30619

Modified Files:
	Options.py bayes.ini tokenizer.py 
Log Message:
Added option Tokenizer/count_all_header_lines.  Defaults to False.  You
can override by creating a bayescustomize.ini.  When True, the
safe_headers option is ignored and Anthony's code to count *all* header
lines is used instead.  This is almost certainly a Good Thing to do if
your ham and spam come from the same source, and almost certainly a
Bad Thing to do if they're from different sources (too many clues about
the source are likely to appear in the header-line counts).
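
As a rough illustration of the override described above, here is a minimal
sketch in modern Python 3 using the standard configparser module. It is not
the project's Options.py: the helper load_tokenizer_options and the two inline
strings standing in for bayes.ini and bayescustomize.ini are assumptions made
for this example only.

```python
import configparser

# Stand-in for the shipped bayes.ini defaults (abbreviated, hypothetical).
DEFAULTS = """\
[Tokenizer]
count_all_header_lines: False
safe_headers: abuse-reports-to
    date
"""

# Stand-in for a user's bayescustomize.ini (hypothetical contents).
CUSTOM = """\
[Tokenizer]
count_all_header_lines: True
"""


def load_tokenizer_options(*sources):
    """Read each ini text in order; later sources override earlier ones."""
    parser = configparser.ConfigParser()
    for text in sources:
        parser.read_string(text)
    return {
        "count_all_header_lines": parser.getboolean(
            "Tokenizer", "count_all_header_lines"),
        # Whitespace-separated names, as in the safe_headers cracker.
        "safe_headers": set(parser.get("Tokenizer", "safe_headers").split()),
    }


if __name__ == "__main__":
    print(load_tokenizer_options(DEFAULTS))
    print(load_tokenizer_options(DEFAULTS, CUSTOM))
```

Reading the defaults first and the customization second means only the
options a user actually sets are overridden; safe_headers keeps its default
value even though it is ignored once count_all_header_lines is True.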


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** Options.py	10 Sep 2002 01:53:12 -0000	1.5
--- Options.py	10 Sep 2002 16:02:40 -0000	1.6
***************
*** 18,21 ****
--- 18,22 ----
      'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
                    'safe_headers': ('get', lambda s: Set(s.split())),
+                   'count_all_header_lines': boolean_cracker,
                   },
      'TestDriver': {'nbuckets': int_cracker,
***************
*** 28,32 ****
                     'show_histograms': boolean_cracker,
                     'show_best_discriminators': boolean_cracker,
!                   }
  }
  
--- 29,33 ----
                     'show_histograms': boolean_cracker,
                     'show_best_discriminators': boolean_cracker,
!                   },
  }
  

Index: bayes.ini
===================================================================
RCS file: /cvsroot/spambayes/spambayes/bayes.ini,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** bayes.ini	10 Sep 2002 00:06:37 -0000	1.2
--- bayes.ini	10 Sep 2002 16:02:41 -0000	1.3
***************
*** 5,14 ****
  retain_pure_html_tags: False
  
! # tokenizer.Tokenizer.tokenize_headers() generates tokens just counting
! # the number of instances of the headers in this set, in a case-sensitive
! # way.  Depending on data collection, some headers aren't safe to count.
  # For example, if ham is collected from a mailing list but spam from your
  # regular inbox traffic, the presence of a header like List-Info will be a
! # very strong ham clue, but a bogus one.
  safe_headers: abuse-reports-to
      date
--- 5,22 ----
  retain_pure_html_tags: False
  
! # Generate tokens just counting the number of instances of each kind of
! # header line, in a case-sensitive way.
! #
! # Depending on data collection, some headers aren't safe to count.
  # For example, if ham is collected from a mailing list but spam from your
  # regular inbox traffic, the presence of a header like List-Info will be a
! # very strong ham clue, but a bogus one.  In that case, set
! # count_all_header_lines to False, and adjust safe_headers instead.
! 
! count_all_header_lines: False
! 
! # Like count_all_header_lines, but restricted to headers in this list.
! # safe_headers is ignored when count_all_header_lines is true.
! 
  safe_headers: abuse-reports-to
      date

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** tokenizer.py	9 Sep 2002 20:37:14 -0000	1.13
--- tokenizer.py	10 Sep 2002 16:02:41 -0000	1.14
***************
*** 839,870 ****
                              yield 'received:' + tok
  
!         # XXX Following is a great idea due to Anthony Baxter.  I can't use it
!         # XXX on my test data because the header lines are so different between
!         # XXX my ham and spam that it makes a large improvement for bogus
!         # XXX reasons.  So it's commented out.  But it's clearly a good thing
!         # XXX to do on "normal" data, and subsumes the Organization trick above
!         # XXX in a much more general way, yet at comparable cost.
! 
!         # X-UIDL:
!         # Anthony Baxter's idea.  This has spamprob 0.99!  The value
!         # is clearly irrelevant, just the presence or absence matters.
!         # However, it's extremely rare in my spam sets, so doesn't
!         # have much value.
!         #
!         # As also suggested by Anthony, we can capture all such header
!         # oddities just by generating tags for the count of how many
!         # times each header field appears.
!         ##x2n = {}
!         ##for x in msg.keys():
!         ##    x2n[x] = x2n.get(x, 0) + 1
!         ##for x in x2n.items():
!         ##    yield "header:%s:%d" % x
! 
!         # Do a "safe" approximation to that for now.
!         safe_headers = options.safe_headers
          x2n = {}
!         for x in msg.keys():
!             if x.lower() in safe_headers:
                  x2n[x] = x2n.get(x, 0) + 1
          for x in x2n.items():
              yield "header:%s:%d" % x
--- 839,859 ----
                              yield 'received:' + tok
  
!         # As suggested by Anthony Baxter, merely counting the number of
!         # header lines, and in a case-sensitive way, has real value.
!         # For example, all-caps SUBJECT is a strong spam clue, while
!         # X-Complaints-To a strong ham clue.
          x2n = {}
!         if options.count_all_header_lines:
!             for x in msg.keys():
                  x2n[x] = x2n.get(x, 0) + 1
+         else:
+             # Do a "safe" approximation to that.  When spam and ham are
+             # collected from different sources, the count of some header
+             # lines can be too strong a discriminator for accidental
+             # reasons.
+             safe_headers = options.safe_headers
+             for x in msg.keys():
+                 if x.lower() in safe_headers:
+                     x2n[x] = x2n.get(x, 0) + 1
          for x in x2n.items():
              yield "header:%s:%d" % x
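
The two branches in the tokenizer.py change above can be tried in isolation.
The following is a standalone sketch in modern Python 3, not the committed
code: the function name header_count_tokens, its keyword arguments, and the
sample message are assumptions for illustration. It relies on
email.message.Message.keys() returning one entry per header line, duplicates
included, with the original capitalization preserved.

```python
from email import message_from_string


def header_count_tokens(msg, count_all_header_lines=False,
                        safe_headers=frozenset()):
    """Yield one 'header:NAME:COUNT' token per distinct header name.

    Names are counted case-sensitively.  When count_all_header_lines is
    False, only names whose lowercase form is in safe_headers are counted.
    """
    counts = {}
    for name in msg.keys():
        if count_all_header_lines or name.lower() in safe_headers:
            counts[name] = counts.get(name, 0) + 1
    for name, n in counts.items():
        yield "header:%s:%d" % (name, n)


# Hypothetical sample message with a duplicated header and an all-caps one.
RAW = "SUBJECT: hi\nReceived: a\nReceived: b\nDate: today\n\nbody\n"
msg = message_from_string(RAW)

if __name__ == "__main__":
    print(sorted(header_count_tokens(msg, count_all_header_lines=True)))
    print(sorted(header_count_tokens(msg, safe_headers={"date"})))
```

With count_all_header_lines enabled, the all-caps SUBJECT and the doubled
Received line each produce their own token; with the safe list restricted to
date, only the Date header is counted.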