[Spambayes-checkins]
spambayes Options.py,1.5,1.6 bayes.ini,1.2,1.3 tokenizer.py,1.13,1.14
Tim Peters
tim_one@users.sourceforge.net
Tue, 10 Sep 2002 09:02:45 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30619
Modified Files:
Options.py bayes.ini tokenizer.py
Log Message:
Added option Tokenizer/count_all_header_lines. Defaults to False. You
can override by creating a bayescustomize.ini. When True, the
safe_headers option is ignored and Anthony's code to count *all* header
lines is used instead. This is almost certainly a Good Thing to do if
your ham and spam come from the same source, and almost certainly a
Bad Thing to do if they're from different sources (too many clues about
the source are likely to appear in the header-line counts).
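
A minimal sketch of the override described above, assuming the colon-delimited
option syntax shown in bayes.ini. It uses the modern standard-library
configparser rather than the project's own Options.py machinery, and the names
DEFAULTS, CUSTOM and load_options are hypothetical, chosen only for
illustration.

```python
import configparser

# Stand-in for the shipped defaults (only the option from this checkin).
DEFAULTS = """
[Tokenizer]
count_all_header_lines: False
"""

# Stand-in for a user-created bayescustomize.ini.
CUSTOM = """
[Tokenizer]
count_all_header_lines: True
"""


def load_options(*sources):
    """Read each source in order; later sources override earlier ones."""
    parser = configparser.ConfigParser()
    for text in sources:
        parser.read_string(text)
    return parser


opts = load_options(DEFAULTS, CUSTOM)
count_all = opts.getboolean("Tokenizer", "count_all_header_lines")
```

Reading the defaults first and the customization second means any option left
out of the customization keeps its default value.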
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** Options.py 10 Sep 2002 01:53:12 -0000 1.5
--- Options.py 10 Sep 2002 16:02:40 -0000 1.6
***************
*** 18,21 ****
--- 18,22 ----
'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
'safe_headers': ('get', lambda s: Set(s.split())),
+ 'count_all_header_lines': boolean_cracker,
},
'TestDriver': {'nbuckets': int_cracker,
***************
*** 28,32 ****
'show_histograms': boolean_cracker,
'show_best_discriminators': boolean_cracker,
! }
}
--- 29,33 ----
'show_histograms': boolean_cracker,
'show_best_discriminators': boolean_cracker,
! },
}
Index: bayes.ini
===================================================================
RCS file: /cvsroot/spambayes/spambayes/bayes.ini,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** bayes.ini 10 Sep 2002 00:06:37 -0000 1.2
--- bayes.ini 10 Sep 2002 16:02:41 -0000 1.3
***************
*** 5,14 ****
retain_pure_html_tags: False
! # tokenizer.Tokenizer.tokenize_headers() generates tokens just counting
! # the number of instances of the headers in this set, in a case-sensitive
! # way. Depending on data collection, some headers aren't safe to count.
# For example, if ham is collected from a mailing list but spam from your
# regular inbox traffic, the presence of a header like List-Info will be a
! # very strong ham clue, but a bogus one.
safe_headers: abuse-reports-to
date
--- 5,22 ----
retain_pure_html_tags: False
! # Generate tokens just counting the number of instances of each kind of
! # header line, in a case-sensitive way.
! #
! # Depending on data collection, some headers aren't safe to count.
# For example, if ham is collected from a mailing list but spam from your
# regular inbox traffic, the presence of a header like List-Info will be a
! # very strong ham clue, but a bogus one. In that case, set
! # count_all_header_lines to False, and adjust safe_headers instead.
!
! count_all_header_lines: False
!
! # Like count_all_header_lines, but restricted to headers in this list.
! # safe_headers is ignored when count_all_header_lines is true.
!
safe_headers: abuse-reports-to
date
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** tokenizer.py 9 Sep 2002 20:37:14 -0000 1.13
--- tokenizer.py 10 Sep 2002 16:02:41 -0000 1.14
***************
*** 839,870 ****
yield 'received:' + tok
! # XXX Following is a great idea due to Anthony Baxter. I can't use it
! # XXX on my test data because the header lines are so different between
! # XXX my ham and spam that it makes a large improvement for bogus
! # XXX reasons. So it's commented out. But it's clearly a good thing
! # XXX to do on "normal" data, and subsumes the Organization trick above
! # XXX in a much more general way, yet at comparable cost.
!
! # X-UIDL:
! # Anthony Baxter's idea. This has spamprob 0.99! The value
! # is clearly irrelevant, just the presence or absence matters.
! # However, it's extremely rare in my spam sets, so doesn't
! # have much value.
! #
! # As also suggested by Anthony, we can capture all such header
! # oddities just by generating tags for the count of how many
! # times each header field appears.
! ##x2n = {}
! ##for x in msg.keys():
! ## x2n[x] = x2n.get(x, 0) + 1
! ##for x in x2n.items():
! ## yield "header:%s:%d" % x
!
! # Do a "safe" approximation to that for now.
! safe_headers = options.safe_headers
x2n = {}
! for x in msg.keys():
! if x.lower() in safe_headers:
x2n[x] = x2n.get(x, 0) + 1
for x in x2n.items():
yield "header:%s:%d" % x
--- 839,859 ----
yield 'received:' + tok
! # As suggested by Anthony Baxter, merely counting the number of
! # header lines, and in a case-sensitive way, has real value.
! # For example, all-caps SUBJECT is a strong spam clue, while
! # X-Complaints-To a strong ham clue.
x2n = {}
! if options.count_all_header_lines:
! for x in msg.keys():
x2n[x] = x2n.get(x, 0) + 1
+ else:
+ # Do a "safe" approximation to that. When spam and ham are
+ # collected from different sources, the count of some header
+ # lines can be too strong a discriminator for accidental
+ # reasons.
+ safe_headers = options.safe_headers
+ for x in msg.keys():
+ if x.lower() in safe_headers:
+ x2n[x] = x2n.get(x, 0) + 1
for x in x2n.items():
yield "header:%s:%d" % x
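
The counting logic in the tokenizer.py hunk above can be tried in isolation
with a sketch like the following. It is not the project's tokenizer: the
function name header_count_tokens, its keyword arguments, and the sample
message RAW are assumptions made for illustration, and the two branches of the
committed code are folded into a single condition.

```python
from email import message_from_string


def header_count_tokens(msg, count_all_header_lines=False,
                        safe_headers=frozenset()):
    """Yield one 'header:<name>:<count>' token per distinct header name.

    Names are counted case-sensitively, so 'SUBJECT' and 'Subject' are
    different keys. When count_all_header_lines is false, only headers
    whose lowercased name is in safe_headers are counted.
    """
    x2n = {}
    for name in msg.keys():
        if count_all_header_lines or name.lower() in safe_headers:
            x2n[name] = x2n.get(name, 0) + 1
    for name, n in x2n.items():
        yield "header:%s:%d" % (name, n)


# Hypothetical sample message with a repeated header and an all-caps one.
RAW = (
    "SUBJECT: buy now\n"
    "Received: from a\n"
    "Received: from b\n"
    "Date: Tue, 10 Sep 2002 09:02:45 -0700\n"
    "\n"
    "body\n"
)

msg = message_from_string(RAW)
all_tokens = sorted(header_count_tokens(msg, count_all_header_lines=True))
safe_tokens = sorted(header_count_tokens(msg, safe_headers={"date"}))
```

With every header counted, the repeated Received lines and the all-caps
SUBJECT each produce their own token. With only "date" marked safe, a single
token survives.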