[Spambayes] More proposed hammie changes: use Options
Rob Hooft
rob@hooft.net
Sun Oct 27 08:11:53 2002
Attached are some more changes I'd like to propose for hammie:
* Add a -D option that reverses the -d option (use a pickle instead
of a database)
* Make the default choice between pickle and database configurable
* Add a showclue limit to restrict the clues added to the
X-Hammie-Disposition header. I found the header was getting rather
large for many of my messages. This option makes it show only the
strongest clues in either direction.
* Add a [Hammie] section to the configuration file to hold all of
these hammie settings, so that hammie doesn't always need to be
run with half a dozen options to work (I always forget one when
I'm trying it interactively).
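
To illustrate the showclue limit, here is a minimal standalone sketch
modelled on the formatclues() change in the attached patch. It is written
for a current Python rather than the 2002 code base, and the function name
format_clues and the sample clues are made up for the example.

```python
def format_clues(clues, showclue=0.5, sep="; "):
    """Format (word, prob) clues, keeping only the strong ones.

    A clue is kept if its probability is <= showclue or >= 1 - showclue.
    Clues whose name starts with '*' are always kept, as in the patch.
    """
    return sep.join(
        "%r: %.2f" % (word, prob)
        for word, prob in clues
        if word[0] == "*" or prob <= showclue or prob >= 1.0 - showclue
    )


# Hypothetical clues, for illustration only.
clues = [("*H*", 0.02), ("python", 0.03), ("the", 0.45), ("viagra", 0.99)]

# With the default of 0.5 every clue passes the test.
print(format_clues(clues))
# With 0.1 the weak clue 'the' (0.45) is dropped.
print(format_clues(clues, showclue=0.1))
```

With showclue at 0.5 the test is true for every probability, which is why
that default shows all clues.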
Furthermore, the patch changes many of the ' and " characters in the
default configuration string in Options.py so that the parser in
emacs/python-mode.el is now happy with it.
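
For the -d/-D and [Hammie] items in the list above, the interaction between
the configured default and the command-line flags can be sketched roughly as
follows. This is only an illustration under assumptions: it uses the standard
configparser and getopt modules of a current Python rather than the Options
module in the patch, and resolve_usedb and SAMPLE_INI are hypothetical names.

```python
import configparser
import getopt

# Hypothetical configuration text; the option names follow the patch.
SAMPLE_INI = """
[Hammie]
usedb: True
showclue: 0.1
"""


def resolve_usedb(argv, ini_text=SAMPLE_INI):
    """Return True for a database, False for a pickle.

    The [Hammie] usedb setting supplies the default; -d and -D on the
    command line override it, and the last flag given wins.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    usedb = parser.getboolean("Hammie", "usedb", fallback=False)

    # Same option string as in the patched hammie.py.
    opts, _args = getopt.getopt(argv, "hdDfg:s:p:u:r")
    for opt, _arg in opts:
        if opt == "-d":
            usedb = True
        elif opt == "-D":
            usedb = False
    return usedb


print(resolve_usedb([]))      # default taken from the [Hammie] section
print(resolve_usedb(["-D"]))  # -D forces the pickle
```

The point of -D is that once usedb is set to True in the configuration file,
there is still a way to ask for the pickle for a single run.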
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
---------------------- multipart/mixed attachment
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.59
diff -u -r1.59 Options.py
--- Options.py 27 Oct 2002 05:26:01 -0000 1.59
+++ Options.py 27 Oct 2002 08:02:09 -0000
@@ -48,7 +48,7 @@
# Generate tokens just counting the number of instances of each kind of
# header line, in a case-sensitive way.
#
-# Depending on data collection, some headers aren't safe to count.
+# Depending on data collection, some headers are not safe to count.
# For example, if ham is collected from a mailing list but spam from your
# regular inbox traffic, the presence of a header like List-Info will be a
# very strong ham clue, but a bogus one. In that case, set
@@ -150,7 +150,7 @@
#
# The idea is that if something scores < hamc, it's called ham; if
# something scores >= spamc, it's called spam; and everything else is
-# called "I'm not sure" -- the middle ground.
+# called 'I am not sure' -- the middle ground.
#
# Note that cvcost.py does a similar analysis.
#
@@ -169,7 +169,7 @@
# Display spam when
# show_spam_lo <= spamprob <= show_spam_hi
-# and likewise for ham. The defaults here don't show anything.
+# and likewise for ham. The defaults here do not show anything.
show_spam_lo: 1.0
show_spam_hi: 0.0
show_ham_lo: 1.0
@@ -179,8 +179,8 @@
show_false_negatives: False
show_unsure: False
-# Near the end of Driver.test(), you can get a listing of the 'best
-# discriminators' in the words from the training sets. These are the
+# Near the end of Driver.test(), you can get a listing of the best
+# discriminators in the words from the training sets. These are the
# words whose WordInfo.killcount values are highest, meaning they most
# often were among the most extreme clues spamprob() found. The number
# of best discriminators to show is given by show_best_discriminators;
@@ -196,7 +196,7 @@
# pickle_basename, the extension is .pik, and increasing integers are
# appended to pickle_basename. By default (if save_trained_pickles is
# true), the filenames are class1.pik, class2.pik, ... If a file of that
-# name already exists, it's overwritten. pickle_basename is ignored when
+# name already exists, it is overwritten. pickle_basename is ignored when
# save_trained_pickles is false.
# if save_histogram_pickles is true, Driver.train() saves a binary
@@ -218,9 +218,9 @@
# training each on N-1 sets, and the predicting against the set not trained
# on. By default, it does this in a clever way, learning *and* unlearning
# sets as it goes along, so that it never needs to train on N-1 sets in one
-# gulp after the first time. Setting this option true forces "one gulp
-# from-scratch" training every time. There used to be a set of combining
-# schemes that needed this, but now it's just in case you're paranoid <wink>.
+# gulp after the first time. Setting this option true forces ''one gulp
+# from-scratch'' training every time. There used to be a set of combining
+# schemes that needed this, but now it is just in case you are paranoid <wink>.
build_each_classifier_from_scratch: False
[Classifier]
@@ -230,15 +230,15 @@
max_discriminators: 150
# These two control the prior assumption about word probabilities.
-# "x" is essentially the probability given to a word that's never been
+# "x" is essentially the probability given to a word that has never been
# seen before. Nobody has reported an improvement via moving it away
# from 1/2.
# "s" adjusts how much weight to give the prior assumption relative to
# the probabilities estimated by counting. At s=0, the counting estimates
# are believed 100%, even to the extent of assigning certainty (0 or 1)
-# to a word that's appeared in only ham or only spam. This is a disaster.
+# to a word that has appeared in only ham or only spam. This is a disaster.
# As s tends toward infintity, all probabilities tend toward x. All
-# reports were that a value near 0.4 worked best, so this doesn't seem to
+# reports were that a value near 0.4 worked best, so this does not seem to
# be corpus-dependent.
# NOTE: Gary Robinson previously used a different formula involving 'a'
# and 'x'. The 'x' here is the same as before. The 's' here is the old
@@ -249,11 +249,11 @@
# When scoring a message, ignore all words with
# abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
-# tests over Robinson's base scheme. 0.1 appeared to work well across
+# tests over Robinsons base scheme. 0.1 appeared to work well across
# all corpora.
robinson_minimum_prob_strength: 0.1
-# The combining scheme currently detailed on Gary Robinon's web page.
+# The combining scheme currently detailed on Gary Robinons web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
# scores (near 0.0 or 1.0), but the tail ends of the ham and spam
@@ -261,15 +261,15 @@
use_gary_combining: False
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
-# follows the chi-squared distribution with 2*n degrees of freedom. That's
-# the "provably most-sensitive" test Gary's original scheme was monotonic
+# follows the chi-squared distribution with 2*n degrees of freedom. That is
+# the "provably most-sensitive" test Garys original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
-# lots of the mistakes live. This is the best method so far on Tim's data.
-# One systematic benefit is that it's immune to "cancellation disease". One
-# systematic drawback is that it's sensitive to *any* deviation from a
-# uniform distribution, regardless of whether that's actually evidence of
+# lots of the mistakes live. This is the best method so far on Tims data.
+# One systematic benefit is that it is immune to "cancellation disease". One
+# systematic drawback is that it is sensitive to *any* deviation from a
+# uniform distribution, regardless of whether that is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H)).
# In practice, it appears that setting ham_cutoff=0.05, and spam_cutoff=0.95,
@@ -278,6 +278,26 @@
# with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets
# (original c.l.p data, his own email, and newer general python.org traffic).
use_chi_squared_combining: True
+
+[Hammie]
+# The name of the header that hammie adds to an E-mail in filter mode
+header: X-Hammie-Disposition
+
+# The default database path used by hammie
+defaultdb: hammie.db
+
+# The range of clues that are added to the "hammie" header in the E-mail
+# All clues that have their probability smaller than this number, or larger
+# than one minus this number are added to the header such that you can see
+# why spambayes thinks this is ham/spam or why it is unsure. The default is
+# to show all clues, but you can reduce that by setting showclue to a lower
+# value, such as 0.1 (which Rob is using)
+showclue: 0.5
+
+# hammie can use either a database (quick to score one message) or a pickle
+# (quick to train on huge amounts of messages). Set this to True to use a
+# database by default.
+usedb: False
"""
int_cracker = ('getint', None)
@@ -333,6 +353,12 @@
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
},
+ 'Hammie': {'header': string_cracker,
+ 'defaultdb': string_cracker,
+ 'showclue': float_cracker,
+ 'usedb': boolean_cracker,
+ },
+
}
def _warn(msg):
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.30
diff -u -r1.30 hammie.py
--- hammie.py 27 Oct 2002 03:59:52 -0000 1.30
+++ hammie.py 27 Oct 2002 08:02:11 -0000
@@ -22,11 +22,14 @@
Only meaningful with the -u option.
-p FILE
use file as the persistent store. loads data from this file if it
- exists, and saves data to this file at the end. Default: %(DEFAULTDB)s
+ exists, and saves data to this file at the end.
+ Default: %(DEFAULTDB)s
-d
use the DBM store instead of cPickle. The file is larger and
creating it is slower, but checking against it is much faster,
- especially for large word databases.
+ especially for large word databases. Default: %(USEDB)s
+ -D
+ the reverse of -d: use the cPickle instead of DBM
-f
run as a filter: read a single message from stdin, add an
%(DISPHEADER)s header, and write it to stdout. If you want to
@@ -52,15 +55,21 @@
program = sys.argv[0] # For usage(); referenced by docstring above
# Name of the header to add in filter mode
-DISPHEADER = "X-Hammie-Disposition"
+DISPHEADER = options.header
# Default database name
-DEFAULTDB = "hammie.db"
+DEFAULTDB = options.defaultdb
# Probability at which a message is considered spam
SPAM_THRESHOLD = options.spam_cutoff
HAM_THRESHOLD = options.ham_cutoff
+# Probability limit for a clue to be added to the DISPHEADER
+SHOWCLUE = options.showclue
+
+# Use a database? If False, use a pickle
+USEDB = options.usedb
+
# Tim's tokenizer kicks far more booty than anything I would have
# written. Score one for analysis ;)
from tokenizer import tokenize
@@ -208,7 +217,10 @@
def formatclues(self, clues, sep="; "):
"""Format the clues into something readable."""
- return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues])
+ return sep.join(["%r: %.2f" % (word, prob)
+ for word, prob in clues
+ if (word[0] == '*' or
+ prob <= SHOWCLUE or prob >= 1.0 - SHOWCLUE)])
def score(self, msg, evidence=False):
"""Score (judge) a message.
@@ -377,7 +389,7 @@
def main():
"""Main program; parse options and go."""
try:
- opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:r')
+ opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r')
except getopt.error, msg:
usage(2, msg)
@@ -389,7 +401,8 @@
spam = []
unknown = []
reverse = 0
- do_filter = usedb = False
+ do_filter = False
+ usedb = USEDB
for opt, arg in opts:
if opt == '-h':
usage(0)
@@ -401,6 +414,8 @@
pck = arg
elif opt == "-d":
usedb = True
+ elif opt == "-D":
+ usedb = False
elif opt == "-f":
do_filter = True
elif opt == '-u':