[Spambayes-checkins] spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2
Tim Peters <tim_one@users.sourceforge.net>
Mon Nov 11 01:59:08 2002
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5402
Modified Files:
Options.py classifier.py weakloop.py
Log Message:
For the benefit of future generations, renamed some options:
    Old                               New
    ---                               ---
    robinson_probability_x            unknown_word_prob
    robinson_probability_s            unknown_word_strength
    robinson_minimum_prob_strength    minimum_prob_strength
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** Options.py 8 Nov 2002 04:06:23 -0000 1.67
--- Options.py 11 Nov 2002 01:59:06 -0000 1.68
***************
*** 241,268 ****
# These two control the prior assumption about word probabilities.
! # "x" is essentially the probability given to a word that has never been
! # seen before. Nobody has reported an improvement via moving it away
! # from 1/2.
! # "s" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting. At s=0, the counting estimates
! # are believed 100%, even to the extent of assigning certainty (0 or 1)
! # to a word that has appeared in only ham or only spam. This is a disaster.
! # As s tends toward infinity, all probabilities tend toward x. All
! # reports were that a value near 0.4 worked best, so this does not seem to
! # be corpus-dependent.
! # NOTE: Gary Robinson previously used a different formula involving 'a'
! # and 'x'. The 'x' here is the same as before. The 's' here is the old
! # 'a' divided by 'x'.
! robinson_probability_x: 0.5
! robinson_probability_s: 0.45
# When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
! # tests over Robinson's base scheme. 0.1 appeared to work well across
! # all corpora.
! robinson_minimum_prob_strength: 0.1
! # The combining scheme currently detailed on Gary Robinson's web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
--- 241,268 ----
# These two control the prior assumption about word probabilities.
! # unknown_word_prob is essentially the probability given to a word that
! # has never been seen before. Nobody has reported an improvement via moving
! # it away from 1/2, although Tim has measured a mean spamprob of a bit over
! # 0.5 (0.51-0.55) in 3 well-trained classifiers.
! #
! # unknown_word_strength adjusts how much weight to give the prior assumption
! # relative to the probabilities estimated by counting. At 0, the counting
! # estimates are believed 100%, even to the extent of assigning certainty
! # (0 or 1) to a word that has appeared in only ham or only spam. This
! # is a disaster.
! #
! # As unknown_word_strength tends toward infinity, all probabilities tend
! # toward unknown_word_prob. All reports were that a value near 0.4 worked
! # best, so this does not seem to be corpus-dependent.
! unknown_word_prob: 0.5
! unknown_word_strength: 0.45
# When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
! # tests. 0.1 appeared to work well across all corpora.
! minimum_prob_strength: 0.1
! # The combining scheme currently detailed on Gary Robinson's web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
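The prior adjustment the comments above describe can be sketched in a few lines of Python. Function and argument names here are illustrative, not classifier.py's actual internals; the formula is the one the comments give, prob = (s*x + n*p) / (s + n), with x = unknown_word_prob and s = unknown_word_strength:

```python
def adjusted_spamprob(spamcount, hamcount, nspam, nham, x=0.5, s=0.45):
    """Prior-adjusted word probability: (s*x + n*p) / (s + n).

    A never-seen word (n == 0) scores exactly x; as counts grow, the
    counting estimate p dominates, but never reaches certainty (0 or 1),
    avoiding the "disaster" at s=0 described above.
    """
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    total = hamratio + spamratio
    p = spamratio / total if total else x   # raw counting estimate
    n = hamcount + spamcount
    return (s * x + n * p) / (s + n)

def is_clue(prob, mindist=0.1):
    """minimum_prob_strength filter: ignore words too close to 0.5."""
    return abs(prob - 0.5) >= mindist
```

Note that a word seen only in spam gets a probability near, but strictly below, 1.0, which is exactly the effect the strength parameter exists to provide.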
***************
*** 272,284 ****
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom. That is
! # the "provably most-sensitive" test Gary's original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live. This is the best method so far on Tims data.
! # One systematic benefit is that it is immune to "cancellation disease". One
! # systematic drawback is that it is sensitive to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H).
--- 272,284 ----
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom. This is
! # the "provably most-sensitive" test the original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live. This is the best method so far.
! # One systematic benefit is its immunity to "cancellation disease". One
! # systematic drawback is its sensitivity to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H).
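The chi-squared combining sketched in these comments can be written out directly. This is a minimal illustration, not the spambayes implementation; chi2Q here is the upper-tail probability of the chi-squared distribution, which has a closed form for even degrees of freedom:

```python
from math import exp, log

def chi2Q(x2, v):
    """P(chi-squared with v d.o.f. > x2), for even v only."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine word spamprobs as described above.

    -2*sum(ln(p_i)) is chi-squared with 2*n d.o.f. under the uniform
    hypothesis; S and H measure spamminess and hamminess, and the final
    score uses Rob Hooft's (S-H+1)/2 instead of S/(S+H).
    """
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

Uniformly middling evidence scores exactly 0.5 by symmetry, while consistent evidence drives the score toward an extreme, matching the "usually very extreme in its judgment" behavior noted above.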
***************
*** 381,387 ****
},
'Classifier': {'max_discriminators': int_cracker,
! 'robinson_probability_x': float_cracker,
! 'robinson_probability_s': float_cracker,
! 'robinson_minimum_prob_strength': float_cracker,
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
--- 381,387 ----
},
'Classifier': {'max_discriminators': int_cracker,
! 'unknown_word_prob': float_cracker,
! 'unknown_word_strength': float_cracker,
! 'minimum_prob_strength': float_cracker,
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.49
retrieving revision 1.50
diff -C2 -d -r1.49 -r1.50
*** classifier.py 7 Nov 2002 22:30:05 -0000 1.49
--- classifier.py 11 Nov 2002 01:59:06 -0000 1.50
***************
*** 70,74 ****
# a word is no longer being used, it's just wasting space.
! def __init__(self, atime, spamprob=options.robinson_probability_x):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
--- 70,74 ----
# a word is no longer being used, it's just wasting space.
! def __init__(self, atime, spamprob=options.unknown_word_prob):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
***************
*** 322,327 ****
nspam = float(self.nspam or 1)
! S = options.robinson_probability_s
! StimesX = S * options.robinson_probability_x
for word, record in self.wordinfo.iteritems():
--- 322,327 ----
nspam = float(self.nspam or 1)
! S = options.unknown_word_strength
! StimesX = S * options.unknown_word_prob
for word, record in self.wordinfo.iteritems():
***************
*** 449,454 ****
def _getclues(self, wordstream):
! mindist = options.robinson_minimum_prob_strength
! unknown = options.robinson_probability_x
clues = [] # (distance, prob, word, record) tuples
--- 449,454 ----
def _getclues(self, wordstream):
! mindist = options.minimum_prob_strength
! unknown = options.unknown_word_prob
clues = [] # (distance, prob, word, record) tuples
Index: weakloop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weakloop.py 10 Nov 2002 12:08:40 -0000 1.1
--- weakloop.py 11 Nov 2002 01:59:06 -0000 1.2
***************
*** 29,35 ****
default="""
[Classifier]
! robinson_probability_x = 0.5
! robinson_minimum_prob_strength = 0.1
! robinson_probability_s = 0.45
max_discriminators = 150
--- 29,35 ----
default="""
[Classifier]
! unknown_word_prob = 0.5
! minimum_prob_strength = 0.1
! unknown_word_strength = 0.45
max_discriminators = 150
***************
*** 41,47 ****
import Options
! start = (Options.options.robinson_probability_x,
! Options.options.robinson_minimum_prob_strength,
! Options.options.robinson_probability_s,
Options.options.spam_cutoff,
Options.options.ham_cutoff)
--- 41,47 ----
import Options
! start = (Options.options.unknown_word_prob,
! Options.options.minimum_prob_strength,
! Options.options.unknown_word_strength,
Options.options.spam_cutoff,
Options.options.ham_cutoff)
***************
*** 52,58 ****
f.write("""
[Classifier]
! robinson_probability_x = %.6f
! robinson_minimum_prob_strength = %.6f
! robinson_probability_s = %.6f
[TestDriver]
--- 52,58 ----
f.write("""
[Classifier]
! unknown_word_prob = %.6f
! minimum_prob_strength = %.6f
! unknown_word_strength = %.6f
[TestDriver]