[Spambayes-checkins] spambayes Options.py,1.70,1.71 classifier.py,1.50,1.51
Tim Peters
tim_one@projects.sourceforge.net
Mon Nov 18 01:40:06 2002
Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv24664
Modified Files:
Options.py classifier.py
Log Message:
Added option experimental_ham_spam_imbalance_adjustment. Please test!
Especially if you train on a lot more ham than spam (or vice versa).
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.70
retrieving revision 1.71
diff -C2 -d -r1.70 -r1.71
*** Options.py 13 Nov 2002 18:14:32 -0000 1.70
--- Options.py 18 Nov 2002 01:40:03 -0000 1.71
***************
*** 298,301 ****
--- 298,315 ----
use_chi_squared_combining: True
+ # If the # of ham and spam in training data are out of balance, the
+ # spamprob guesses can get stronger in the direction of the category with
+ # more training msgs. In one sense this must be so, since the more data
+ # we have of one flavor, the more we know about that flavor. But that
+ # allows the accidental appearance of a strong word of that flavor in a msg
+ # of the other flavor much more power than an accident in the other
+ # direction. Enable experimental_ham_spam_imbalance_adjustment if you have
+ # more ham than spam training data (or more spam than ham), and the
+ # Bayesian probability adjustment won't 'believe' raw counts more than
+ # min(# ham trained on, # spam trained on) justifies. I *expect* this
+ # option will go away (and become the default), but people *with* strong
+ # imbalance need to test it first.
+ experimental_ham_spam_imbalance_adjustment: False
+
[Hammie]
# The name of the header that hammie adds to an E-mail in filter mode
***************
*** 410,414 ****
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
! },
'Hammie': {'hammie_header_name': string_cracker,
'persistent_storage_file': string_cracker,
--- 424,429 ----
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
! 'experimental_ham_spam_imbalance_adjustment': boolean_cracker,
! },
'Hammie': {'hammie_header_name': string_cracker,
'persistent_storage_file': string_cracker,
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.50
retrieving revision 1.51
diff -C2 -d -r1.50 -r1.51
*** classifier.py 11 Nov 2002 01:59:06 -0000 1.50
--- classifier.py 18 Nov 2002 01:40:04 -0000 1.51
***************
*** 322,330 ****
nspam = float(self.nspam or 1)
S = options.unknown_word_strength
StimesX = S * options.unknown_word_prob
for word, record in self.wordinfo.iteritems():
! # Compute prob(msg is spam | msg contains word).
# This is the Graham calculation, but stripped of biases, and
# stripped of clamping into 0.01 thru 0.99. The Bayesian
--- 322,336 ----
nspam = float(self.nspam or 1)
+ if options.experimental_ham_spam_imbalance_adjustment:
+     spam2ham = min(nspam / nham, 1.0)
+     ham2spam = min(nham / nspam, 1.0)
+ else:
+     spam2ham = ham2spam = 1.0
+
S = options.unknown_word_strength
StimesX = S * options.unknown_word_prob
for word, record in self.wordinfo.iteritems():
! # Compute p(word) = prob(msg is spam | msg contains word).
# This is the Graham calculation, but stripped of biases, and
# stripped of clamping into 0.01 thru 0.99. The Bayesian
***************
*** 358,362 ****
# less so the larger n is, or the smaller s is.
! n = hamcount + spamcount
prob = (StimesX + n * prob) / (S + n)
--- 364,386 ----
# less so the larger n is, or the smaller s is.
! # Experimental:
! # Picking a good value for n is interesting: how much empirical
! # evidence do we really have? If nham == nspam,
! # hamcount + spamcount makes a lot of sense, and the code here
! # does that by default.
! # But if, e.g., nham is much larger than nspam, p(w) can get a
! # lot closer to 0.0 than it can get to 1.0. That in turn makes
! # strong ham words (high hamcount) much stronger than strong
! # spam words (high spamcount), and that makes the accidental
! # appearance of a strong ham word in spam much more damaging than
! # the accidental appearance of a strong spam word in ham.
! # So we don't give hamcount full credit when nham > nspam (or
! # spamcount when nspam > nham): instead we knock hamcount down
! # to what it would have been had nham been equal to nspam. IOW,
! # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
! # we don't "believe" any count to an extent more than
! # min(nspam, nham) justifies.
!
! n = hamcount * spam2ham + spamcount * ham2spam
prob = (StimesX + n * prob) / (S + n)
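
The effect of the adjusted n can be seen in a small standalone sketch. This is not the classifier.py code itself: the function name smoothed_spamprob and its adjust flag are hypothetical, the raw per-word ratio is an assumption (the diff elides that part of the loop), and the S and X defaults are placeholder values standing in for unknown_word_strength and unknown_word_prob. Only the spam2ham/ham2spam scaling and the final smoothing line are taken from the diff above.

```python
def smoothed_spamprob(hamcount, spamcount, nham, nspam,
                      S=0.45, X=0.5, adjust=True):
    """Illustrative per-word spam probability (hypothetical helper).

    S and X stand in for unknown_word_strength and unknown_word_prob;
    the default values here are placeholders, not taken from the diff.
    """
    nham = float(nham or 1)
    nspam = float(nspam or 1)

    # From the diff: scale down the counts of the over-represented category.
    if adjust:
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    # Assumption: raw estimate of prob(msg is spam | msg contains word),
    # computed from per-category frequencies. The diff does not show this step.
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    if hamratio + spamratio == 0.0:
        return X  # word never seen: fall back to the prior
    prob = spamratio / (hamratio + spamratio)

    # From the diff: n measures how much evidence we claim to have.
    n = hamcount * spam2ham + spamcount * ham2spam
    return (S * X + n * prob) / (S + n)


# Ten times more ham than spam in the training data.
nham, nspam = 1000, 100

# A word seen in 10 ham and 0 spam.
print(smoothed_spamprob(10, 0, nham, nspam, adjust=False))
print(smoothed_spamprob(10, 0, nham, nspam, adjust=True))

# A word seen in 0 ham and 10 spam.
print(smoothed_spamprob(0, 10, nham, nspam, adjust=False))
print(smoothed_spamprob(0, 10, nham, nspam, adjust=True))
```

Under these assumptions the ham word moves from about 0.02 without the adjustment to about 0.16 with it, because its hamcount of 10 is credited as only 1. The spam word is unchanged, since spam is the smaller category and ham2spam is capped at 1.0.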