[Spambayes-checkins]
spambayes Options.py,1.72.2.2,1.72.2.3 classifier.py,1.53,1.53.2.1
dbdict.py,1.1.2.1,1.1.2.2
Neale Pickett
npickett@users.sourceforge.net
Wed Nov 20 06:06:30 2002
Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv21044
Modified Files:
Tag: hammie-playground
Options.py classifier.py dbdict.py
Log Message:
* new classifier method to only update the probability of a single
word. I want to try using this during word reads with the dbm
method, to see if I can make training on single messages quicker.
* s/string/boolean/ in new pop3proxy option
* dbdict ''' to """ to cope with emacs syntax highlighting bogosity
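The classifier change below is a method extraction: the per-word body of
update_probabilities() moves into a new update_word() method, so a dbm-backed
store can recompute one word's spamprob without walking the whole wordinfo
mapping. A minimal sketch of that shape (not the project's actual class;
class names and the constant values for the unknown-word options are
illustrative assumptions, and the evidence count n is simplified to
hamcount + spamcount):

```python
S = 0.45  # assumed stand-in for options.unknown_word_strength
X = 0.5   # assumed stand-in for options.unknown_word_prob

class WordInfo:
    def __init__(self, hamcount=0, spamcount=0):
        self.hamcount = hamcount
        self.spamcount = spamcount
        self.spamprob = X

class ClassifierSketch:
    def __init__(self, nham, nspam):
        self.nham, self.nspam = nham, nspam
        self.wordinfo = {}

    def update_probabilities(self):
        # The old loop body now just delegates to update_word().
        for word, record in self.wordinfo.items():
            self.update_word(word, record)

    def update_word(self, word, record):
        """Graham estimate, then Robinson's Bayesian adjustment."""
        nham = float(self.nham or 1)
        nspam = float(self.nspam or 1)
        hamratio = record.hamcount / nham
        spamratio = record.spamcount / nspam
        prob = spamratio / (hamratio + spamratio)
        # Simplified evidence count; the real code scales the counts
        # by nspam/nham (or nham/nspam) when the corpora are unbalanced.
        n = record.hamcount + record.spamcount
        prob = (S * X + n * prob) / (S + n)
        if record.spamprob != prob:
            record.spamprob = prob
            # Reassign so a persistent mapping notices the change.
            self.wordinfo[word] = record
```

The payoff is that a caller holding one (word, record) pair can refresh just
that record at read time instead of paying for a full retraining pass.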
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.72.2.2
retrieving revision 1.72.2.3
diff -C2 -d -r1.72.2.2 -r1.72.2.3
*** Options.py 20 Nov 2002 05:04:03 -0000 1.72.2.2
--- Options.py 20 Nov 2002 06:06:27 -0000 1.72.2.3
***************
*** 353,357 ****
# a pickle (quick to train on huge amounts of messages). Set this to
# True to use a database by default.
! hammiefilter_persistent_use_database: False
[pop3proxy]
--- 353,357 ----
# a pickle (quick to train on huge amounts of messages). Set this to
# True to use a database by default.
! hammiefilter_persistent_use_database: True
[pop3proxy]
***************
*** 454,458 ****
'pop3proxy_ham_cache': string_cracker,
'pop3proxy_unknown_cache': string_cracker,
! 'pop3proxy_persistent_use_database': string_cracker,
'pop3proxy_persistent_storage_file': string_cracker,
},
--- 454,458 ----
'pop3proxy_ham_cache': string_cracker,
'pop3proxy_unknown_cache': string_cracker,
! 'pop3proxy_persistent_use_database': boolean_cracker,
'pop3proxy_persistent_storage_file': string_cracker,
},
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53
retrieving revision 1.53.2.1
diff -C2 -d -r1.53 -r1.53.2.1
*** classifier.py 18 Nov 2002 18:23:09 -0000 1.53
--- classifier.py 20 Nov 2002 06:06:28 -0000 1.53.2.1
***************
*** 319,322 ****
--- 319,334 ----
"""
+ for word, record in self.wordinfo.iteritems():
+ self.update_word(word, record)
+
+ def update_word(self, word, record):
+ """Compute p(word) = prob(msg is spam | msg contains word).
+
+ This is the Graham calculation, but stripped of biases, and
+ stripped of clamping into 0.01 thru 0.99. The Bayesian
+ adjustment following keeps them in a sane range, and one
+ that naturally grows the more evidence there is to back up
+ a probability.
+ """
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
***************
*** 330,393 ****
S = options.unknown_word_strength
StimesX = S * options.unknown_word_prob
! for word, record in self.wordinfo.iteritems():
! # Compute p(word) = prob(msg is spam | msg contains word).
! # This is the Graham calculation, but stripped of biases, and
! # stripped of clamping into 0.01 thru 0.99. The Bayesian
! # adjustment following keeps them in a sane range, and one
! # that naturally grows the more evidence there is to back up
! # a probability.
! hamcount = record.hamcount
! assert hamcount <= nham
! hamratio = hamcount / nham
!
! spamcount = record.spamcount
! assert spamcount <= nspam
! spamratio = spamcount / nspam
! prob = spamratio / (hamratio + spamratio)
! # Now do Robinson's Bayesian adjustment.
! #
! # s*x + n*p(w)
! # f(w) = --------------
! # s + n
! #
! # I find this easier to reason about like so (equivalent when
! # s != 0):
! #
! # x - p
! # p + -------
! # 1 + n/s
! #
! # IOW, it moves p a fraction of the distance from p to x, and
! # less so the larger n is, or the smaller s is.
! # Experimental:
! # Picking a good value for n is interesting: how much empirical
! # evidence do we really have? If nham == nspam,
! # hamcount + spamcount makes a lot of sense, and the code here
! # does that by default.
! # But if, e.g., nham is much larger than nspam, p(w) can get a
! # lot closer to 0.0 than it can get to 1.0. That in turn makes
! # strong ham words (high hamcount) much stronger than strong
! # spam words (high spamcount), and that makes the accidental
! # appearance of a strong ham word in spam much more damaging than
! # the accidental appearance of a strong spam word in ham.
! # So we don't give hamcount full credit when nham > nspam (or
! # spamcount when nspam > nham): instead we knock hamcount down
! # to what it would have been had nham been equal to nspam. IOW,
! # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
! # we don't "believe" any count to an extent more than
! # min(nspam, nham) justifies.
! n = hamcount * spam2ham + spamcount * ham2spam
! prob = (StimesX + n * prob) / (S + n)
! if record.spamprob != prob:
! record.spamprob = prob
! # The next seemingly pointless line appears to be a hack
! # to allow a persistent db to realize the record has changed.
! self.wordinfo[word] = record
def clearjunk(self, oldesttime):
--- 342,398 ----
S = options.unknown_word_strength
StimesX = S * options.unknown_word_prob
+
+ hamcount = record.hamcount
+ assert hamcount <= nham
+ hamratio = hamcount / nham
! spamcount = record.spamcount
! assert spamcount <= nspam
! spamratio = spamcount / nspam
! prob = spamratio / (hamratio + spamratio)
! # Now do Robinson's Bayesian adjustment.
! #
! # s*x + n*p(w)
! # f(w) = --------------
! # s + n
! #
! # I find this easier to reason about like so (equivalent when
! # s != 0):
! #
! # x - p
! # p + -------
! # 1 + n/s
! #
! # IOW, it moves p a fraction of the distance from p to x, and
! # less so the larger n is, or the smaller s is.
! # Experimental:
! # Picking a good value for n is interesting: how much empirical
! # evidence do we really have? If nham == nspam,
! # hamcount + spamcount makes a lot of sense, and the code here
! # does that by default.
! # But if, e.g., nham is much larger than nspam, p(w) can get a
! # lot closer to 0.0 than it can get to 1.0. That in turn makes
! # strong ham words (high hamcount) much stronger than strong
! # spam words (high spamcount), and that makes the accidental
! # appearance of a strong ham word in spam much more damaging than
! # the accidental appearance of a strong spam word in ham.
! # So we don't give hamcount full credit when nham > nspam (or
! # spamcount when nspam > nham): instead we knock hamcount down
! # to what it would have been had nham been equal to nspam. IOW,
! # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
! # we don't "believe" any count to an extent more than
! # min(nspam, nham) justifies.
! n = hamcount * spam2ham + spamcount * ham2spam
! prob = (StimesX + n * prob) / (S + n)
! if record.spamprob != prob:
! record.spamprob = prob
! # The next seemingly pointless line appears to be a hack
! # to allow a persistent db to realize the record has changed.
! self.wordinfo[word] = record
def clearjunk(self, oldesttime):
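The comment block in the diff above claims the two forms of Robinson's
adjustment are equivalent when s != 0. A quick numeric check of that claim
(illustrative code, not part of the project):

```python
# Check that (s*x + n*p) / (s + n)  ==  p + (x - p) / (1 + n/s)
# across a range of evidence counts n and raw probabilities p.
def robinson_fraction(s, x, n, p):
    return (s * x + n * p) / (s + n)

def robinson_shifted(s, x, n, p):
    return p + (x - p) / (1 + n / s)

for n in (0, 1, 5, 100):
    for p in (0.0, 0.25, 0.9):
        a = robinson_fraction(0.45, 0.5, n, p)
        b = robinson_shifted(0.45, 0.5, n, p)
        assert abs(a - b) < 1e-12
```

The second form also makes the comment's reading direct: with n = 0 the
result is exactly x, and growing n shrinks the step from p toward x.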
Index: dbdict.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/dbdict.py,v
retrieving revision 1.1.2.1
retrieving revision 1.1.2.2
diff -C2 -d -r1.1.2.1 -r1.1.2.2
*** dbdict.py 20 Nov 2002 04:28:34 -0000 1.1.2.1
--- dbdict.py 20 Nov 2002 06:06:28 -0000 1.1.2.2
***************
*** 1,5 ****
#! /usr/bin/env python
! '''DBDict.py - Dictionary access to dbhash
Classes:
--- 1,5 ----
#! /usr/bin/env python
! """DBDict.py - Dictionary access to dbhash
Classes:
***************
*** 42,46 ****
To Do:
! '''
# This module is part of the spambayes project, which is Copyright 2002
--- 42,46 ----
To Do:
! """
# This module is part of the spambayes project, which is Copyright 2002