[Spambayes-checkins] spambayes Options.py,1.72.2.2,1.72.2.3 classifier.py,1.53,1.53.2.1 dbdict.py,1.1.2.1,1.1.2.2

Wed Nov 20 06:06:30 2002

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv21044

Modified Files:
      Tag: hammie-playground
	Options.py classifier.py dbdict.py 
Log Message:
* new classifier method to only update the probability of a single
  word.  I want to try using this during word reads with the dbm
  method, to see if I can make training on single messages quicker.
* s/string/boolean/ in new pop3proxy option
* dbdict ''' to """ to cope with emacs syntax highlighting bogosity
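
A minimal sketch of the first item, written for Python 3 (the diff itself targets Python 2 and uses iteritems). TinyClassifier and WordInfo are hypothetical stand-ins for the real classifier, S and X are illustrative values standing in for options.unknown_word_strength and options.unknown_word_prob, and n is simplified to hamcount + spamcount, which the code comments describe as the nham == nspam case. It only shows the shape of the refactoring: the full pass becomes a loop over a per-word method, so a caller can refresh one word without touching the rest.

```python
class WordInfo:
    """Hypothetical per-word record: counts plus the cached probability."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount
        self.spamprob = None


class TinyClassifier:
    # Illustrative values only; the real code reads these from options.
    S = 0.45   # stands in for options.unknown_word_strength
    X = 0.5    # stands in for options.unknown_word_prob

    def __init__(self):
        self.wordinfo = {}
        self.nham = 0
        self.nspam = 0

    def update_probabilities(self):
        """Full pass: recompute every word by delegating to update_word."""
        for word, record in self.wordinfo.items():
            self.update_word(word, record)

    def update_word(self, word, record):
        """Recompute the probability of a single word.

        Assumes the record has at least one nonzero count.
        """
        nham = float(self.nham or 1)
        nspam = float(self.nspam or 1)

        hamratio = record.hamcount / nham
        spamratio = record.spamcount / nspam
        prob = spamratio / (hamratio + spamratio)

        # Simplification: the real code scales the counts when
        # nham != nspam; here n is just the raw evidence count.
        n = record.hamcount + record.spamcount
        prob = (self.S * self.X + n * prob) / (self.S + n)

        if record.spamprob != prob:
            record.spamprob = prob
            # Reassign so a persistent mapping notices the change.
            self.wordinfo[word] = record


clf = TinyClassifier()
clf.nham = clf.nspam = 10
clf.wordinfo["viagra"] = WordInfo(spamcount=5, hamcount=0)
clf.wordinfo["hello"] = WordInfo(spamcount=1, hamcount=4)

# Refresh one word only; "hello" is left untouched.
clf.update_word("viagra", clf.wordinfo["viagra"])
```

Whether this makes single-message training quicker with the dbm store is what the log message sets out to test; the sketch does not measure that.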

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.72.2.2
retrieving revision 1.72.2.3
diff -C2 -d -r1.72.2.2 -r1.72.2.3
*** Options.py	20 Nov 2002 05:04:03 -0000	1.72.2.2
--- Options.py	20 Nov 2002 06:06:27 -0000	1.72.2.3
***************
*** 353,357 ****
  # a pickle (quick to train on huge amounts of messages). Set this to
  # True to use a database by default.
! hammiefilter_persistent_use_database: False

  [pop3proxy]
--- 353,357 ----
  # a pickle (quick to train on huge amounts of messages). Set this to
  # True to use a database by default.
! hammiefilter_persistent_use_database: True

  [pop3proxy]
***************
*** 454,458 ****
                    'pop3proxy_ham_cache': string_cracker,
                    'pop3proxy_unknown_cache': string_cracker,
!                   'pop3proxy_persistent_use_database': string_cracker,
                    'pop3proxy_persistent_storage_file': string_cracker,
                    },
--- 454,458 ----
                    'pop3proxy_ham_cache': string_cracker,
                    'pop3proxy_unknown_cache': string_cracker,
!                   'pop3proxy_persistent_use_database': boolean_cracker,
                    'pop3proxy_persistent_storage_file': string_cracker,
                    },
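
The string-to-boolean change above matters because a value read as a plain string is truthy whenever it is non-empty, including the text "False". The sketch below illustrates this with the standard library configparser in Python 3; it is an assumption that the option is read through something similar, and the section and option names are simply copied from the diff.

```python
import configparser

cfg = configparser.ConfigParser()
cfg.read_string("""
[pop3proxy]
pop3proxy_persistent_use_database: False
""")

# Read as a string: the result is the text "False", which is truthy.
as_string = cfg.get("pop3proxy", "pop3proxy_persistent_use_database")

# Read as a boolean: the text is parsed into a real bool.
as_bool = cfg.getboolean("pop3proxy", "pop3proxy_persistent_use_database")

print(repr(as_string), bool(as_string), as_bool)
```

With the string reading, a test such as `if value:` would take the database branch even when the file says False.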

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53
retrieving revision 1.53.2.1
diff -C2 -d -r1.53 -r1.53.2.1
*** classifier.py	18 Nov 2002 18:23:09 -0000	1.53
--- classifier.py	20 Nov 2002 06:06:28 -0000	1.53.2.1
***************
*** 319,322 ****
--- 319,334 ----
          """

+         for word, record in self.wordinfo.iteritems():
+             self.update_word(word, record)
+                 
+     def update_word(self, word, record):
+         """Compute p(word) = prob(msg is spam | msg contains word).
+         
+         This is the Graham calculation, but stripped of biases, and
+         stripped of clamping into 0.01 thru 0.99.  The Bayesian
+         adjustment following keeps them in a sane range, and one
+         that naturally grows the more evidence there is to back up
+         a probability.
+         """
          nham = float(self.nham or 1)
          nspam = float(self.nspam or 1)
***************
*** 330,393 ****
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob

!         for word, record in self.wordinfo.iteritems():
!             # Compute p(word) = prob(msg is spam | msg contains word).
!             # This is the Graham calculation, but stripped of biases, and
!             # stripped of clamping into 0.01 thru 0.99.  The Bayesian
!             # adjustment following keeps them in a sane range, and one
!             # that naturally grows the more evidence there is to back up
!             # a probability.
!             hamcount = record.hamcount
!             assert hamcount <= nham
!             hamratio = hamcount / nham
! 
!             spamcount = record.spamcount
!             assert spamcount <= nspam
!             spamratio = spamcount / nspam

!             prob = spamratio / (hamratio + spamratio)

!             # Now do Robinson's Bayesian adjustment.
!             #
!             #         s*x + n*p(w)
!             # f(w) = --------------
!             #           s + n
!             #
!             # I find this easier to reason about like so (equivalent when
!             # s != 0):
!             #
!             #        x - p
!             #  p +  -------
!             #       1 + n/s
!             #
!             # IOW, it moves p a fraction of the distance from p to x, and
!             # less so the larger n is, or the smaller s is.

!             # Experimental:
!             # Picking a good value for n is interesting:  how much empirical
!             # evidence do we really have?  If nham == nspam,
!             # hamcount + spamcount makes a lot of sense, and the code here
!             # does that by default.
!             # But if, e.g., nham is much larger than nspam, p(w) can get a
!             # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!             # strong ham words (high hamcount) much stronger than strong
!             # spam words (high spamcount), and that makes the accidental
!             # appearance of a strong ham word in spam much more damaging than
!             # the accidental appearance of a strong spam word in ham.
!             # So we don't give hamcount full credit when nham > nspam (or
!             # spamcount when nspam > nham):  instead we knock hamcount down
!             # to what it would have been had nham been equal to nspam.  IOW,
!             # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!             # we don't "believe" any count to an extent more than
!             # min(nspam, nham) justifies.

!             n = hamcount * spam2ham  +  spamcount * ham2spam
!             prob = (StimesX + n * prob) / (S + n)

!             if record.spamprob != prob:
!                 record.spamprob = prob
!                 # The next seemingly pointless line appears to be a hack
!                 # to allow a persistent db to realize the record has changed.
!                 self.wordinfo[word] = record

      def clearjunk(self, oldesttime):
--- 342,398 ----
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
+                 
+         hamcount = record.hamcount
+         assert hamcount <= nham
+         hamratio = hamcount / nham

!         spamcount = record.spamcount
!         assert spamcount <= nspam
!         spamratio = spamcount / nspam

!         prob = spamratio / (hamratio + spamratio)

!         # Now do Robinson's Bayesian adjustment.
!         #
!         #         s*x + n*p(w)
!         # f(w) = --------------
!         #           s + n
!         #
!         # I find this easier to reason about like so (equivalent when
!         # s != 0):
!         #
!         #        x - p
!         #  p +  -------
!         #       1 + n/s
!         #
!         # IOW, it moves p a fraction of the distance from p to x, and
!         # less so the larger n is, or the smaller s is.

!         # Experimental:
!         # Picking a good value for n is interesting:  how much empirical
!         # evidence do we really have?  If nham == nspam,
!         # hamcount + spamcount makes a lot of sense, and the code here
!         # does that by default.
!         # But if, e.g., nham is much larger than nspam, p(w) can get a
!         # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!         # strong ham words (high hamcount) much stronger than strong
!         # spam words (high spamcount), and that makes the accidental
!         # appearance of a strong ham word in spam much more damaging than
!         # the accidental appearance of a strong spam word in ham.
!         # So we don't give hamcount full credit when nham > nspam (or
!         # spamcount when nspam > nham):  instead we knock hamcount down
!         # to what it would have been had nham been equal to nspam.  IOW,
!         # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!         # we don't "believe" any count to an extent more than
!         # min(nspam, nham) justifies.

!         n = hamcount * spam2ham  +  spamcount * ham2spam
!         prob = (StimesX + n * prob) / (S + n)

!         if record.spamprob != prob:
!             record.spamprob = prob
!             # The next seemingly pointless line appears to be a hack
!             # to allow a persistent db to realize the record has changed.
!             self.wordinfo[word] = record

      def clearjunk(self, oldesttime):

Index: dbdict.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/dbdict.py,v
retrieving revision 1.1.2.1
retrieving revision 1.1.2.2
diff -C2 -d -r1.1.2.1 -r1.1.2.2
*** dbdict.py	20 Nov 2002 04:28:34 -0000	1.1.2.1
--- dbdict.py	20 Nov 2002 06:06:28 -0000	1.1.2.2
***************
*** 1,5 ****
  #! /usr/bin/env python

! '''DBDict.py - Dictionary access to dbhash

  Classes:
--- 1,5 ----
  #! /usr/bin/env python

! """DBDict.py - Dictionary access to dbhash

  Classes:
***************
*** 42,46 ****

  To Do:
!     '''

  # This module is part of the spambayes project, which is Copyright 2002
--- 42,46 ----

  To Do:
!     """

  # This module is part of the spambayes project, which is Copyright 2002
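
The comments in update_word in the classifier.py diff above give two forms of Robinson's adjustment and describe how n is discounted when nham and nspam differ. The Python 3 sketch below checks numerically that the two forms agree. The function names are hypothetical, and effective_n is one reading of the comment ("multiply hamcount by nspam/nham when nspam < nham"), not a copy of the spam2ham and ham2spam code, which lies outside the hunks shown.

```python
import math


def robinson_direct(p, n, s, x):
    """f(w) = (s*x + n*p) / (s + n)."""
    return (s * x + n * p) / (s + n)


def robinson_shift(p, n, s, x):
    """Equivalent form for s != 0: move p part of the way toward x."""
    return p + (x - p) / (1.0 + n / s)


def effective_n(hamcount, spamcount, nham, nspam):
    """Assumed reading of the comment: discount the larger corpus.

    Each count is scaled down to what it would have been had the two
    corpora been the same size; neither count is ever scaled up.
    """
    spam2ham = min(nspam / nham, 1.0)
    ham2spam = min(nham / nspam, 1.0)
    return hamcount * spam2ham + spamcount * ham2spam


s, x = 0.45, 0.5   # illustrative strength and unknown-word probability

for p in (0.0, 0.2, 0.9, 1.0):
    for n in (0, 1, 5, 100):
        a = robinson_direct(p, n, s, x)
        b = robinson_shift(p, n, s, x)
        assert math.isclose(a, b, abs_tol=1e-12)

# With ten times as much ham as spam, hamcount gets one tenth credit.
n = effective_n(hamcount=50, spamcount=5, nham=100, nspam=10)
print(n, robinson_direct(0.5, n, s, x))
```

With n = 0 both forms return x, and as n grows the result approaches p, which matches the comment that p moves less toward x the larger n is.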