[Python-checkins] python/nondist/sandbox/spambayes classifier.py,1.5,1.6

tim_one@users.sourceforge.net
Sat, 31 Aug 2002 17:05:43 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6170

Modified Files:
	classifier.py 
Log Message:
spamprob():  Count each unique word at most once.  Counting a word up
to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but now
that UNKNOWN_SPAMPROB is 0.5 it's a small drag instead.
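
To make the behavioral difference concrete, here is a minimal standalone
sketch of the two filtering policies.  It is not code from classifier.py:
the helper name capped_words, its max_count parameter, and the sample
stream are all hypothetical, chosen only for illustration.  A cap of 2
mimics the old loop; a cap of 1 mimics the new one.

```python
def capped_words(wordstream, max_count):
    """Yield each word at most max_count times, in stream order.

    Hypothetical helper for illustration; spamprob() does this
    filtering inline rather than through a generator.
    """
    seen = {}  # word -> number of times it has been let through
    for word in wordstream:
        count = seen.get(word, 0) + 1
        if count > max_count:
            continue  # this word has already used up its quota
        seen[word] = count
        yield word


# Made-up sample stream, not taken from any real message.
stream = ["free", "money", "free", "free", "python", "money"]

twice = list(capped_words(stream, 2))  # old policy: at most two times
once = list(capped_words(stream, 1))   # new policy: at most once
```

With this stream, the old policy still lets "free" and "money" through a
second time, while the new policy passes each distinct word exactly once.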


Index: classifier.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/classifier.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** classifier.py	31 Aug 2002 20:47:28 -0000	1.5
--- classifier.py	1 Sep 2002 00:05:41 -0000	1.6
***************
*** 126,135 ****
          smallest_best = -1.0
  
!         # Counting unique words multiple times has some benefit, but not
!         # counting them an unbounded number of times (then one unlucky
!         # repetition can be the entire score!).  We count a word at most
!         # two times.
!         word2count = {}
!         word2countget = word2count.get
  
          wordinfoget = self.wordinfo.get
--- 126,134 ----
          smallest_best = -1.0
  
!         # Counting a unique word multiple times hurts, although counting one
!         # at most two times had some benefit when UNKNOWN_SPAMPROB was 0.2.
!         # When that got boosted to 0.5, counting more than once became
!         # counterproductive.
!         unique_words = {}
  
          wordinfoget = self.wordinfo.get
***************
*** 137,144 ****
  
          for word in wordstream:
!             count = word2countget(word, 0) + 1
!             if count > 2:
                  continue
!             word2count[word] = count
  
              record = wordinfoget(word)
--- 136,142 ----
  
          for word in wordstream:
!             if word in unique_words:
                  continue
!             unique_words[word] = 1
  
              record = wordinfoget(word)
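
The new loop uses a dict purely as a membership table (every value is 1).
In Python versions that have the built-in set type (2.4 and later), the
same first-occurrence filter can be written as below.  This is a sketch
only: the function name first_occurrences is hypothetical, and the real
loop goes on to look up each word's record rather than yielding it.

```python
def first_occurrences(wordstream):
    """Yield each distinct word once, at its first position in the stream.

    Hypothetical stand-in for the dedup step at the top of the loop.
    """
    seen = set()
    for word in wordstream:
        if word in seen:
            continue  # already handled this word
        seen.add(word)
        yield word


# Made-up sample input for illustration.
result = list(first_occurrences(["a", "b", "a", "c", "b"]))
```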