[Python-checkins] python/nondist/sandbox/spambayes classifier.py,1.5,1.6
tim_one@users.sourceforge.net
Sat, 31 Aug 2002 17:05:43 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6170
Modified Files:
classifier.py
Log Message:
spamprob(): Never count unique words more than once anymore. Counting
up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
that's now a small drag instead.
Index: classifier.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/classifier.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** classifier.py 31 Aug 2002 20:47:28 -0000 1.5
--- classifier.py 1 Sep 2002 00:05:41 -0000 1.6
***************
*** 126,135 ****
smallest_best = -1.0
! # Counting unique words multiple times has some benefit, but not
! # counting them an unbounded number of times (then one unlucky
! # repetition can be the entire score!). We count a word at most
! # two times.
! word2count = {}
! word2countget = word2count.get
wordinfoget = self.wordinfo.get
--- 126,134 ----
smallest_best = -1.0
! # Counting a unique word multiple times hurts, although counting one
! # at most two times had some benefit when UNKNOWN_SPAMPROB was 0.2.
! # When that got boosted to 0.5, counting more than once became
! # counterproductive.
! unique_words = {}
wordinfoget = self.wordinfo.get
***************
*** 137,144 ****
for word in wordstream:
! count = word2countget(word, 0) + 1
! if count > 2:
continue
! word2count[word] = count
record = wordinfoget(word)
--- 136,142 ----
for word in wordstream:
! if word in unique_words:
continue
! unique_words[word] = 1
record = wordinfoget(word)
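
For readers comparing the two revisions, the change in counting behavior can be sketched in isolation. This is only an illustration, not the code from classifier.py: the helper names cap_occurrences and unique_only and the sample token list are hypothetical, and the sketch uses a set where the 2002 code used a dict.

```python
def cap_occurrences(wordstream, max_count):
    """Yield each word at most max_count times, in stream order.

    Mirrors the r1.5 logic with max_count=2 (hypothetical helper).
    """
    seen = {}
    for word in wordstream:
        count = seen.get(word, 0) + 1
        if count > max_count:
            continue  # word already counted max_count times
        seen[word] = count
        yield word


def unique_only(wordstream):
    """Yield each word only the first time it appears.

    Mirrors the r1.6 logic (hypothetical helper; uses a set, not a dict).
    """
    seen = set()
    for word in wordstream:
        if word in seen:
            continue  # repeated word is skipped entirely
        seen.add(word)
        yield word


# Made-up sample stream, for illustration only.
tokens = ["free", "money", "free", "free", "now"]
old_behavior = list(cap_occurrences(tokens, 2))
new_behavior = list(unique_only(tokens))
```

With this sample, the r1.5-style cap lets "free" through twice, while the r1.6-style filter lets it through once. Note that cap_occurrences(tokens, 1) yields the same words as unique_only(tokens), so the new revision is equivalent to lowering the cap from two to one.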