[Spambayes] Upgrade problem

Just van Rossum just@letterror.com
Thu Nov 7 23:12:45 2002


Tim Peters wrote:

> [T. Alexander Popiel]
> > Why don't we just store the counts, and only compute the probabilities
> > when we need to reference them?  Yes, it is more efficient for bulk
> > testing to only compute the probabilities once, but it's definitely
> > a lose for incremental training.
> 
> Unqualified judgments are always wrong <wink>.  I often get email in batches
> of 200, and scoring speed is important to me -- much more so than training
> speed.  It will be even more so at python.org, where training probably won't
> occur more often than once a week, but scoring is ongoing around the clock.

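I think a caching scheme could do this with almost no extra overhead: keep only
the counts in the word records, and look the probabilities up in a separate cache
that is filled on demand. This assumes (probably wrongly <wink>) that the cache
stays in memory between runs. Something like this perhaps: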

*** classifier.py   Thu Nov  7 23:03:07 2002
--- classifier.py.hack  Fri Nov  8 00:04:05 2002
***************
*** 456,459 ****
--- 456,460 ----
  
          wordinfoget = self.wordinfo.get
+         spamprobget = self.spamprobcache.get
          now = time.time()
          for word in Set(wordstream):
***************
*** 463,467 ****
              else:
                  record.atime = now
!                 prob = record.spamprob
              distance = abs(prob - 0.5)
              if distance >= mindist:
--- 464,470 ----
              else:
                  record.atime = now
!                 prob = spamprobget(word)
!                 if prob is None:
!                     prob = self.calcspamprob(word, record)
              distance = abs(prob - 0.5)
              if distance >= mindist:
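
For illustration, here is a minimal self-contained sketch of the same idea, written in present-day Python rather than against the 2002 classifier.py. Only `wordinfo`, `spamprobcache` and `calcspamprob` echo names from the patch; `WordRecord`, `LazyClassifier`, `learn`, `spamprob` and the simple ratio formula are hypothetical placeholders, not what Spambayes actually computes. The sketch also assumes two things the patch does not show: that `calcspamprob` stores its result in the cache, and that training throws the cache away.

```python
class WordRecord:
    """Per-word training counts only; no stored probability (hypothetical)."""
    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0


class LazyClassifier:
    """Hypothetical stand-in for the real classifier, showing only the cache."""
    def __init__(self):
        self.wordinfo = {}        # word -> WordRecord (counts)
        self.spamprobcache = {}   # word -> probability, filled on demand
        self.nspam = 0
        self.nham = 0

    def learn(self, words, is_spam):
        # Incremental training touches counts only, so it stays cheap.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            record = self.wordinfo.setdefault(word, WordRecord())
            if is_spam:
                record.spamcount += 1
            else:
                record.hamcount += 1
        # The message totals changed, so every cached probability is stale.
        self.spamprobcache.clear()

    def calcspamprob(self, word, record):
        # Placeholder formula (a plain ratio of frequencies), NOT the
        # Spambayes calculation. Assumption: the result is cached here.
        spamratio = record.spamcount / self.nspam if self.nspam else 0.0
        hamratio = record.hamcount / self.nham if self.nham else 0.0
        total = spamratio + hamratio
        prob = spamratio / total if total else 0.5
        self.spamprobcache[word] = prob
        return prob

    def spamprob(self, word):
        record = self.wordinfo.get(word)
        if record is None:
            return 0.5            # unknown word: neutral
        prob = self.spamprobcache.get(word)
        if prob is None:          # cache miss: compute once, reuse afterwards
            prob = self.calcspamprob(word, record)
        return prob
```

With this shape, scoring a batch pays for each word's probability at most once between training runs, while training itself never recomputes anything. Clearing the whole cache on every `learn` call is the bluntest possible invalidation; it is used here only because the message totals feed into every word's probability in this placeholder formula.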


Just


