[Spambayes] Upgrade problem
Just van Rossum
just@letterror.com
Thu Nov 7 23:12:45 2002
Tim Peters wrote:
> [T. Alexander Popiel]
> > Why don't we just store the counts, and only compute the probabilities
> > when we need to reference them? Yes, it is more efficient for bulk
> > testing to only compute the probabilities once, but it's definitely
> > a lose for incremental training.
>
> Unqualified judgments are always wrong <wink>. I often get email in batches
> of 200, and scoring speed is important to me -- much more so than training
> speed. It will be even more so at python.org, where training probably won't
> occur more often than once a week, but scoring is ongoing around the clock.
I think it can be done with almost no extra overhead by using a caching scheme.
This assumes (probably wrongly <wink>) that the cache stays in memory between runs.
Something like this, perhaps:
*** classifier.py Thu Nov 7 23:03:07 2002
--- classifier.py.hack Fri Nov 8 00:04:05 2002
***************
*** 456,459 ****
--- 456,460 ----
          wordinfoget = self.wordinfo.get
+         spamprobget = self.spamprobcache.get
          now = time.time()
          for word in Set(wordstream):
***************
*** 463,467 ****
              else:
                  record.atime = now
!                 prob = record.spamprob
              distance = abs(prob - 0.5)
              if distance >= mindist:
--- 464,470 ----
              else:
                  record.atime = now
!                 prob = spamprobget(word)
!                 if prob is None:
!                     prob = self.calcspamprob(word, record)
              distance = abs(prob - 0.5)
              if distance >= mindist:
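For illustration only, here is a minimal standalone sketch of the same idea: store only the counts, compute a probability the first time a word is scored, and drop the cache whenever training changes the counts. The names WordInfo, CachingClassifier, learn and spamprob are hypothetical, and the formula in calcspamprob is a simple placeholder ratio, not the calculation classifier.py actually uses. The patch above does not show where the cache is filled or invalidated, so both of those steps are assumptions here.

```python
class WordInfo:
    """Per-word record holding counts only (no stored probability)."""

    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0


class CachingClassifier:
    """Hypothetical sketch: counts are the source of truth, probabilities are cached."""

    def __init__(self):
        self.wordinfo = {}       # word -> WordInfo
        self.spamprobcache = {}  # word -> probability computed on demand
        self.nspam = 0
        self.nham = 0

    def learn(self, words, is_spam):
        # Incremental training touches only the counts, so it stays cheap.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            record = self.wordinfo.setdefault(word, WordInfo())
            if is_spam:
                record.spamcount += 1
            else:
                record.hamcount += 1
        # Assumption: every probability depends on nspam/nham, so the
        # whole cache is stale after any training step.
        self.spamprobcache.clear()

    def calcspamprob(self, word, record):
        # Placeholder formula (an assumption, not the real spambayes math).
        spamratio = record.spamcount / max(self.nspam, 1)
        hamratio = record.hamcount / max(self.nham, 1)
        prob = spamratio / (spamratio + hamratio)
        self.spamprobcache[word] = prob  # fill the cache for later lookups
        return prob

    def spamprob(self, word, unknown=0.5):
        record = self.wordinfo.get(word)
        if record is None:
            return unknown
        prob = self.spamprobcache.get(word)
        if prob is None:
            prob = self.calcspamprob(word, record)
        return prob
```

With this layout, scoring a batch of messages pays for each word's calculation once, and training between batches costs only a few counter updates plus one cache clear.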
Just