[Spambayes] Upgrade problem
Just van Rossum
just@letterror.com
Fri Nov 8 07:54:04 2002
Tim Peters wrote:
> [Just van Rossum]
> > I think it can be done with almost no extra overhead with a
> > caching scheme. This assumes (probably wrongly <wink>) that
> > the cache stays in memory between runs.
> > Something like this perhaps:
> >
> > *** classifier.py Thu Nov 7 23:03:07 2002
> > --- classifier.py.hack Fri Nov 8 00:04:05 2002
> > ***************
> > *** 456,459 ****
> > --- 456,460 ----
> >
> > wordinfoget = self.wordinfo.get
> > + spamprobget = self.spamprobcache.get
> > now = time.time()
> > for word in Set(wordstream):
> > ***************
> > *** 463,467 ****
> > else:
> > record.atime = now
> > ! prob = record.spamprob
> > distance = abs(prob - 0.5)
> > if distance >= mindist:
> > --- 464,470 ----
> > else:
> > record.atime = now
> > ! prob = spamprobget(word)
> > ! if prob is None:
> > ! prob = self.calcspamprob(word, record)
> > distance = abs(prob - 0.5)
> > if distance >= mindist:
>
> Sorry, I don't know what this is trying to accomplish. Like, what is
> self.spamprobcache? There's no such thing now, and the patch doesn't appear
> to create one (i.e., this code doesn't run).
Tim, don't be such a programmer <wink>. But ok, I promise I'll never post
pseudocode as a patch again...
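For the record, a runnable sketch of what that pseudocode patch was aiming at might look like the following. This is not the real classifier.py: `spamprobcache`, `calcspamprob`, and the `Record` class are names taken from (or invented to support) the pseudocode, and the probability lookup is simplified.

```python
class Record:
    """Minimal stand-in for classifier.py's per-word info record."""
    def __init__(self, spamprob):
        self.spamprob = spamprob


class CachingClassifier:
    """Sketch of the caching idea: memoize each word's spamprob in a
    dict so repeated scoring runs skip recomputation."""

    def __init__(self):
        self.wordinfo = {}        # word -> Record
        self.spamprobcache = {}   # word -> cached spamprob float

    def calcspamprob(self, word, record):
        # Stand-in for the real probability computation; caches the result.
        prob = record.spamprob
        self.spamprobcache[word] = prob
        return prob

    def score_word(self, word):
        record = self.wordinfo.get(word)
        if record is None:
            return 0.5  # unknown word: neutral probability
        # This is the two-step lookup the patch sketched: try the cache
        # first, fall back to computing (and caching) the probability.
        prob = self.spamprobcache.get(word)
        if prob is None:
            prob = self.calcspamprob(word, record)
        return prob
```

As Tim notes below, the cost of this scheme is that the cache dict grows without bound in a long-running process.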
> Whatever it's supposed to be,
> why isn't spamprobcache.get *itself* responsible for returning a spamprob,
> instead of making its caller deal with two cases?
I thought I was doing your performance needs a favor <wink>.
> If the answer is "it's
> supposed to be a dict, so .get ain't that smart",
That's the answer.
> then the memory burden for
> a long-running scorer process will zoom, negating one of the benefits people
> attached to "real databases" thought they were buying in return for giant
> files and slothful performance <wink>.
Right. If a float takes up 20 bytes in memory (just a guess, no time to look),
then for a database of 100000 words (that's roughly the size of my personal db)
the memory burden is 100000 * (8 + 20) bytes, almost three megs.
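That back-of-envelope arithmetic checks out (the 20 bytes per float and 8 bytes per dict slot are the guesses from above, not measured values):

```python
# Rough memory estimate for caching one spamprob float per word.
# Both per-entry sizes are guesses, as in the message.
words = 100000
bytes_per_entry = 8 + 20   # dict slot + float object (guessed)
total = words * bytes_per_entry
print(total / 1000000.0)   # megabytes
```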
Just in case the higher memory usage is not an issue, there's a simpler
approach: don't store spamprob in the db, but call bayes.update_probabilities()
on startup. update_probabilities() takes about 2 seconds with my db on my lowly
400MHz PPC (hm, that's using pickle, so it will take a lot longer with a real
database :-( ).
You can tell I'm thinking mostly about long running processes...
I guess you're right, one size doesn't fit all. One last idea for this morning:
how about splitting the db into a training db (storing hamcount and spamcount)
and a classifying db (storing only spamprob)?
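An illustrative sketch of that split, assuming plain mapping interfaces for both stores (all names here are hypothetical, and the probability formula is a simplified Graham-style ratio, not the real classifier's):

```python
def update_probabilities(training_db, classify_db, nham, nspam):
    """Recompute spamprob for every trained word and write only the
    derived float to the classifying store.

    training_db: word -> (hamcount, spamcount)
    classify_db: word -> spamprob
    """
    for word, (hamcount, spamcount) in training_db.items():
        # Simplified probability: fraction of the word's normalized
        # occurrences that are spam. The real classifier is more involved.
        hamratio = hamcount / float(max(nham, 1))
        spamratio = spamcount / float(max(nspam, 1))
        classify_db[word] = spamratio / (hamratio + spamratio)
```

The point of the split is that a long-running scorer only needs to open the (smaller) classifying db, while training and the periodic update_probabilities() pass touch the training db.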
> Life would be easier if databaseheads trained all they liked as often as
> they liked, but refrained from calling update_probabilities() until the end
> of the day (or other "quiet time"). The idea that the model should be
> updated after every msg trained on is an extreme.
Good points.
Just