[Spambayes] Upgrade problem

Just van Rossum just@letterror.com
Fri Nov 8 07:54:04 2002


Tim Peters wrote:

> [Just van Rossum]
> > I think it can be done with almost no extra overhead with a
> > caching scheme.  This assumes (probably wrongly <wink>) that
> > the cache stays in memory between runs.
> > Something like this perhaps:
> >
> > *** classifier.py   Thu Nov  7 23:03:07 2002
> > --- classifier.py.hack  Fri Nov  8 00:04:05 2002
> > ***************
> > *** 456,459 ****
> > --- 456,460 ----
> >
> >           wordinfoget = self.wordinfo.get
> > +         spamprobget = self.spamprobcache.get
> >           now = time.time()
> >           for word in Set(wordstream):
> > ***************
> > *** 463,467 ****
> >               else:
> >                   record.atime = now
> > !                 prob = record.spamprob
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
> > --- 464,470 ----
> >               else:
> >                   record.atime = now
> > !                 prob = spamprobget(word)
> > !                 if prob is None:
> > !                     prob = self.calcspamprob(word, record)
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
> 
> Sorry, I don't know what this is trying to accomplish.  Like, what is
> self.spamprobcache?  There's no such thing now, and the patch doesn't appear
> to create one (i.e., this code doesn't run). 

Tim, don't be such a programmer <wink>. But ok, I promise I'll never post
pseudocode as a patch again...
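
For the archive, here's roughly what I meant, as code that actually runs.
A minimal sketch: spamprobcache is the hypothetical dict from the patch, and
compute_spamprob() is a crude stand-in for whatever derives spamprob from a
word's counts -- neither exists in classifier.py today:

    class CachingClassifier:
        def __init__(self):
            self.spamprobcache = {}  # word -> cached spamprob

        def compute_spamprob(self, record):
            # Crude stand-in for the real spamprob math; the real
            # thing derives spamprob from the hamcount/spamcount
            # stored in the word's record.
            hamcount, spamcount = record
            return float(spamcount) / (hamcount + spamcount)

        def cached_spamprob(self, word, record):
            # Try the in-memory cache first.
            prob = self.spamprobcache.get(word)
            if prob is None:
                # Cache miss: compute from the counts and remember
                # the result -- the step my pseudocode left out.
                prob = self.compute_spamprob(record)
                self.spamprobcache[word] = prob
            return prob

The point being that spamprob never has to live in the db at all: only the
counts do, and the probabilities accumulate in memory as words are seen.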

> Whatever it's supposed to be,
> why isn't spamprobcache.get *itself* responsible for returning a spamprob,
> instead of making its caller deal with two cases? 

I thought I was doing your performance needs a favor <wink>.

> If the answer is "it's
> supposed to be a dict, so .get ain't that smart",

That's the answer.

> then the memory burden for
> a long-running scorer process will zoom, negating one of the benefits people
> attached to "real databases" thought they were buying in return for giant
> files and slothful performance <wink>.

Right. If a float takes up 20 bytes in memory (just a guess, no time to look),
then for a database of 100000 words (that's roughly the size of my personal db)
the memory burden is 100000 * (8 + 20) bytes, almost three megs.
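
Back-of-the-envelope, for anyone who wants to plug in better numbers (the
20 bytes per float is still a guess):

    words = 100000      # roughly the size of my personal db
    per_word = 8 + 20   # guessed: dict slot plus float object
    print(words * per_word)   # 2800000 bytes, i.e. ~2.7 megs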

Just in case the higher memory usage is not an issue, there's a simpler
approach: don't store spamprob in the db at all, but call
bayes.update_probabilities() on startup. update_probabilities() takes about
2 seconds on my db on my lowly 400MHz PPC (hm, that's using a pickle, so it
will be a lot slower when using a real db :-( ). You can tell I'm thinking
mostly about long-running processes...
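
For a long-running process the startup would then look something like this --
just a sketch, with a made-up filename and assuming the classifier is stored
as a plain pickle:

    import pickle

    # Load the trained classifier; the db only holds hamcount and
    # spamcount, no spamprob.
    f = open("bayes.pik", "rb")
    bayes = pickle.load(f)
    f.close()

    # Recompute every spamprob once, up front (~2 seconds here),
    # instead of storing them in the db.
    bayes.update_probabilities()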

I guess you're right, one size doesn't fit all. One last idea for this morning:
how about splitting the db into a training db (storing hamcount and spamcount)
and a classifying db (storing only spamprob)?
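
Concretely, a training session would end by regenerating the classifying db
from the training db. A sketch, using plain dicts as stand-ins for the two
databases and a made-up compute_spamprob():

    def rebuild_classifying_db(training_db, classifying_db,
                               compute_spamprob):
        # training_db:    word -> (hamcount, spamcount)
        # classifying_db: word -> spamprob
        for word, (hamcount, spamcount) in training_db.items():
            classifying_db[word] = compute_spamprob(hamcount, spamcount)

The classifying db stays small and read-only, while the counts live where
only the trainer needs them.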

> Life would be easier if databaseheads trained all they liked as often as
> they liked, but refrained from calling update_probabilities() until the end
> of the day (or other "quiet time").  The idea that the model should be
> updated after every msg trained on is an extreme.

Good points.
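
In other words, something like this for the databaseheads -- a sketch,
assuming learn() still takes the update_probabilities flag, and with
todays_messages as a made-up iterable of (tokens, is_spam) pairs:

    # Train on the whole day's batch without recomputing the model
    # after each message...
    for tokens, is_spam in todays_messages:
        bayes.learn(tokens, is_spam, update_probabilities=False)

    # ...then update all the probabilities once, during quiet time.
    bayes.update_probabilities()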

Just


