[Spambayes] Database reduction

Tim Peters tim.one@comcast.net
Mon Nov 4 23:49:16 2002


[Neale Pickett]
> Right.  I had some code in hammie to pickle the tuple instead of the
> object itself, but I thought it was a pretty gnarly kludge at the time.
> In any case, some variation on this seems obviously the right way to go.

If you use __getstate__() to get the tuple, there's nothing objectionable
about it:  it's the *purpose* of __getstate__/__setstate__ to get/set state
into/from tuples.  Objectionable would be to access the fields directly
yourself by name, since they may change over time.  There's a problem here,
though, in that only the Bayes class saves a PICKLE_VERSION identifier in
its pickles; changes in WordInfo structure can't be transparent to old
databases unless WordInfo pickles contain a version identifier too.
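
To make that concrete, here is a minimal sketch of pickling the state tuple
with a version identifier in front. It is not the actual spambayes code: the
field names, WORDINFO_VERSION, dump_state() and load_state() are all
assumptions made up for illustration.

```python
import pickle

# Hypothetical version tag for WordInfo state; the thread only says that
# the Bayes class stores a PICKLE_VERSION.
WORDINFO_VERSION = 1


class WordInfo:
    # Illustrative fields only; the real class may differ.
    __slots__ = ("atime", "spamcount", "hamcount")

    def __init__(self, atime=0, spamcount=0, hamcount=0):
        self.atime = atime
        self.spamcount = spamcount
        self.hamcount = hamcount

    def __getstate__(self):
        # The version goes first so a reader can tell old layouts apart.
        return (WORDINFO_VERSION, self.atime, self.spamcount, self.hamcount)

    def __setstate__(self, state):
        version, fields = state[0], state[1:]
        if version != WORDINFO_VERSION:
            # A real upgrade path would convert older layouts here.
            raise ValueError("unsupported WordInfo state version: %r" % (version,))
        self.atime, self.spamcount, self.hamcount = fields


def dump_state(info):
    """Pickle only the state tuple, not the object itself."""
    return pickle.dumps(info.__getstate__())


def load_state(data):
    """Rebuild a WordInfo from a pickled state tuple."""
    info = WordInfo.__new__(WordInfo)
    info.__setstate__(pickle.loads(data))
    return info
```

Callers never touch the fields by name, so the layout can change as long as
__setstate__ learns to read the older versions.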

>> I'd avoid all that and pickle the states, but that's just me.

> I'm inclined to agree with you.  If I do this, though, we have to all
> agree on a convention: if you need to modify a wordinfo object, you
> *must* write it back to the dictionary.  Otherwise hammie will never
> know it changed.  I was bitten by this a few times at first, and I
> haven't played with the code enough to know if any of this has crept
> back in.

I fixed one of those today.  The database still isn't getting updated with
the new word atimes during scoring, but I've ignored that because nobody has
made any use of atimes yet.

I have to say it's painful to do these redundant stores -- it generally
doubles the number of dict operations, and that's a speed drag.  However,
compared to I/O and tokenization times, it appears to be a minor drag at
worst.
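
For anyone who hasn't been bitten yet, here is a small sketch of why the
store is needed. PickledStore is a hypothetical stand-in for a dbm-backed
dictionary, not hammie's actual class, and the [spamcount, hamcount] list is
just an illustrative record.

```python
import pickle


class PickledStore:
    """Hypothetical dict-like store that keeps every value as pickled bytes."""

    def __init__(self):
        self._raw = {}

    def __getitem__(self, key):
        # Each lookup unpickles a fresh, independent copy.
        return pickle.loads(self._raw[key])

    def __setitem__(self, key, value):
        self._raw[key] = pickle.dumps(value)


store = PickledStore()
store["viagra"] = [0, 0]      # illustrative [spamcount, hamcount]

# Mutating the fetched copy alone never reaches the store.
record = store["viagra"]
record[0] += 1
lost = store["viagra"]

# The redundant store is what makes the change stick.
record = store["viagra"]
record[0] += 1
store["viagra"] = record
kept = store["viagra"]
```

Here lost still shows the original counts, while kept shows the increment:
one extra dict operation per update, which is the doubling described above.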

> Would it be out of line to alter WordInfo to be immutable, to encourage
> folks to write it back to the dictionary?

I've done enough bending backwards for a subsystem I don't use <wink>.
There are only a handful of places these structs mutate.