pickle.py and cPickle.c - persistent_id() is always called - why?

Mon Apr 24 11:55:59 EDT 2000

(Previously posted via Deja.com by matsaleh but I never saw the post on
usenet - forgive me if this shows up more than once.)

Hello fellow .py types...

I am doing a bit of work using pickle/cPickle and am trying to
optimize. I found that my user-defined persistent_id() function is being
called for (just about) every attribute (name and value) in my object that
is being pickled.

My specific (somewhat ad-hoc) test case pickles ~100 objects in a
containment hierarchy, using persistent_id to break the containment
relationship and replace the references with a proprietary object id.
Although I have only ~100 object references to resolve, persistent_id() is
called ~6400 times. In tracing the code, it appears to be called for every
attribute and value in my objects.

This appears to be caused by the fact that pickle.py:save() is called with
the default pers_save flag of 0 in all cases except for when it is called by
save_pers(). In other words, objects of every type (tuples, dicts,
sequences, longs, strings, etc.) cause my persistent_id() to be called, when
all I want is for it to be called for object instances.

I modified pickle.py to change the default of the pers_save flag to 1 in the
save() method, and then call it with a 0 only from within the save_inst()
method. This amounts to invoking my persistent_id()
function only when a reference to an object instance is encountered as my
objects are being pickled.

My pickled objects do not appear to be adversely affected by this change,
and the number of calls to persistent_id() was reduced from ~6400 to ~500,
reducing the time spent in this method by an order of magnitude (0.37 sec to
0.03 sec).
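
One way to check that pickles are unaffected is a round trip through a
matching persistent_load(). The following is only a sketch using the
Python 3 Pickler/Unpickler subclass API, not the test described above. The
Part class, the registry dict, and the id format are assumptions made for
illustration.

```python
import io
import pickle


class Part:
    """Hypothetical contained object, looked up by id when unpickling."""

    def __init__(self, oid):
        self.oid = oid


class IdPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Replace references to Part instances with a small id tuple.
        if isinstance(obj, Part):
            return ("Part", obj.oid)
        return None


class IdUnpickler(pickle.Unpickler):
    def __init__(self, file, registry):
        super().__init__(file)
        self.registry = registry   # assumed mapping of oid -> live object

    def persistent_load(self, pid):
        kind, oid = pid
        if kind != "Part":
            raise pickle.UnpicklingError("unsupported persistent id: %r" % (pid,))
        return self.registry[oid]


def round_trip(container, registry):
    buf = io.BytesIO()
    IdPickler(buf).dump(container)
    buf.seek(0)
    return IdUnpickler(buf, registry).load()


parts = [Part(i) for i in range(3)]
registry = {p.oid: p for p in parts}
original = {"label": "assembly", "parts": parts}
restored = round_trip(original, registry)
```

If the references were resolved correctly, restored is a new dict whose
"parts" entries are the very same Part objects held in the registry.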

I have not yet tested this change in cPickle.c, but in my original tests,
persistent_id() was a much more significant factor in my runs using cPickle,
because all the pickling code is in C and is relatively much faster than my
persistent_id() method, which is in Python. I expect the performance boost
from making the change in cPickle to be even greater, relatively speaking.

My question is, why is persistent_id() being invoked so often? I do not see
the reason for calling it for basic Python types and structures such as
tuples, dicts, and the like. Is this a reasonable change to make to the
Python source, or is it not a safe change for the general pickling cases
that the pickle/cPickle modules have to handle? I'm sure this code has been
scrutinized by many folks much more experienced than I - I would be
grateful for your insights.

All timings and function counts done by profile.py and pstats.py.
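
For anyone who wants to collect similar numbers, here is a minimal sketch of
that kind of measurement. It uses cProfile (the C counterpart of profile.py)
together with pstats on Python 3. The Part class and the workload are
hypothetical stand-ins, not the test case described above.

```python
import cProfile
import io
import pickle
import pstats


class Part:
    """Hypothetical contained object replaced by an id when pickling."""

    def __init__(self, oid):
        self.oid = oid


class IdPickler(pickle.Pickler):
    def persistent_id(self, obj):
        if isinstance(obj, Part):
            return ("Part", obj.oid)   # assumed id format
        return None


def workload():
    # Stand-in for the real object hierarchy being pickled.
    container = {"label": "assembly", "parts": [Part(i) for i in range(100)]}
    IdPickler(io.BytesIO()).dump(container)


def profile_report():
    profiler = cProfile.Profile()
    profiler.runcall(workload)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Restrict the listing to entries whose name matches "persistent_id".
    stats.sort_stats("cumulative").print_stats("persistent_id")
    return out.getvalue()


report = profile_report()
print(report)
```

The ncalls column of the report gives the number of persistent_id()
invocations, and tottime gives the time spent inside the method itself.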

Regards.