performance of pickling & large lists :-(

Jeff Epler jepler at unpythonic.net
Tue Aug 6 20:41:16 CEST 2002


Python's "pickle" keeps a "memo" dict with roughly one entry per
pickled object, of the form
    { id(obj) : obj }
This is used to make sure that a structure like 'y' in
    y = [[]]*2
is pickled correctly, and that when reloaded, 'y[0] is y[1]' still holds.
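(In today's Python the same sharing-preservation is easy to see; this is a
modern sketch, not a python1.5 session:)

```python
import pickle

y = [[]] * 2                       # y[0] and y[1] are the same list object
z = pickle.loads(pickle.dumps(y))
assert z[0] is z[1]                # the memo preserves the sharing
z[0].append(1)
assert z[1] == [1]                 # mutating one shows through the other
```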

This accounts for most, if not all, of pickle's memory usage.

If you are serializing a data structure which has a tree shape,
rather than a fully general graph shape, you could use something
like pickle but without the "memo" data member.  Something like the
following might work:
    import pickle

    class emptydict:
        # Dict stand-in that remembers nothing, so every memo
        # lookup misses and nothing accumulates.
        def __setitem__(self, k, v): pass
        def __getitem__(self, k): raise KeyError, k
        def has_key(self, k): return 0
        def get(self, k, d=None): return d
        def __len__(self): return 0

    class AmnesiacPickler(pickle.Pickler):
        def __init__(self, file, bin=0):
            pickle.Pickler.__init__(self, file, bin)
            self.memo = emptydict()   # forget everything we pickle
I tested it minimally (on python1.5), and it seemed to work.  Items like
the following unpickled properly:
    "hi there"
    [1, 2, 3], [4, 5, 6]
and the following unpickled as something with the same value, but with
identity wrong:
    [[]] * 2
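(In current Python you can get the same memo-free behaviour without
subclassing: the `Pickler.fast` attribute, documented though discouraged,
disables the memo.  A modern sketch of the value-equal-but-identity-lost
result:)

```python
import io
import pickle

buf = io.BytesIO()
p = pickle.Pickler(buf)
p.fast = True                # fast mode: the pickler skips its memo
p.dump([[]] * 2)             # two references to one inner list
z = pickle.loads(buf.getvalue())
assert z == [[], []]         # same value...
assert z[0] is not z[1]      # ...but the shared identity is gone
```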
This object 'r' caused fits:
    r = []
    r.append(r)
It died when it ran out of stack space: without the memo, the pickler
couldn't recognize that it was recursively pickling the same object.
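(For contrast, the stock pickler's memo is exactly what makes this case
work; a modern sketch:)

```python
import pickle

r = []
r.append(r)               # r contains itself
s = pickle.dumps(r)       # the memo encodes the cycle as a back-reference
r2 = pickle.loads(s)
assert r2[0] is r2        # the cycle survives the round trip
```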

Jeff




More information about the Python-list mailing list