[Python-Dev] Unpickling memory usage problem, and a proposed solution

Alexandre Vassalotti alexandre at peadrop.com
Fri Apr 23 22:53:52 CEST 2010


On Fri, Apr 23, 2010 at 3:57 PM, Dan Gindikin <dgindikin at gmail.com> wrote:
> This wouldn't help our use case, your code needs the entire pickle
> stream to be in memory, which in our case would be about 475mb, this
> is on top of the 300mb+ data structures that generated the pickle
> stream.
>

In that case, the best we could do is a two-pass algorithm to remove
the unused PUTs. That won't be efficient, but it will satisfy the
memory constraint. Another solution is to not generate the PUTs at all
by setting the 'fast' attribute on Pickler. But that won't work if you
have a recursive structure, or have code that requires the identity of
objects to be preserved.
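
To make the two-pass idea concrete, here is a rough sketch, not a tested
implementation. The helper name strip_unused_puts is made up for this
example, the io.BytesIO objects stand in for files on disk, and the code
assumes a protocol 2 (or 3) stream, since the MEMOIZE and FRAME opcodes of
later protocols would need extra handling. It relies on
pickletools.genops() accepting a file-like object, so only one opcode is
held in memory at a time.

```python
import io
import pickle
import pickletools

PUT_OPS = {"PUT", "BINPUT", "LONG_BINPUT"}
GET_OPS = {"GET", "BINGET", "LONG_BINGET"}

def strip_unused_puts(src, dst):
    """Copy one pickle from seekable file `src` to `dst` without unused PUTs.

    Hypothetical helper; assumes protocol <= 3 (no MEMOIZE or FRAME opcodes).
    """
    start = src.tell()

    # Pass 1: record every memo index that a GET opcode reads back.
    used = set()
    for opcode, arg, pos in pickletools.genops(src):
        if opcode.name in GET_OPS:
            used.add(arg)
        elif opcode.name in ("MEMOIZE", "FRAME"):
            raise ValueError("protocol 4+ streams are not handled by this sketch")

    # Pass 2: copy the stream one opcode at a time, skipping dead PUTs.
    src.seek(start)
    for opcode, arg, pos in pickletools.genops(src):
        if opcode.name in PUT_OPS and arg not in used:
            continue
        end = src.tell()                # genops has just consumed this opcode
        src.seek(pos)
        dst.write(src.read(end - pos))  # leaves src positioned at `end` again

# Usage: BytesIO objects stand in for the on-disk source and destination.
x = [1, 2]
src = io.BytesIO(pickle.dumps([x, x], protocol=2))
dst = io.BytesIO()
strip_unused_puts(src, dst)
a, b = pickle.loads(dst.getvalue())
```

The set of used memo indices is the only state kept between the passes,
so the extra memory is proportional to the number of GETs rather than to
the size of the stream. The price is reading the stream twice.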

>>> import io, pickle, pickletools
>>> x=[1,2]
>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, protocol=-1)
>>> p.dump([x,x])
>>> pickletools.dis(f.getvalue())
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: ]        EMPTY_LIST
    7: q        BINPUT     1
    9: (        MARK
   10: K            BININT1    1
   12: K            BININT1    2
   14: e            APPENDS    (MARK at 9)
   15: h        BINGET     1
   17: e        APPENDS    (MARK at 5)
   18: .    STOP
highest protocol among opcodes = 2
>>> [id(obj) for obj in pickle.loads(f.getvalue())]
[20966504, 20966504]

Now with the 'fast' mode enabled:

>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, protocol=-1)
>>> p.fast = True
>>> p.dump([x,x])
>>> pickletools.dis(f.getvalue())
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: (    MARK
    4: ]        EMPTY_LIST
    5: (        MARK
    6: K            BININT1    1
    8: K            BININT1    2
   10: e            APPENDS    (MARK at 5)
   11: ]        EMPTY_LIST
   12: (        MARK
   13: K            BININT1    1
   15: K            BININT1    2
   17: e            APPENDS    (MARK at 12)
   18: e        APPENDS    (MARK at 3)
   19: .    STOP
highest protocol among opcodes = 2
>>> [id(obj) for obj in pickle.loads(f.getvalue())]
[20966504, 21917992]

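As you can observe, the pickle stream generated with the fast mode
might actually be bigger: the shared list is written out twice (20
bytes instead of 19), and the two elements come back as distinct
objects.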
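
The two caveats of the fast mode (lost identity and no support for
recursive structures) can be checked with a short script. This is only a
sketch: dump_fast is a name invented for the example, and the exception
type for the cyclic case is an assumption that depends on which Pickler
implementation is in use, so both likely candidates are caught.

```python
import io
import pickle

def dump_fast(obj):
    """Pickle `obj` with the memo disabled (hypothetical helper)."""
    buf = io.BytesIO()
    p = pickle.Pickler(buf, protocol=2)
    p.fast = True  # disable the memo, so no PUT/GET opcodes are emitted
    p.dump(obj)
    return buf.getvalue()

# Shared reference: the fast stream repeats the inner list.
shared = [1, 2]
fast_bytes = dump_fast([shared, shared])
normal_bytes = pickle.dumps([shared, shared], protocol=2)
first, second = pickle.loads(fast_bytes)

# Recursive structure: without a memo the pickler cannot terminate normally.
cyclic = []
cyclic.append(cyclic)
try:
    dump_fast(cyclic)
    cyclic_failed = False
except (ValueError, RecursionError):
    cyclic_failed = True
```

With the memo enabled, the same cyclic list round-trips without trouble,
which is why the fast mode is only an option for data known to be
tree-shaped.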

By the way, it is weird that the total memory usage of the data
structure is smaller than the size of its respective pickle stream.
What pickle protocol are you using?

-- Alexandre

