[Python-Dev] dict.setdefault(object, object) instead of "sys.intern()" (was Re: sys.intern should work on bytes)

Stefan Behnel stefan_ml at behnel.de
Sat Sep 21 11:15:28 CEST 2013

Jesus Cea, 20.09.2013 15:46:
> On 20/09/13 15:33, Benjamin Peterson wrote:
>> Well, the pickler should memoize bytes objects if you have lots of
>> the same one in a pickle...
> Only if they are the very same object. Not different bytes objects with
> the same value. Pickle doesn't do "a==b" but "id(a)==id(b)".
> Yes, I know that "a==b" would break mutable objects. It is just an
> example.
> I don't want to pursue that path. Performance of pickle is already
> appallingly slow.
> In my project, I will do the redundancy removal in my own way, as
> explained in another message on this thread.
> Example:
> * Original pickle: 14416284 bytes
> * Pickle with "interned" strings: 3004880 bytes
> (quite an improvement, but this is particular to my case, I have a lot
> of string duplications here. The pickle also loads a bit faster)
> * Pickle including an extra dictionary of "interned" strings, created
> using the "interned.setdefault(object,object)" pattern: 5126587 bytes.
> Sniff.
> Could I do this more compactly?
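
For readers following along, the "interned.setdefault(object, object)"
pattern quoted above can be sketched roughly as below. This is only an
illustration, not the code from the original project: the helper name
intern_value, the table dict and the sample payload are made up here. It
relies on the pickler memoizing by object identity, so equal values only
collapse into backreferences once they are the very same object.

```python
import pickle


def intern_value(table, value):
    # Hypothetical helper: return the first object seen with this value,
    # so that later equal objects are replaced by one canonical instance.
    return table.setdefault(value, value)


# Made-up sample data: 1000 equal but distinct bytes objects.
raw = [bytes(bytearray(b"some repeated payload")) for _ in range(1000)]

table = {}
interned = [intern_value(table, item) for item in raw]

# The pickler memoizes by id(), so only the interned list is written
# as one payload followed by short memo references.
plain_size = len(pickle.dumps(raw))
interned_size = len(pickle.dumps(interned))

print("plain pickle:   ", plain_size, "bytes")
print("interned pickle:", interned_size, "bytes")
```

Note that the table dict is only needed while building the data; pickling
it along with the data is what costs the extra space mentioned in the last
bullet of the quoted message.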

ISTM that what you are looking for is a compression-like pattern that
efficiently encodes repeated literals (i.e. constants of safe types) in the
pickle. That could be achieved by extending the pickle protocol to include
backreferences to earlier objects, I guess (I'm not all that familiar with
the internals of the pickle format). Any of the well known compression
algorithms that are capable of handling streaming data would apply here.
Assuming you don't want to simply send the pickle output through gzip &
friends, that is...
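
As a rough sketch of the "gzip & friends" route (sample data and variable
names are invented for illustration, and the actual savings depend entirely
on the data), the pickle output can simply be passed through the standard
gzip module without any interning step:

```python
import gzip
import pickle

# Made-up sample data: many equal but distinct bytes objects, which the
# pickler alone would write out in full every time.
records = [bytes(bytearray(b"some repeated payload")) for _ in range(1000)]

raw_pickle = pickle.dumps(records)
compressed = gzip.compress(raw_pickle)

# Round trip: decompress first, then unpickle.
restored = pickle.loads(gzip.decompress(compressed))

print("raw pickle: ", len(raw_pickle), "bytes")
print("gzip pickle:", len(compressed), "bytes")
```

This shrinks the stored pickle but, unlike interning, the loaded objects
are still separate copies in memory.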

It also seems to me that python-dev isn't the right place to discuss this.
python-ideas seems more appropriate for now.

