[Python-ideas] new pickle semantics/API

tomer filiba tomerfiliba at gmail.com
Fri Jan 26 10:29:50 CET 2007


On 1/26/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> *I* use pickle when I want speed.  I use repr and unrepr (from configobj)
> when I want to be able to change things by hand.  None of the objects I
> transfer between processes/disk/whatever are ever more than base Python
> types.
>
> As for whatever *others* do, I don't know, I haven't done a survey.
> They may do as you say.

well, since this proposal originated from an RPC point-of-view,
i, for one, also need to transfer full-blown user-defined objects
and classes. for that, i need to be able to get the full state of
the object, and the means to reconstruct it later.

pickling the "primitives" was never a problem, because they have
a well defined interface. although cyclic references call for something
stronger than repr... and this is what i mean by "repr is for humans".
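
a small illustration of the cyclic-reference point, as a sketch only (the
variable names are made up for the example): repr() elides the cycle, so the
text cannot be turned back into the original object, while pickle keeps it.

```python
import pickle

# a list that contains itself
cyclic = []
cyclic.append(cyclic)

# repr() replaces the inner reference with a "[...]" placeholder,
# so the structure is lost in the text form
text = repr(cyclic)

# pickle records the shared reference and restores the cycle
restored = pickle.loads(pickle.dumps(cyclic))
cycle_preserved = restored[0] is restored
```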

> I never claimed it was a substitute, I stated quite clearly up front;
> "My only criticism will then be the strawman repr/unrepr."  The
> alternate strawman is repr/eval, which can be made to support basically
> everything except for self-referential objects.  It still has that
> horrible security issue, but that also exists in pickle, and could still
> exist in the simplification/rebuilding routines specified by Tomer.

well, pickle is unsafe for one main reason -- it's omnipotent.
it performs arbitrary imports and object instantiation, which are
equivalent to eval of arbitrary strings.

BUT, that has nothing to do with the way it gets or sets the *state*
of objects. to solve the security problem, we can have a
"capability-based pickle": dumping objects was never a security issue,
it's the loading part that's dangerous.

we can add a new function, loadsec(), that takes both the string
to load and a set of classes it may use to __rebuild__ the object.

import os

# loadsec() is the proposed function; it does not exist in pickle today
capabilities = {"list" : list, "str" : str, "os.stat_result" : os.stat_result}
loadsec(data, capabilities)

that way, only the classes you explicitly trust can be instantiated,
so no arbitrary code can be executed behind the scenes.
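
a rough sketch of how loadsec() could look, not a definitive implementation.
it assumes the pickle module of current python, where pickle.Unpickler lets a
subclass override find_class(); the CapabilityUnpickler name and the
"module.name" key format are assumptions made for this example. builtin
containers like list and str never go through find_class(), so only
non-primitive classes need an entry.

```python
import collections
import io
import pickle
from datetime import date


class CapabilityUnpickler(pickle.Unpickler):
    """unpickler that only resolves classes listed in `capabilities`."""

    def __init__(self, file, capabilities):
        super().__init__(file)
        self._capabilities = capabilities

    def find_class(self, module, name):
        # called whenever the pickle stream asks for a global name
        key = "%s.%s" % (module, name)
        try:
            return self._capabilities[key]
        except KeyError:
            raise pickle.UnpicklingError(
                "%s is not in the capability set" % key) from None


def loadsec(data, capabilities):
    """hypothetical loadsec(): load `data` using only the given classes."""
    return CapabilityUnpickler(io.BytesIO(data), capabilities).load()


capabilities = {"datetime.date": date}

# allowed: primitives plus a class we listed
ok = loadsec(pickle.dumps([1, "two", date(2007, 1, 26)]), capabilities)

# refused: OrderedDict is not in the capability set
try:
    loadsec(pickle.dumps(collections.OrderedDict()), capabilities)
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```

note that this only limits which classes get looked up; whatever the listed
classes do when they are reconstructed still runs, so the set should contain
only classes you really trust.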

for the "classic" unsafe load(), we can pass a magical dict-like
thing that imports names via __getitem__
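
the dict-like thing could be as small as the sketch below. the class name and
the "module.name" key format are assumptions for illustration; bare names
without a module part (like "list") are simply not handled here.

```python
import collections
import importlib


class ImportingCapabilities:
    """dict-like object that resolves "module.name" keys by importing them."""

    def __getitem__(self, key):
        # split "package.module.attr" into module path and attribute name
        module_name, _, attr = key.rpartition(".")
        if not module_name:
            raise KeyError(key)
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError):
            raise KeyError(key) from None


everything = ImportingCapabilities()
resolved = everything["collections.OrderedDict"]

try:
    everything["collections.no_such_name"]
    missing_raises = False
except KeyError:
    missing_raises = True
```

passing such an object where a real dict of trusted classes is expected gives
back the "classic" behaviour: any importable name can be resolved, so it is
just as unsafe as load() is today.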

if i had a way to control what pickle.loads has access to, i
wouldn't need to write my own serializer..
http://sebulbasvn.googlecode.com/svn/trunk/rpyc/core/marshal/brine.py

> > you may have digressed, but that's a good point -- that's exactly
> > why i do NOT specify how objects are encoded as a stream of bytes.
> >
> > all i'm after is the state of the object (which is expressed in terms of
> > other, more primitive objects).
>
> Right, but as 'primitive objects' go, you can't get significantly more
> primitive than producing a string that can be naively understood by
> someone familiar with Python *and* the built-in Python parser.

but that's not the issue. when i send my objects across a socket
back and forth, i want something that is fast, compact, and safe.
i don't care for anyone sniffing my wire to "easily understand"
what i'm sending... i mean, it's not like i encrypt the data, but
readability doesn't count here.

again, __simplify__ and __rebuild__ offer a mechanism for
serializer-implementors, which may choose different encoding
schemes for their internal purposes. this API isn't meant for
the end user.
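
to make the division of labour concrete, here is a toy sketch. the exact
signatures of __simplify__ and __rebuild__ are not fixed in this mail, so the
ones below (an instance method returning primitive state, and a classmethod
taking that state) are assumptions, as are the Point class and the json-based
dumps()/loads() helpers. the point is only that the object exposes its state,
while the serializer picks the encoding.

```python
import json


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __simplify__(self):
        # assumed protocol: express the state in more primitive objects
        return {"x": self.x, "y": self.y}

    @classmethod
    def __rebuild__(cls, state):
        # assumed protocol: reconstruct an instance from that state
        return cls(state["x"], state["y"])


def dumps(obj):
    # this toy serializer happens to pick json; another could pick
    # a compact binary format without touching the Point class
    return json.dumps({"type": type(obj).__name__,
                       "state": obj.__simplify__()})


def loads(text, capabilities):
    record = json.loads(text)
    cls = capabilities[record["type"]]  # only classes the caller allows
    return cls.__rebuild__(record["state"])


p = loads(dumps(Point(1, 2)), {"Point": Point})
```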

i'll write a pre-pep to try to clarify it all in an orderly manner.


-tomer


