
---------- Forwarded message ----------
From: tomer filiba <tomerfiliba@gmail.com>
Date: Jan 24, 2007 10:45 PM
Subject: new pickle semantics/API
To: Python-3000@python.org

i'm having great trouble in RPyC with pickling object proxies. several users have asked for this feature, but no matter how hard i try to "bend the truth", pickle always complains. it uses type(obj) for the dispatching, which "uncovers" that the object is actually a proxy, rather than a real object.

recap: RPyC uses local proxies that refer to objects of a remote interpreter (another process/machine).

if you've noticed, every RPC framework has its own serializer. for example banana/jelly in twisted and a bunch of other XML serializers, and what not. for RPyC i wrote yet another serializer, but for different purposes, so it's not relevant for the issue at hand.

what i want is a standard serialization *API*. the idea is that any framework could make use of this API, and that it would be generic enough to eliminate copy_reg and other misfortunes. this also means the built in types should be familiarized with this API.

- - - - - - - -

for example, currently the builtin types don't support __reduce__, and require pickle to use its own internal registry. moreover, __reduce__ is very pickle-specific (i.e., it takes the protocol number). what i'm after is an API for "simplifying" complex objects into simpler parts.

here's the API i'm suggesting:

    def __getstate__(self):
        # return a tuple of (type(self), obj), where obj is a simplified
        # version of self
        ...

    @classmethod
    def __setstate__(cls, state):
        # return an instance of cls, with the given state
        ...

well, you may already know these two, although their semantics are different. but wait, there's more!

the idea is of having the following simple building blocks:
* integers (int/long)
* strings (str)
* arrays (tuples)

all picklable objects should be able to express themselves as a collection of these building blocks. of course this will be recursive, i.e., object X could simplify itself as object Y, where object Y might undergo further simplification, until we are left with building blocks only.

for example:
* int - return self
* float - string in the format "[+-]X.YYYe[+-]EEE"
* complex - two floats
* tuple - tuple of its simplified elements
* list - tuple of its simplified elements
* dict - a tuple of (key, value) tuples
* set - a tuple of its items
* file - raises TypeError("can't be simplified")

all in all, i choose to call that *simplification* rather than *serialization*, as serialization is more about converting the simplified objects into a sequence of bytes. my suggestion leaves that out for the implementers of specific serializers.

so this is how a typical serializer (e.g., pickle) would be implemented:
* define its version of a "recursive simplifier"
* optionally use a "memo" to remember objects that were already visited (so they would be serialized by reference rather than by value)
* define its flavor of converting ints, strings, and arrays to bytes (binary, textual, etc. etc.)

- - - - - - - -

the default implementation of __getstate__, in object.__getstate__, will simply return self.__dict__ and any self.__slots__

this removes the need for __reduce__, __reduce_ex__, and copy_reg, and simplifies pickle greatly. it does require, however, adding support for simplification for all builtin types... but this doesn't call for much code:

    def PyList_GetState(self):
        state = tuple(PyObject_GetState(item) for item in self)
        return PyListType, state

also note that it makes the copy module much simpler:

    def copy(obj):
        state = obj.__getstate__()
        return type(obj).__setstate__(state)

- - - - - - - -

executive summary: simplifying object serialization and copying by revising __getstate__ and __setstate__, so that they return a "simplified" version of the object.

this new mechanism should become an official API for getting or setting the "contents" of objects (either builtin or user-defined).

having this unified mechanism, pickling proxy objects would work as expected.

if there's interest, i'll write a pep-like document to explain all the semantics.

-tomer
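A rough sketch of the "recursive simplifier" described in the message above may help make the idea concrete. Everything named here is an assumption made for illustration: the functions simplify and rebuild and the Point class are hypothetical, and the hook names __simplify__ and __rebuild__ are borrowed from the naming discussion later in the thread so that the example does not collide with the existing pickle meaning of __getstate__ and __setstate__. Floats are written with repr() rather than the exact "[+-]X.YYYe[+-]EEE" format, and no memo is kept, so shared or cyclic references are not handled.

```python
# Hypothetical sketch: simplify, rebuild and Point are illustrative names,
# and __simplify__/__rebuild__ stand in for the hooks proposed in the thread.

def simplify(obj):
    """Reduce obj to ints, strs and (type, state) tuples, recursively."""
    if isinstance(obj, (int, str)):
        return obj  # building blocks pass through unchanged
    if isinstance(obj, float):
        return (float, repr(obj))  # floats are expressed as strings
    if isinstance(obj, complex):
        return (complex, (simplify(obj.real), simplify(obj.imag)))
    if type(obj) in (tuple, list, set):
        return (type(obj), tuple(simplify(item) for item in obj))
    if type(obj) is dict:
        return (dict, tuple((simplify(k), simplify(v)) for k, v in obj.items()))
    hook = getattr(obj, "__simplify__", None)
    if hook is None:
        raise TypeError("can't be simplified")
    cls, state = hook()
    return (cls, simplify(state))  # the state itself is simplified further


def rebuild(simple):
    """Inverse of simplify: turn the simplified form back into objects."""
    if isinstance(simple, (int, str)):
        return simple
    cls, state = simple
    if cls is float:
        return float(state)
    if cls is complex:
        real, imag = state
        return complex(rebuild(real), rebuild(imag))
    if cls in (tuple, list, set):
        return cls(rebuild(item) for item in state)
    if cls is dict:
        return {rebuild(k): rebuild(v) for k, v in state}
    return cls.__rebuild__(rebuild(state))  # user-defined types rebuild themselves


class Point:
    """Illustrative user-defined type that takes part in the protocol."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __simplify__(self):
        return type(self), (self.x, self.y)

    @classmethod
    def __rebuild__(cls, state):
        return cls(*state)


value = {"points": [Point(1, 2.5)], "z": 3 + 4j, "tags": {1, 2}}
restored = rebuild(simplify(value))
```

A serializer built on top of this would only need to know how to write ints, strings, tuples and type references as bytes; the per-type knowledge stays in the hooks.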

there's a bug in the copy function that i wanted to fix:

    def copy(obj):
        cls, state = obj.__getstate__()
        return cls.__setstate__(state)

also, there's a reason why __getstate__ returns "(cls, state)" rather than just "state", and that's to keep things agile. i don't want to be necessarily tightly-coupled to a certain type.

the cls will be used to reconstruct the object (cls.__setstate__), much like the function returned by __reduce__, so normally, __getstate__ would just return self.__class__ for cls.

but there are times, especially when object proxies are involved, that we may want to "lie" about the actual type, i.e., use a different type to reconstruct the object with.

here's an example that shows why:

    class ListProxy:
        ...
        def __getstate__(self):
            return list, self._value

instances of ListProxy, when stored in a file (e.g., a shelf), want to be pickled by value. moreover, when they are loaded from a file, they want to be loaded as actual lists, not proxies, as the proxied object is long lost.

so returning a tuple of (cls, state) gives more freedom to frameworks and other utilities. of course, again, most code would just return (self.__class__, state)

-tomer
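To illustrate the "(cls, state)" point, here is a small runnable sketch of a proxy that "lies" about its type. The hook names __simplify__ and __rebuild__ are hypothetical stand-ins for the revised __getstate__/__setstate__ (chosen to avoid clashing with today's pickle protocol), Account is an invented example class, and the fallback of calling cls(state) for builtins such as list is an assumption; the proposal itself would give builtins a real rebuild hook.

```python
class ListProxy:
    """Stand-in for a proxy wrapping a list that lives elsewhere."""

    def __init__(self, value):
        self._value = value

    def __simplify__(self):
        # "Lie" about the type: ask to be rebuilt as a plain list.
        return list, self._value


class Account:
    """Ordinary class: reports its own type, as most code would."""

    def __init__(self, owner, balance):
        self.owner = owner
        self.balance = balance

    def __simplify__(self):
        return type(self), (self.owner, self.balance)

    @classmethod
    def __rebuild__(cls, state):
        return cls(*state)


def copy(obj):
    """Copy obj by simplifying it and rebuilding from the reported type."""
    cls, state = obj.__simplify__()
    rebuild = getattr(cls, "__rebuild__", None)
    if rebuild is None:
        # Assumption: builtins without the hook accept their state directly.
        return cls(state)
    return rebuild(state)


original = [1, 2, 3]
copied_list = copy(ListProxy(original))
copied_account = copy(Account("tomer", 10))
```

The copy of the proxy comes back as a real list, detached from the proxied object, while the Account copy keeps its own class.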

On 1/25/07, Collin Winter <collinw@gmail.com> wrote:
How will e.g. classes be simplified? Can I simplify a dictionary with function objects for values?
well, pickle just saves them as a global name (modulename.classname). so types would generally just return themselves as primitives, and let the actual simplifier do the trick. it may choose to save the type's dict, or just a global name. that's up to the serializer-dependent simplifier.

it's good you mentioned that, because it reminded me of something i forgot. for instance, code objects will be serialized by value, so you could actually pickle functions and classes.

this means pyc files could become just a pickle of the module, i.e.:

    import pickle
    import foo
    pickle.dump(foo, open("foo.pyc", "wb"))

but again, that's up to the serializer. an enhanced pickle could do that.

-tomer

On 1/25/07, tomer filiba <tomerfiliba@gmail.com> wrote:
Are you intending to simplify code objects to the co_* attributes? co_code is a string of interpreter-specific bytecode, which would be pretty much useless outside of a copy() function. Whatever the scheme, it will need to take into account cell objects and globals. Collin Winter
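The concern about globals can be seen with today's standard library. The sketch below uses marshal, which already serializes code objects by value (in an interpreter-version-specific format), and types.FunctionType to rebuild a function around the restored code. The names SCALE, scaled, with_globals and without_globals are invented for the example; this is only an illustration of the problem, not a suggestion for how the proposed API would handle it.

```python
import marshal
import types

SCALE = 10


def scaled(x):
    # Refers to a module-level global, which is not part of the code object.
    return x * SCALE


# Serialize the code object by value and restore it.
payload = marshal.dumps(scaled.__code__)
code = marshal.loads(payload)

# The rebuilt function only works if suitable globals are supplied again.
with_globals = types.FunctionType(code, {"SCALE": 10})
without_globals = types.FunctionType(code, {})

result = with_globals(3)
try:
    without_globals(3)
    missing_global_detected = False
except NameError:
    missing_global_detected = True
```

The bytecode round-trips, but the function is only usable once the names it looks up (and, for closures, its cell objects) are provided by whoever rebuilds it.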

"tomer filiba" <tomerfiliba@gmail.com> wrote:
I presume you mean...

    def copy(obj):
        typ, state = obj.__getstate__()
        return typ.__setstate__(state)

Overall, I like the idea; I'm a big fan of simplifying object persistence and/or serialization. A part of me also likes how the objects can choose to lie about their types.

But another part of me says: the basic objects that you specified already have a format that is unambiguous, repr(obj). They also are able to be reconstructed from their component parts via eval(repr(obj)), or even via the 'unrepr' function in the ConfigObj module. It doesn't handle circular references.

Even better, it has 3 native representations: repr(a).encode('zlib'), repr(a), pprint.pprint(a); each offering a different amount of user readability. I digress.

I believe the biggest problem with the proposal, as specified, is that changing the semantics of __getstate__ and __setstate__ is a bad idea. Add a new pair of methods and ask the twisted people what they think. My only criticism will then be the strawman repr/unrepr.

- Josiah
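The repr round trip and its limit with circular references can be checked quickly. The sketch below uses ast.literal_eval from the standard library as a safe stand-in for eval or ConfigObj's unrepr (that substitution is an assumption made for the example), and compares it with pickle, whose memo preserves the cycle. The variable names are illustrative.

```python
import ast
import pickle

# Plain builtin values survive a repr -> literal_eval round trip.
settings = {"name": "demo", "ports": [80, 443], "ratio": 0.5, "pair": (1, None)}
round_tripped = ast.literal_eval(repr(settings))

# A list that contains itself: repr() abbreviates the cycle as "[...]".
cyclic = [1]
cyclic.append(cyclic)
text = repr(cyclic)

try:
    restored = ast.literal_eval(text)
except ValueError:
    restored = None  # some Python versions reject the "..." outright

# Either way, the self-reference does not come back.
cycle_survived_repr = restored is not None and restored[1] is restored

# pickle remembers already-visited objects, so the cycle is preserved.
unpickled = pickle.loads(pickle.dumps(cyclic))
cycle_survived_pickle = unpickled[1] is unpickled
```

This is the gap between a readable textual format and a serializer that tracks object identity.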

On 1/25/07, Josiah Carlson <jcarlson@uci.edu> wrote:
well, repr is fine for most simple things, but you don't use repr to serialize objects, right? it's not powerful/introspective enough. besides, repr is meant to be readable, while __getstate__ can return any object. imagine this:

    class complex:
        def __repr__(self):
            return "(%f+%fj)" % (self.real, self.imag)
        def __getstate__(self):
            return self.__class__, (self.real, self.imag)

repr is made for humans of course, while serialization is made for machines. they serve different purposes, so they need different APIs.
you may have digressed, but that's a good point -- that's exactly why i do NOT specify how objects are encoded as a stream of bytes. all i'm after is the state of the object (which is expressed in terms of other, more primitive objects). you can think of repr as a textual serializer to some extent, that can use the proposed __getstate__ API. pprint is yet another form of serializer.
i'll try to come up with new names... but i don't have any ideas at the moment. -tomer

On 1/25/07, tomer filiba <tomerfiliba@gmail.com> wrote:
The "__getstate__" and "__setstate__" names don't really work for me either, especially since __setstate__ creates a new object, as opposed to changing the state of an existing object. Since this proposal is all about simplification, how about something like "__simplify__" and "__expand__", respectively? Collin Winter

On 1/26/07, Collin Winter <collinw@gmail.com> wrote:
well, i like __simplify__, but __expand__ seems wrong to me. some other suggestions i looked up:

    __complicate__ (:-))
    __rebuild__
    __reconstruct__ (too long?)
    __restore__ (but that's semantically the same as __setstate__)

i'd vote for __rebuild__. any other ideas?

-tomer

On Fri, Jan 26, 2007 at 12:41:36AM +0200, tomer filiba wrote:
    __(un)pickle__
    __(de)serialize__

Oleg.
--
Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.

"tomer filiba" <tomerfiliba@gmail.com> wrote:
I use repr to serialize objects all the time. ConfigObj is great when I want to handle python-based configuration information, and/or I don't want to worry about the security implications of 'eval(arbitrary string)', or 'import module'. With a proper __repr__ method, I can even write towards your API:

    class mylist(object):
        def __repr__(self):
            state = ...
            return 'mylist.__setstate__(%r)'%(state,)
I would use 'return "(%r+%rj)"% (self.real, self.imag)', but it doesn't much matter.
I happen to disagree. The only reason to use a different representation or API is if there are size and/or performance benefits to offering a machine readable vs. human readable format. I know that there are real performance advantages to using (c)Pickle over repr/unrepr, but I also use repr/unrepr so that I can change settings with notepad (as has been necessary on occasion).
Right, but as 'primitive objects' go, you can't get significantly more primitive than producing a string that can be naively understood by someone familiar with Python *and* the built-in Python parser. Never mind that it works *today* with all of the types you specified earlier (with the exception of file objects - which you discover on parsing/reproducing the object).
Well, pprint is more or less a pretty repr.
Like Collin, I also like __rebuild__. - Josiah

On 1/25/07, Josiah Carlson <jcarlson@uci.edu> wrote:
You use pickle because it's more general than repr/unrepr, not because it's faster or the result is smaller. Assuming you're talking about the ConfigObj's unrepr mode (http://www.voidspace.org.uk/python/configobj.html#unrepr-mode),

"""
The types that unrepr can work with are :

    strings, lists tuples
    None, True, False
    dictionaries, integers, floats
    longs and complex numbers

You can't store classes, types or instances.
"""

Pickle -- and the simplification API Tomer is proposing -- is far more general than that. repr/unrepr is in no way a substitute.

Collin Winter

"Collin Winter" <collinw@gmail.com> wrote:
*I* use pickle when I want speed. I use repr and unrepr (from configobj) when I want to be able to change things by hand. None of the objects I transfer between processes/disk/whatever are ever more than base Python types. As for whatever *others* do, I don't know, I haven't done a survey. They may do as you say.
Pickle -- and the simplification API Tomer is proposing -- is far more general than that. repr/unrepr is in no way a substitute.
I never claimed it was a substitute. I stated quite clearly up front: "My only criticism will then be the strawman repr/unrepr." The alternate strawman is repr/eval, which can be made to support basically everything except for self-referential objects. It still has that horrible security issue, but that also exists in pickle, and could still exist in the simplification/rebuilding routines specified by Tomer.

- Josiah

On 1/26/07, Josiah Carlson <jcarlson@uci.edu> wrote:
well, since this proposal originated from an RPC point-of-view, i, for one, also need to transfer full-blown user-defined objects and classes. for that, i need to be able to get the full state of the object, and the means to reconstruct it later. pickling the "primitives" was never a problem, because they have a well defined interface. although cyclic references call for something stronger than repr... and this is what i mean by "repr is for humans".
well, pickle is unsafe for one main reason -- it's omnipotent. it performs arbitrary imports and object instantiation, which are equivalent to eval of arbitrary strings. BUT, it has nothing to do with the way it gets or sets the *state* of objects.

to solve that, we can have a "capability-based pickle": dumping objects was never a security issue, it's the loading part that's dangerous. we can add a new function, loadsec(), that takes both the string to load and a set of classes it may use to __rebuild__ the object.

    capabilities = {"list" : list, "str" : str, "os.stat_result" : os.stat_result}
    loadsec(data, capabilities)

that way, you can control the objects that will be instantiated, which you trust, so no arbitrary code may be executed behind the scenes. for the "classic" unsafe load(), we can pass a magical dict-like thing that imports names via __getitem__.

if i had a way to control what pickle.loads has access to, i wouldn't need to write my own serializer..
http://sebulbasvn.googlecode.com/svn/trunk/rpyc/core/marshal/brine.py
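Something close to the loadsec() idea can be sketched on top of the pickle module as it exists in Python 3, by subclassing pickle.Unpickler and overriding find_class(), which is the documented hook for restricting which globals a pickle may load. The class name CapabilityUnpickler, the loadsec function itself and the "module.name" key format are assumptions made for this example. Plain lists and strings never reach find_class, so only classes looked up by name need to appear in the capabilities mapping.

```python
import io
import pickle
from collections import OrderedDict
from fractions import Fraction


class CapabilityUnpickler(pickle.Unpickler):
    """Unpickler that may only instantiate classes listed in `capabilities`."""

    def __init__(self, file, capabilities):
        super().__init__(file)
        self._capabilities = capabilities

    def find_class(self, module, name):
        # Called whenever the pickle stream refers to a global by name.
        key = f"{module}.{name}"
        try:
            return self._capabilities[key]
        except KeyError:
            raise pickle.UnpicklingError(f"{key} is not an allowed capability") from None


def loadsec(data, capabilities):
    """Hypothetical loadsec(): load `data` using only the given classes."""
    return CapabilityUnpickler(io.BytesIO(data), capabilities).load()


capabilities = {"fractions.Fraction": Fraction}

# A trusted class loads normally.
allowed = loadsec(pickle.dumps([Fraction(1, 3), "text", 42]), capabilities)

# A class outside the mapping is refused instead of being imported.
try:
    loadsec(pickle.dumps(OrderedDict(a=1)), capabilities)
    refused = False
except pickle.UnpicklingError:
    refused = True
```

This only limits which classes can be reached by name; whether those classes are themselves safe to rebuild from untrusted state is still up to whoever grants the capability.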
but that's not the issue. when i send my objects across a socket back and forth, i want something that is fast, compact, and safe. i don't care for anyone sniffing my wire to "easily understand" what i'm sending... i mean, it's not like i encrypt the data, but readability doesn't count here. again, __simplify__ and __rebuild__ offer a mechanism for serializer-implementors, which may choose different encoding schemes for their internal purposes. this API isn't meant for the end user. i'll write a pre-pep to try to clarify it all in an orderly manner. -tomer

participants (4)
- Collin Winter
- Josiah Carlson
- Oleg Broytmann
- tomer filiba