simplification pep-like thing
this text needs more thought and rephrasing, but i think that it covers all the details.

--------------------------------------------------------------------------

Abstract
========

The simplification API provides a mechanism to get the state of an object, as well as to reconstruct an object with a given state. This new API would affect the pickle, copy_reg, and copy modules, as well as the writers of serializers.

The idea is to separate the state-gathering from the encoding:

* State-gathering is getting the state of the object
* Encoding is converting the state (which is an object) into a sequence of bytes (textual or binary)

This is somewhat similar to the ISerializable interface of .NET, which defines only GetObjectData(); the actual encoding is done by a Formatter class, which has different implementations for SOAP-formatting, binary-formatting, and other formats.

Motivation
==========

There are many generic and niche serializers out there, including pickle, banana/jelly, Cerealizer, brine, and lots of serializers targeting XML output. Currently, all serializers have their own methods of retrieving the contents, or state, of the object. This API attempts to solve this issue by providing a standard means for getting object state, and for creating objects with their state restored.

Another issue is making the serialization process "proxy-friendly". Many frameworks use object proxies to indirectly refer to another object (for instance, RPC proxies, FFI, etc.). In this case, it's desirable to simplify the referenced object rather than the proxy, and this API addresses this issue too.

Simplification
==============

Simplification is the process of converting a 'complex' object into its "atomic" components. You may think of these atomic components as the *contents* of the object.

This proposal does not state what "atomic" means -- this is open to the decision of the class. The only restriction imposed is that collections of any kind must be simplified as tuples.
Moreover, the simplification process may be recursive: object X may simplify itself in terms of object Y, which in turn may undergo further simplification.

Simplification Protocol
=======================

This proposal introduces two new special methods:

    def __simplify__(self):
        return type, state

    @classmethod
    def __rebuild__(cls, state):
        return new_instance_of_cls_with_given_state

__simplify__ takes no arguments (bar 'self'), and returns a tuple of '(type, state)', representing the contents of 'self':

* 'type' is expected to be a class or a builtin type, although it can be any object that has a '__rebuild__' method. This 'type' will be used later to reconstruct the object (using 'type.__rebuild__(state)').
* 'state' is any other object that represents the inner state of 'self', in a simplified form. This can be an atomic value, or yet another complex object that may undergo further simplification.

__rebuild__ is expected to be a classmethod that takes the state returned by __simplify__, and returns a new instance of 'cls' with the given state.

If a type does not wish to be simplified, it may raise a TypeError in its __simplify__ method; however, this is not recommended. Types that want to be treated as atomic elements, such as file, should just return themselves, and let the serializer handle them.

Default Simplification
======================

All the built-in types would grow __simplify__ and __rebuild__ methods, which would follow these guidelines:

Primitive types (int, str, float, ...) are considered atomic.

Composite types (I think 'complex' is the only type) are broken down into their components. For the complex type, that would be a tuple of (real, imaginary).

Container types (tuples, lists, sets, dicts) represent themselves as tuples of items. For example, dicts would be simplified according to this pseudocode:

    def PyDict_Simplify(PyObject * self):
        return PyDictType, tuple(self.items())

Built-in types would be considered atomic.
User-defined classes can be simplified into their metaclass, __bases__, and __dict__.

The type 'object' would simplify instances by returning their __dict__ and any __slots__ the instance may have. This is the default behavior; classes that desire a different behavior would override __simplify__ and __rebuild__.

Example of default behavior:

    >>> class Foo(object):
    ...     def __init__(self):
    ...         self.a = 5
    ...         self.b = "spam"
    ...
    >>> f = Foo()
    >>> cls, state = f.__simplify__()
    >>> cls
    <class '__main__.Foo'>
    >>> state
    {"a" : 5, "b" : "spam"}
    >>> shallow_copy = cls.__rebuild__(state)
    >>> state.__simplify__()
    (<type 'dict'>, (("a", 5), ("b", "spam")))

Example of customized behavior:

    >>> class Bar(object):
    ...     def __init__(self):
    ...         self.a = 5
    ...         self.b = "spam"
    ...     def __simplify__(self):
    ...         return Bar, 17.5
    ...     @classmethod
    ...     def __rebuild__(cls, state):
    ...         self = cls.__new__(cls)
    ...         if state == 17.5:
    ...             self.a = 5
    ...             self.b = "spam"
    ...         return self
    ...
    >>> b = Bar()
    >>> b.__simplify__()
    (<class '__main__.Bar'>, 17.5)

Code objects
============

I wish that modules, classes and functions would also be simplifiable; however, there are some issues with that:

* How to serialize code objects? These can be simplified as a tuple of their co_* attributes, but these attributes are very implementation-specific.
* How to serialize cell variables, or other globals?

It would be nice if .pyc files were generated like so:

    import foo
    pickle.dump(foo)

It would also allow sending of code between machines, just like any other object.

Copying
=======

Shallow copy, as well as deep copy, can be implemented using the semantics of this new API. The copy module should be rewritten accordingly:

    def copy(obj):
        cls, state = obj.__simplify__()
        return cls.__rebuild__(state)

deepcopy() can be implemented similarly.

Deprecation
===========

With the proposed API, copy_reg, __reduce__, __reduce_ex__, and possibly other modules and methods become deprecated.
Apart from that, the pickle and copy modules need to be updated accordingly.

C API
=====

The proposal introduces two new C-API functions:

    PyObject * PyObject_Simplify(PyObject * self);
    PyObject * PyObject_Rebuild(PyObject * type, PyObject * state);

This is only a suggestion, though. I'd like to hear ideas from someone with more experience in the core.

I do not see a need for a convenience routine such as simplify(obj) <--> obj.__simplify__(), since this API is not designed for everyday usage. This is the case with __reduce__ today.

Object Proxying
===============

Because __simplify__ returns '(type, state)', it may choose to "lie" about its actual type. This means that when the state is reconstructed, another type is used. Object proxies will use this mechanism to serialize the referred object, rather than the proxy.

    class Proxy(object):
        [...]
        def __simplify__(self):
            # this returns a tuple with the *real* type and state
            return self.value.__simplify__()

Serialization
=============

Serialization is the process of converting fully-simplified objects into byte sequences (strings). Fully simplified objects are created by a recursive simplifier, which simplifies the entire object graph into atomic components. Then, the serializer would convert the atomic components into strings.

Note that this proposal does not define how atomic objects are to be converted to strings, or how a 'recursive simplifier' should work. These issues are to be resolved by the implementation of the serializer.

For instance, file objects are atomic; one serializer may be able to handle them, by storing them as (filename, file-mode, file-position), while another may not be, so it would raise an exception.
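
The point about file objects can be sketched with two encoders that disagree on which atoms they accept. This is only an illustration: the function names `encode_atom_basic` and `encode_atom_with_files`, and the use of repr() as the byte encoding, are assumptions made for the sketch and are not part of the proposal.

```python
import io


def encode_atom_basic(value):
    """Hypothetical encoder that understands only a few primitive atoms."""
    if isinstance(value, (int, float, str)):
        return repr(value).encode("utf-8")
    # Any other atom, including a file object, is rejected.
    raise TypeError(f"cannot serialize atom of type {type(value).__name__}")


def encode_atom_with_files(value):
    """Hypothetical encoder that also accepts files opened with open().

    A file is stored as (filename, file-mode, file-position); every other
    value is passed on to the basic encoder.
    """
    if isinstance(value, io.IOBase):
        # Assumes the object has .name and .mode, as files from open() do.
        return repr((value.name, value.mode, value.tell())).encode("utf-8")
    return encode_atom_basic(value)
```

Both encoders agree on ints, floats and strings; only the second accepts an open file, while the first raises TypeError for it. Neither choice is mandated by the proposal.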
Recursive Simplifier
====================

This code demonstrates the general idea of how recursive simplifiers may be implemented:

    def recursive_simplifier(obj):
        cls, state = obj.__simplify__()

        # simplify all the elements inside tuples
        if type(state) is tuple:
            nested_state = []
            for item in state:
                nested_state.append(recursive_simplifier(item))
            # collections must be simplified as tuples
            return cls, tuple(nested_state)

        # see if the object is atomic; if not, dig deeper
        if (cls, state) == state.__simplify__():
            # 'state' is an atomic object, no need to go further
            return cls, state
        else:
            # this object is not atomic, so dig deeper
            return cls, recursive_simplifier(state)

-tomer
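
The recursive simplifier in the proposal cannot run on any released Python, because the built-in types have no __simplify__ method. A minimal runnable sketch is possible if a free function stands in for that method. The names `simplify` and `ATOMIC_TYPES`, and the choice of which types count as atomic, are assumptions made for this sketch; the proposal deliberately leaves "atomic" undefined.

```python
# Hypothetical emulation: no released Python gives built-in types a
# __simplify__ method, so the free function simplify() stands in for it.
ATOMIC_TYPES = (int, float, str, bytes, type(None))


def simplify(obj):
    """Perform one simplification step and return (type, state)."""
    if hasattr(obj, "__simplify__"):
        return obj.__simplify__()             # class-specific override
    if isinstance(obj, ATOMIC_TYPES):
        return type(obj), obj                 # atoms simplify to themselves
    if isinstance(obj, complex):
        return complex, (obj.real, obj.imag)
    if isinstance(obj, dict):
        return dict, tuple(obj.items())
    if isinstance(obj, (tuple, list, set, frozenset)):
        return type(obj), tuple(obj)
    if hasattr(obj, "__dict__"):
        return type(obj), dict(obj.__dict__)  # default for plain instances
    raise TypeError(f"cannot simplify {type(obj).__name__!r} objects")


def recursive_simplifier(obj):
    """Simplify obj all the way down to atoms (no cycle detection)."""
    cls, state = simplify(obj)
    if type(state) is tuple:
        # Collections stay tuples at every level.
        return cls, tuple(recursive_simplifier(item) for item in state)
    if simplify(state) == (cls, state):
        return cls, state                     # atomic, stop here
    return cls, recursive_simplifier(state)


class Foo:
    def __init__(self):
        self.a = 5


print(recursive_simplifier((1, 2.5)))
print(recursive_simplifier(Foo()))
```

The sketch does not handle cyclic object graphs or __slots__, and it makes no attempt at rebuilding; it only shows how the atomic test and the tuple rule interact.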
On Fri, Jan 26, 2007, tomer filiba wrote:
> Default Simplification
> ======================
>
> All the built-in types would grow __simplify__ and __rebuild__ methods, which would follow these guidelines:
>
> Primitive types (int, str, float, ...) are considered atomic.
>
> Composite types (I think 'complex' is the only type) are broken down into their components. For the complex type, that would be a tuple of (real, imaginary).
>
> Container types (tuples, lists, sets, dicts) represent themselves as tuples of items. For example, dicts would be simplified according to this pseudocode:
>
>     def PyDict_Simplify(PyObject * self):
>         return PyDictType, tuple(self.items())
>
> Built-in types would be considered atomic. User-defined classes can be simplified into their metaclass, __bases__, and __dict__.

This seems to contradict the following:

> For instance, file objects are atomic; one serializer may be able to handle them, by storing them as (filename, file-mode, file-position), while another may not be, so it would raise an exception.

Where do files fit into all this?

--
Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

"I disrespectfully agree." --SJM
Aahz:
>> Built-in types would be considered atomic. User-defined classes can be simplified into their metaclass, __bases__, and __dict__.
>
> This seems to contradict the following:
>
>> For instance, file objects are atomic; one serializer may be able to handle them, by storing them as (filename, file-mode, file-position), while another may not be, so it would raise an exception.
>
> Where do files fit into all this?

it doesn't... it was just meant to show how different serializers may implement their converters for the atomic values. it's not a practical suggestion.

-tomer
On 1/26/07, tomer filiba <tomerfiliba@gmail.com> wrote:
> This proposal does not state what "atomic" means -- this is open to the decision of the class. The only restriction imposed is, collections of any kind must be simplified as tuples.

It is worth stating the minimum that a compliant serializer must be able to treat atomically. For instance, is it always sufficient to reduce state to instances of (builtin, not subclasses of) string, tuple, float, or int?

> Simplification Protocol
> =======================
>
> This proposal introduces two new special methods:
>
>     def __simplify__(self):
>         return type, state

If I understand correctly, this returns the actual type object, rather than its name, or the source code to rebuild that type. Are there any requirements on what sort of type objects the serializer must be able to support?

> I do not see a need for a convenience routine such as simplify(obj) <--> obj.__simplify__(), since this API is not designed for everyday usage. This is the case with __reduce__ today.

reduce is nasty for beginners. In fairness, I think much of the problem is that reduce and __reduce__ (as well as __reduce_ex__) both exist, but are unrelated. So adding simplify probably isn't required, but reserving the name might be.

> Note that this proposal does not define how atomic objects are to be converted to strings, or how a 'recursive simplifier' should work. These issues are to be resolved by the implementation of the serializer.

Is there some minimum requirement on is-stability? For example, would the following relationships be preserved after deserialization?

    >>> a1 = object()
    >>> a2 = a1
    >>> b = object()
    >>> a1 is a2
    True
    >>> a1 is b
    False

-jJ
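
The proposal itself leaves the is-stability question open. As a reference point only, the existing pickle module preserves identity among objects written in the same dumps() call, and loses it across separate calls. The sketch below shows that behavior; lists stand in for the object() instances of the question, which is an assumption made to keep the example simple.

```python
import pickle

a1 = []
a2 = a1      # second name for the same list
b = []       # a distinct, equal list

# One dumps() call for the whole group: pickle memoizes objects it has
# already written, so the shared reference survives the round trip.
r1, r2, rb = pickle.loads(pickle.dumps((a1, a2, b)))
print(r1 is r2)   # True
print(r1 is rb)   # False

# Dumped separately, the link between a1 and a2 is lost.
s1 = pickle.loads(pickle.dumps(a1))
s2 = pickle.loads(pickle.dumps(a2))
print(s1 is s2)   # False
```

Under the proposed split, this bookkeeping would presumably belong to the recursive simplifier or the serializer rather than to __simplify__ itself, since a single __simplify__ call sees only one object.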
participants (3)

* Aahz
* Jim Jewett
* tomer filiba