[Python-ideas] simplification pep-like thing

tomer filiba tomerfiliba at gmail.com
Fri Jan 26 16:13:12 CET 2007


this text needs more thought and rephrasing, but i think that
it covers all the details.

--------------------------------------------------------------------------

Abstract
=========
The simplification API provides a mechanism to get the state
of an object, as well as to reconstruct an object from a given state.
This new API would affect the pickle, copy_reg, and copy modules,
as well as the writers of serializers.

The idea is to separate the state-gathering from the encoding:
* State-gathering is getting the state of the object
* Encoding is converting the state (which is an object) into
  a sequence of bytes (textual or binary)
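
As a rough illustration of this separation, the sketch below lets one class
gather its state once and lets two unrelated encoders consume it. The Point
class and both encoder functions are hypothetical names invented for this
example; they are not part of the proposal.

```python
import json


class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __simplify__(self):
        # State-gathering: report the type and a tuple of plain values.
        return Point, (self.x, self.y)

    @classmethod
    def __rebuild__(cls, state):
        x, y = state
        return cls(x, y)


def encode_json(obj):
    # Encoding, variant 1: a JSON document (hypothetical format).
    cls, state = obj.__simplify__()
    return json.dumps({"type": cls.__name__, "state": list(state)})


def encode_text(obj):
    # Encoding, variant 2: a compact textual form (hypothetical format).
    cls, state = obj.__simplify__()
    return "%s:%s" % (cls.__name__, ",".join(str(item) for item in state))


p = Point(1, 2)
json_form = encode_json(p)
text_form = encode_text(p)
```

Neither encoder needs to know how Point stores its attributes; both rely
only on the '(type, state)' pair.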

This is somewhat similar to the ISerializable interface of .NET,
which defines only GetObjectData(); the actual encoding is done
by a Formatter class, which has different implementations for
SOAP-formatting, binary-formatting, and other formats.

Motivation
==========
There are many generic and niche serializers out there, including
pickle, banana/jelly, Cerealizer, brine, and lots of serializers
targeting XML output. Currently, every serializer has its own
method of retrieving the contents, or state, of an object.

This API attempts to solve this issue by providing standard
means for getting object state, and creating objects with
their state restored.

Another issue is making the serialization process "proxy-friendly".
Many frameworks use object proxies to indirectly refer to another
object (for instance, RPC proxies, FFI, etc.). In this case, it's desirable
to simplify the referenced object rather than the proxy, and this API
addresses this issue too.

Simplification
==============
Simplification is the process of converting a 'complex' object
into its "atomic" components. You may think of these atomic
components as the *contents* of the object.

This proposal does not state what "atomic" means -- this is
open to the decision of the class. The only restriction imposed is
that collections of any kind must be simplified as tuples.

Moreover, the simplification process may be recursive: object X
may simplify itself in terms of object Y, which in turn may undergo
further simplification.

Simplification Protocol
=======================
This proposal introduces two new special methods:

    def __simplify__(self):
        return type, state

    @classmethod
    def __rebuild__(cls, state):
        return new_instance_of_cls_with_given_state

__simplify__ takes no arguments (bar 'self'), and returns a tuple
of '(type, state)', representing the contents of 'self':
* 'type' is expected to be a class or a builtin type, although
  it can be any object that has a '__rebuild__' method.
  This 'type' will be used later to reconstruct the object
  (using 'type.__rebuild__(state)').
* 'state' is any other object that represents the inner state of 'self',
  in a simplified form. This can be an atomic value, or yet another
  complex object, that may undergo further simplification.

__rebuild__ is expected to be a classmethod that takes the
state returned by __simplify__, and returns a new instance of
'cls' with the given state.

If a type does not wish to be simplified, it may raise a TypeError
in its __simplify__ method; however, this is not recommended.
Types that want to be treated as atomic elements, such as file,
should just return themselves, and let the serializer handle them.
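
A minimal sketch of both options follows. It assumes that "return
themselves" means returning '(type(self), self)', which is one possible
reading of the text; Handle, Unsimplifiable and is_atomic are hypothetical
names used only for illustration.

```python
class Handle(object):
    """Hypothetical atomic type, standing in for something like a file."""

    def __init__(self, descriptor):
        self.descriptor = descriptor

    def __simplify__(self):
        # Atomic: the state is the object itself, left for the serializer.
        return type(self), self

    @classmethod
    def __rebuild__(cls, state):
        # The state is already an instance, so hand it back unchanged.
        return state


class Unsimplifiable(object):
    """Hypothetical type that refuses simplification (not recommended)."""

    def __simplify__(self):
        raise TypeError("%s cannot be simplified" % type(self).__name__)


def is_atomic(obj):
    # An object counts as atomic here when it reports itself as its state.
    cls, state = obj.__simplify__()
    return state is obj
```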

Default Simplification
======================
All the built-in types would grow __simplify__ and __rebuild__
methods, which would follow these guidelines:

Primitive types (int, str, float, ...) are considered atomic.

Composite types (I think 'complex' is the only one) are
broken down into their components. For the complex type,
that would be a tuple of (real, imaginary).

Container types (tuples, lists, sets, dicts) represent themselves
as tuples of items. For example, dicts would be simplified
according to this pseudocode:

    def PyDict_Simplify(PyObject * self):
        return PyDictType, tuple(self.items())
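
Since today's built-in types have no __simplify__ method, these guidelines
can only be emulated from the outside. The sketch below does that with two
hypothetical helper functions, simplify_builtin and rebuild_builtin; it is
an approximation of the guidelines, not the proposed implementation.

```python
CONTAINERS = (tuple, list, set, frozenset)


def simplify_builtin(obj):
    cls = type(obj)
    if cls is complex:
        # Composite type: broken down into (real, imaginary).
        return cls, (obj.real, obj.imag)
    if cls is dict:
        # Dicts are represented as a tuple of (key, value) items.
        return cls, tuple(obj.items())
    if cls in CONTAINERS:
        # Other containers are represented as a tuple of their items.
        return cls, tuple(obj)
    # Everything else (int, str, float, ...) is treated as atomic.
    return cls, obj


def rebuild_builtin(cls, state):
    if cls is complex:
        return complex(*state)
    if cls is dict or cls in CONTAINERS:
        return cls(state)
    return state
```

For instance, simplify_builtin({"a": 5}) yields the dict type paired with a
one-item tuple of items, and rebuild_builtin turns that pair back into an
equal dict.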

Built-in types themselves (the type objects, such as int or dict) would
be considered atomic. User-defined classes can be simplified into their
metaclass, __bases__, and __dict__.

The type 'object' would simplify instances by returning their
__dict__ and any __slots__ the instance may have. This is
the default behavior; classes that desire a different behavior
would override __simplify__ and __rebuild__.
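
A rough emulation of this default in pure Python might look like the
following. DefaultSimplify is a hypothetical base class standing in for
'object'; merging slot values and __dict__ into a single dict is an
assumption of this sketch, and private (name-mangled) slots are ignored.

```python
class DefaultSimplify(object):
    """Hypothetical base class emulating the proposed default of 'object'."""
    __slots__ = ()

    def __simplify__(self):
        state = {}
        # Gather values of slots declared anywhere in the class hierarchy.
        for klass in type(self).__mro__:
            slots = klass.__dict__.get("__slots__", ())
            if isinstance(slots, str):
                slots = (slots,)
            for name in slots:
                if name not in ("__dict__", "__weakref__") and hasattr(self, name):
                    state[name] = getattr(self, name)
        # Add the instance __dict__, if the instance has one.
        state.update(getattr(self, "__dict__", {}))
        return type(self), state

    @classmethod
    def __rebuild__(cls, state):
        # Bypass __init__ and restore the attributes directly.
        self = cls.__new__(cls)
        for name, value in state.items():
            setattr(self, name, value)
        return self


class Foo(DefaultSimplify):
    def __init__(self):
        self.a = 5
        self.b = "spam"


class Point(DefaultSimplify):
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y
```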

Example of default behavior:
    >>> class Foo(object):
    ...     def __init__(self):
    ...         self.a = 5
    ...         self.b = "spam"
    ...
    >>> f = Foo()
    >>> cls, state = f.__simplify__()
    >>> cls
    <class '__main__.Foo'>
    >>> state
    {'a': 5, 'b': 'spam'}
    >>> shallow_copy = cls.__rebuild__(state)
    >>> state.__simplify__()
    (<type 'dict'>, (('a', 5), ('b', 'spam')))

Example of customized behavior:
    >>> class Bar(object):
    ...     def __init__(self):
    ...         self.a = 5
    ...         self.b = "spam"
    ...     def __simplify__(self):
    ...         return Bar, 17.5
    ...     @classmethod
    ...     def __rebuild__(cls, state):
    ...         self = cls.__new__(cls)
    ...         if state == 17.5:
    ...             self.a = 5
    ...             self.b = "spam"
    ...         return self
    ...
    >>> b = Bar()
    >>> b.__simplify__()
    (<class '__main__.Bar'>, 17.5)


Code objects
=============
I wish that modules, classes and functions would also be simplifiable.
However, there are some issues with that:
* How to serialize code objects? These can be simplified as a tuple of
  their co_* attributes, but these attributes are very
  implementation-specific.
* How to serialize cell variables, or other globals?
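
To give a feel for the first issue, the sketch below gathers a handful of
co_* attributes from a function's code object on a current Python 3
interpreter. simplify_code is a hypothetical helper, and the attribute
selection is arbitrary; the full set of co_* attributes, and the contents of
co_code in particular, differ between interpreter versions.

```python
def simplify_code(code):
    """Hypothetical: describe a code object by a few of its co_* attributes."""
    names = ("co_name", "co_argcount", "co_varnames", "co_names",
             "co_consts", "co_code", "co_filename")
    return type(code), tuple((name, getattr(code, name)) for name in names)


def add(a, b):
    return a + b


cls, state = simplify_code(add.__code__)
fields = dict(state)
```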

It would be nice if .pyc files were generated like so:
    import pickle
    import foo
    pickle.dump(foo, open("foo.pyc", "wb"))

It would also allow sending of code between machines, just
like any other object.

Copying
========
Shallow copy, as well as deep copy, can be implemented using
the semantics of this new API. The copy module should be rewritten
accordingly:
    def copy(obj):
        cls, state = obj.__simplify__()
        return cls.__rebuild__(state)

deepcopy() can be implemented similarly.
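
One way deepcopy() might look is sketched below. Because built-in types do
not have the proposed methods today, the hypothetical helpers simplify and
rebuild emulate them for a few containers; the Node class is likewise
invented for the example. The sketch keeps no memo of visited objects, so
it does not handle cyclic references or preserve shared references.

```python
BUILTIN_CONTAINERS = (dict, tuple, list, set, frozenset)


def simplify(obj):
    """Hypothetical stand-in for obj.__simplify__() on today's Python."""
    if hasattr(type(obj), "__simplify__"):
        return obj.__simplify__()
    if type(obj) is dict:
        return dict, tuple(obj.items())
    if type(obj) in BUILTIN_CONTAINERS:
        return type(obj), tuple(obj)
    return type(obj), obj  # anything else is treated as atomic


def rebuild(cls, state):
    """Hypothetical stand-in for cls.__rebuild__(state)."""
    if hasattr(cls, "__rebuild__"):
        return cls.__rebuild__(state)
    if cls in BUILTIN_CONTAINERS:
        return cls(state)
    return state


def deepcopy_via_simplify(obj):
    cls, state = simplify(obj)
    if type(state) is tuple:
        # Copy every element first, then rebuild the container from copies.
        return rebuild(cls, tuple(deepcopy_via_simplify(item) for item in state))
    if state is obj:
        # Atomic object: nothing to copy.
        return obj
    # The state is itself a complex object, so copy it before rebuilding.
    return rebuild(cls, deepcopy_via_simplify(state))


class Node(object):
    def __init__(self, value, children):
        self.value = value
        self.children = children

    def __simplify__(self):
        return Node, (self.value, self.children)

    @classmethod
    def __rebuild__(cls, state):
        value, children = state
        return cls(value, children)
```

The tuple check comes before the atomic check on purpose: converting a
tuple to a tuple may return the very same object, which would otherwise
make every tuple look atomic.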

Deprecation
===========
With the proposed API, the copy_reg module, the __reduce__ and
__reduce_ex__ methods, and possibly other modules become deprecated.

Apart from that, the pickle and copy modules need to be updated
accordingly.

C API
======
The proposal introduces two new C-API functions:
PyObject * PyObject_Simplify(PyObject * self);
PyObject * PyObject_Rebuild(PyObject * type, PyObject * state);

This is only a suggestion, though; I'd like to hear ideas from someone
with more experience in the core.

I do not see a need for a convenience routine such as
simplify(obj) <--> obj.__simplify__(), since this API is not designed
for everyday usage. This is the case with __reduce__ today.

Object Proxying
===============
Because __simplify__ returns '(type, state)', it may choose to "lie"
about its actual type. This means that when the state is reconstructed,
another type is used. Object proxies will use this mechanism to serialize
the referred object, rather than the proxy.

class Proxy(object):
    [...]
    def __simplify__(self):
        # this returns a tuple with the *real* type and state
        return self.value.__simplify__()
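
A self-contained variant of this idea, with the elided parts filled in by
hypothetical code, could look like the following. Account is an invented
class for the proxy to refer to, and the __init__ and __getattr__ bodies of
Proxy are assumptions made only so that the example runs.

```python
class Account(object):
    def __init__(self, owner, balance):
        self.owner = owner
        self.balance = balance

    def __simplify__(self):
        return Account, (self.owner, self.balance)

    @classmethod
    def __rebuild__(cls, state):
        owner, balance = state
        return cls(owner, balance)


class Proxy(object):
    def __init__(self, value):
        self.value = value

    def __getattr__(self, name):
        # Forward unknown attribute lookups to the referred object.
        return getattr(self.value, name)

    def __simplify__(self):
        # Report the *real* type and state, not those of the proxy.
        return self.value.__simplify__()


proxy = Proxy(Account("alice", 10))
cls, state = proxy.__simplify__()
restored = cls.__rebuild__(state)
```

Here 'restored' is a plain Account rather than a Proxy, which is the effect
described above.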

Serialization
=============
Serialization is the process of converting fully-simplified objects into
byte sequences (strings). Fully-simplified objects are created by a
recursive simplifier that simplifies the entire object graph into atomic
components. The serializer then converts the atomic components
into strings.

Note that this proposal does not define how atomic objects are to be
converted to strings, or how a 'recursive simplifier' should work. These
issues are to be resolved by the implementation of the serializer.

For instance, file objects are atomic; one serializer may be able to
handle them, by storing them as (filename, file-mode, file-position),
while another may not be, so it would raise an exception.
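
The difference between two such serializers can be sketched as follows.
Both functions and their output format are hypothetical; the point is only
that the decision about file objects lives in the serializer, not in the
simplification protocol.

```python
import io
import os
import tempfile


def encode_atom_with_files(obj):
    """Hypothetical serializer that knows how to describe open files."""
    if isinstance(obj, (int, float, str)):
        return repr(obj)
    if isinstance(obj, io.IOBase) and hasattr(obj, "name") and hasattr(obj, "mode"):
        # Store the file as (filename, file-mode, file-position).
        return "file(%r, %r, %d)" % (obj.name, obj.mode, obj.tell())
    raise TypeError("unsupported atom: %s" % type(obj).__name__)


def encode_atom_strict(obj):
    """Hypothetical serializer that rejects anything but plain values."""
    if isinstance(obj, (int, float, str)):
        return repr(obj)
    raise TypeError("unsupported atom: %s" % type(obj).__name__)


# Usage: describe a temporary file after writing three bytes to it.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as handle:
    handle.write(b"abc")
    described = encode_atom_with_files(handle)
    try:
        encode_atom_strict(handle)
        strict_accepts_files = True
    except TypeError:
        strict_accepts_files = False
os.remove(path)
```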

Recursive Simplifier
====================
This code demonstrates the general idea of how recursive simplifiers
may be implemented:

def recursive_simplifier(obj):
    cls, state = obj.__simplify__()

    # simplify all the elements inside tuples
    if type(state) is tuple:
        nested_state = []
        for item in state:
            nested_state.append(recursive_simplifier(item))
        # collections must be simplified as tuples, so convert the list back
        return cls, tuple(nested_state)

    # see if the object is atomic; if not, dig deeper
    if (cls, state) == state.__simplify__():
        # 'state' is an atomic object, no need to go further
        return cls, state
    else:
        # this object is not atomic, so dig deeper
        return cls, recursive_simplifier(state)
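
The proposal leaves the inverse step, rebuilding an object from a
fully-simplified tree, to the serializer. The sketch below shows one
possible round trip on today's Python. Everything in it is an assumption of
the example: the simplify and rebuild helpers emulate the missing built-in
methods, each node carries a tag ("atom", "seq" or "nested") so that the
rebuild step is unambiguous, and an object is treated as atomic when it
reports itself as its own state.

```python
CONTAINERS = (tuple, list, set, frozenset)


def simplify(obj):
    """Hypothetical stand-in for obj.__simplify__() on today's Python."""
    if hasattr(type(obj), "__simplify__"):
        return obj.__simplify__()
    if type(obj) is complex:
        return complex, (obj.real, obj.imag)
    if type(obj) is dict:
        return dict, tuple(obj.items())
    if type(obj) in CONTAINERS:
        return type(obj), tuple(obj)
    return type(obj), obj  # everything else is treated as atomic


def rebuild(cls, state):
    """Hypothetical stand-in for cls.__rebuild__(state)."""
    if hasattr(cls, "__rebuild__"):
        return cls.__rebuild__(state)
    if cls is complex:
        return complex(*state)
    if cls is dict or cls in CONTAINERS:
        return cls(state)
    return state


def recursive_simplify(obj):
    cls, state = simplify(obj)
    if type(state) is tuple:
        # Simplify every element of the tuple.
        return ("seq", cls, tuple(recursive_simplify(item) for item in state))
    if state is obj:
        return ("atom", cls, state)
    # The state is a complex object in its own right, so dig deeper.
    return ("nested", cls, recursive_simplify(state))


def recursive_rebuild(node):
    tag, cls, payload = node
    if tag == "atom":
        return payload
    if tag == "seq":
        return rebuild(cls, tuple(recursive_rebuild(child) for child in payload))
    return rebuild(cls, recursive_rebuild(payload))


class Foo(object):
    def __init__(self):
        self.a = 5
        self.b = "spam"

    def __simplify__(self):
        return Foo, dict(self.__dict__)

    @classmethod
    def __rebuild__(cls, state):
        self = cls.__new__(cls)
        self.__dict__.update(state)
        return self


tree = recursive_simplify(Foo())
restored = recursive_rebuild(tree)
```

A serializer built on this would only ever need to encode the "atom" nodes
and the class references; the nesting itself is plain tuples.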




-tomer


