If A.M. Kuchling's list of Python Warts is any indication, Python has
removed many of the warts it once had. However, the behavior of mutable
default argument values is still a frequent stumbling-block for newbies.
It is also present on at least 3 different lists of Python's
deficiencies ([0][1][2]).
Example of current, unintuitive behavior (snipped from [0]):
>>> def popo(x=[]):
...     x.append(666)
...     print x
...
>>> popo()
[666]
>>> popo()
[666, 666]
>>> popo()
[666, 666, 666]
Whereas a newbie with experience with immutable default argument values
would, by analogy, expect:
>>> popo()
[666]
>>> popo()
[666]
>>> popo()
[666]
In scanning [0], [1], [2], and other similar lists, I have found only
one mediocre use case for this behavior: using the default argument
value to retain state between calls. However, as [2] comments, this
purpose is much better served by decorators, classes, or (though less
preferred) global variables; a rough illustration follows below. Other
uses alluded to appear equally esoteric and unpythonic.
To work around this behavior, the following idiom is used:
def popo(x=None):
    if x is None:
        x = []
    x.append(666)
    print x
However, why should the programmer have to write this extra boilerplate
code when the current, unusual behavior is relied on by only a tiny
fraction of Python code?
Therefore, I propose that default arguments be handled as follows in Py3K:
1. The initial default value is evaluated at definition-time (as in the
current behavior).
2. In a function call where the caller has not specified a value for an
optional argument, Python calls copy.deepcopy(initial_default_value) and
fills in the optional argument with the resulting value (see the sketch
below).
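In pseudo-Python, a call to the original popo() would then behave roughly
as if the function had been written like this (the _missing sentinel is
just for illustration):

import copy

_initial_default = []           # evaluated once, at definition time
_missing = object()

def popo(x=_missing):
    if x is _missing:
        # the caller didn't supply x: deep-copy the initial default
        x = copy.deepcopy(_initial_default)
    x.append(666)
    print x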
This is fully backwards-compatible with the aforementioned workaround,
and removes the need for it, allowing one to write the first,
simpler definition of popo().
Comments?
- Chris Rebert
[0] 10 Python pitfalls (http://zephyrfalcon.org/labs/python_pitfalls.html)
[1] Python Gotchas
(http://www.ferg.org/projects/python_gotchas.html#contents_item_6)
[2] When Pythons Attack
(http://www.onlamp.com/pub/a/python/2004/02/05/learn_python.html?page=2)
I make frequent use of Python's built-in debugger, which I think is
brilliant in its simplicity. However an important feature seems to be
missing: bash-like tab completion similar to that provided by the
rlcompleter module.
By default, Pdb and other instances of Cmd complete names for commands only.
However in the context of pdb, I think it is more useful to complete
identifiers and keywords in its current scope than to complete names of
commands (most of which have single letter abbreviations). I believe this
makes pdb a far more usable introspection tool.
Implementation:
I've attached a patch to pdb.py (on Python 2.4.4c1). The only real
difference from rlcompleter's default complete method is that, because pdb
changes scope as you step through a program, rlcompleter's namespace is
updated to reflect the current local and global namespace.
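To give a rough idea of the approach (a from-memory sketch, not the attached
patch itself; the class name CompletingPdb is made up, and it assumes the
readline module is available):

import pdb
import rlcompleter

class CompletingPdb(pdb.Pdb):
    def complete(self, text, state):
        # complete identifiers and keywords in the frame being debugged,
        # rebuilding the namespace each time since pdb changes scope as
        # you step through the program
        namespace = self.curframe.f_globals.copy()
        namespace.update(self.curframe.f_locals)
        return rlcompleter.Completer(namespace).complete(text, state)

Cmd already arranges for readline to call self.complete, so overriding it is
enough to get tab completion of names in the current scope.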
This is my first attempt at a python patch. Any suggestions or improvements
are welcome.
Stephen Emslie
The following is a proto-PEP based on the discussion in the thread
"fixing mutable default argument values". Comments would be greatly
appreciated.
- Chris Rebert
Title: Fixing Non-constant Default Arguments
Abstract
This PEP proposes new semantics for default arguments to remove
boilerplate code associated with non-constant default argument values,
allowing them to be expressed more clearly and succinctly.
Motivation
Currently, to write functions using non-constant default arguments,
one must use the idiom:
def foo(non_const=None):
    if non_const is None:
        non_const = some_expr
    #rest of function
or equivalent code. Naive programmers desiring mutable default arguments
often make the mistake of writing the following:
def foo(mutable=some_expr_producing_mutable):
    #rest of function
However, this does not work as intended, as
'some_expr_producing_mutable' is evaluated only *once* at
definition-time, rather than once per call at call-time. This results
in all calls to 'foo' sharing the same default object, which can have
unintended consequences and necessitates the previously mentioned
idiom. This unintuitive behavior is such a frequent stumbling block for
newbies that it is present in at least 3 lists of Python's problems [0]
[1] [2].
There are currently few, if any, known good uses of the current
behavior of mutable default arguments. The most common one is to
preserve function state between calls. However, as one of the lists [2]
comments, this purpose is much better served by decorators, classes, or
(though less preferred) global variables.
Therefore, since the current semantics aren't useful for
non-constant default values and an idiom is necessary to work around
this deficiency, why not change the semantics so that people can write
what they mean more directly, without the annoying boilerplate?
Rationale
Originally, it was proposed that all default argument values be
deep-copied from the original (evaluated at definition-time) at each
invocation of the function where the default value was required.
However, this doesn't take into account default values that are not
literals, e.g. function calls, subscripts, attribute accesses. Thus,
the new idea was to re-evaluate the default arguments at each call where
they were needed. There was some concern over the possible performance
hit this could cause, and whether there should be new syntax so that
code could use the existing semantics for performance reasons. Some of
the proposed syntaxes were:
def foo(bar=<baz>):
    #code
def foo(bar=new baz):
    #code
def foo(bar=fresh baz):
    #code
def foo(bar=separate baz):
    #code
def foo(bar=another baz):
    #code
def foo(bar=unique baz):
    #code
where the new keyword (or angle brackets) would indicate that the
parameter's default argument should use the new semantics. Other
parameters would continue to use the old semantics. It was generally
agreed that the angle-bracket syntax was particularly ugly, leading to
the proposal of the other syntaxes. However, having 2 different sets of
semantics could be confusing and leaving in the old semantics just for
performance might be premature optimization. Refactorings to deal with
the possible performance hit are discussed below.
Specification
The current semantics for default arguments are replaced by the
following semantics:
- Whenever a function is called, and the caller does not provide a
value for a parameter with a default expression, the parameter's
default expression shall be evaluated in the function's scope. The
resulting value shall be assigned to a local variable in the
function's scope with the same name as the parameter.
- The default argument expressions shall be evaluated before the
body of the function.
- The evaluation of default argument expressions shall proceed in
the same order as that of the parameter list in the function's
definition.
Given these semantics, it makes more sense to refer to default argument
expressions rather than default argument values, as the expression is
re-evaluated at each call, rather than just once at definition-time.
Therefore, we shall do so hereafter.
Demonstrative examples of new semantics:
#default argument expressions can refer to
#variables in the enclosing scope...
CONST = "hi"
def foo(a=CONST):
    print a
>>> foo()
hi
>>> CONST="bye"
>>> foo()
bye
#...or even other arguments
def ncopies(container, n=len(container)):
    return [container for i in range(n)]
>>> ncopies([1, 2], 5)
[[1, 2], [1, 2], [1, 2], [1, 2], [1, 2]]
>>> ncopies([1, 2, 3])
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
>>> #ncopies grabbed n from [1, 2, 3]'s length (3)
#default argument expressions are arbitrary expressions
def my_sum(lst):
    cur_sum = lst[0]
    for i in lst[1:]: cur_sum += i
    return cur_sum
def bar(b=my_sum((["b"] * (2 * 3))[:4])):
    print b
>>> bar()
bbbb
#default argument expressions are re-evaluated at every call...
from random import randint
def baz(c=randint(1,3)):
    print c
>>> baz()
2
>>> baz()
3
#...but only when they're required
def silly():
    print "spam"
    return 42
def qux(d=silly()):
    pass
>>> qux()
spam
>>> qux(17)
>>> qux(d=17)
>>> qux(*[17])
>>> qux(**{'d':17})
>>> #no output, because silly() was never called: d's value was specified in each call
#Rule 3
count = 0
def next():
    global count
    count += 1
    return count - 1
def frobnicate(g=next(), h=next(), i=next()):
    print g, h, i
>>> frobnicate()
0 1 2
>>> #g, h, and i's default argument expressions are evaluated in the same order as the parameter definition
Backwards Compatibility
This change in semantics breaks all code which uses mutable default
argument values. Such code can be refactored from:
def foo(bar=mutable):
    #code
to
def stateify(state):
    def _wrap(func):
        def _wrapper(*args, **kwds):
            kwds['bar'] = state
            return func(*args, **kwds)
        return _wrapper
    return _wrap

@stateify(mutable)
def foo(bar):
    #code
or
state = mutable
def foo(bar=state):
    #code
or
class Baz(object):
    def __init__(self):
        self.state = mutable
    def foo(self, bar=self.state):
        #code
The changes in this PEP are backwards-compatible with all code whose
default argument values are immutable, including code using the idiom
mentioned in the 'Motivation' section. However, such values will now be
recomputed for each call for which they are required. This may cause
performance degradation. If such recomputation is significantly
expensive, the same refactorings mentioned above can be used.
In relation to Python 3.0, this PEP's proposal is compatible with
those of PEP 3102 [3] and PEP 3107 [4]. Also, this PEP does not depend
on the acceptance of either of those PEPs.
Reference Implementation
All code of the form:
def foo(bar=some_expr, baz=other_expr):
    #body
Should act as if it had read (in pseudo-Python):
def foo(bar=_undefined, baz=_undefined):
    if bar is _undefined:
        bar = some_expr
    if baz is _undefined:
        baz = other_expr
    #body
where _undefined is the value given to a parameter when the caller
didn't specify a value for it. This is not intended to be a literal
translation, but rather a demonstration as to how Python's internal
argument-handling machinery should be changed.
References
[0] 10 Python pitfalls
http://zephyrfalcon.org/labs/python_pitfalls.html
[1] Python Gotchas
http://www.ferg.org/projects/python_gotchas.html#contents_item_6
[2] When Pythons Attack
http://www.onlamp.com/pub/a/python/2004/02/05/learn_python.html?page=2
[3] Keyword-Only Arguments
http://www.python.org/dev/peps/pep-3102/
[4] Function Annotations
http://www.python.org/dev/peps/pep-3107/
In order to resolve a path conflict where I'm working on several copies of the
same package, I found it useful to add the following near the top of modules in
a package or sub-package.
Module in package:
import sys
sys.path = ['..'] + sys.path
import package.module # Imports module in "this!" package.
Note: There could still be conflicts if a module with the same name is in the
same directory as the package. But that's much less likely than one in the rest
of the path.
Module in sub-package:
import sys
sys.path = ['../..'] + sys.path
import package.subpackage.module # finds "self" (subpackage) reliably.
By explicitly adding the package's parent directory to the *front* of sys.path,
this resolves cases where absolute imports would otherwise pick up modules from
another package, because that package is found first in the search path.
Adding this tip to the documentation somewhere would be nice (provided there
are no major surprising side effects). Of course I may have missed some obvious
way to do this. If so, it wasn't in an obvious place to be found. I looked. ;-)
----------------------------------
It might be useful to have a built-in function to do this. A function could
also check for __init__ files and raise errors if they are missing.
set_package_name(dotted.name) # Replaces import sys & path modification
Where dotted.name is the full package + sub-package name the current module is
located in. The function would search upwards to find the root package directory
and add its parent directory to the *front* of sys.path; a sketch of how this
might work follows the examples below.
Module in package:
set_package_name('package') # Add parent directory to front of sys.path
import packagename.module # Finds module in "this!" package reliably.
Module in subpackage:
set_package_name('package.subpackage')
import package.subpackage.module # Finds "self" (subpackage) reliably.
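As a purely speculative sketch (set_package_name() doesn't exist today, and
the exact checks are assumptions on my part), it might look something like:

import os
import sys

def set_package_name(dotted_name):
    # directory containing the calling module
    caller_file = sys._getframe(1).f_globals['__file__']
    directory = os.path.dirname(os.path.abspath(caller_file))
    # walk up one level per package component, checking for __init__ files
    for name in reversed(dotted_name.split('.')):
        if not os.path.exists(os.path.join(directory, '__init__.py')):
            raise ImportError("%r has no __init__.py" % directory)
        if os.path.basename(directory) != name:
            raise ImportError("expected package %r, found %r"
                              % (name, os.path.basename(directory)))
        directory = os.path.dirname(directory)
    # 'directory' is now the parent of the root package
    if directory not in sys.path:
        sys.path.insert(0, directory)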
----------------------------------
It may also be possible to modify the import behavior to allow relative imports
to work when the module is run as a script.
set_package_name('package')
from . import module1 # Imports modules from "this" package.
from . import module2
Currently an exception is raised if you try to run a module that uses relative
imports as a script:
ValueError: Attempted relative import in non-package
I think it is very handy to be able to run tests as scripts and keep them in a
sub-package. Especially while I'm writing them.
Another benefit of using relative imports with an absolutely specified package
name is that if you rename a package or relocate a submodule, you only have one
line to change.
Cheers,
Ron
I think that this problem can be solved by the following change of
default argument behavior:
default arguments are evaluated at definition time (as in the current
implementation), but right after being evaluated, the resulting object is
checked for mutability (for example, by checking for the presence of a
__copy__ special method, or for being an instance of a built-in (sub)class
of list/dict/set); if the object is mutable, the argument is marked with a
COPY_DEF_ARG flag.
There are two reasons for this check being done there:
1. performance
2. it can be controlled by a "from __future__ import ..." statement on a
per-file basis
Then, if a default argument is needed in a function call, the COPY_DEF_ARG
flag is checked first: if it is not set, the default argument behaves
exactly as in the current implementation; if the flag is set, a shallow
copy of it is used instead.
Adding the following classes/functions to the stdlib would allow reproducing
the old behavior, as well as adding some new possibilities.
class IterateDefaultArg(object):
    def __init__(self, iterator):
        self.__iterator = iterator
    def __copy__(self):
        # each "copy" pulls the next value from the iterator
        return self.__iterator.next()

class DefaultArgWrapper(object):
    def __init__(self, generatorfunc):
        self.__generatorfunc = generatorfunc
    def __call__(self, *args, **kwargs):
        return IterateDefaultArg(self.__generatorfunc(*args, **kwargs))

@DefaultArgWrapper
def nocopyarg(obj):
    # yields the very same object forever, i.e. the old (shared) behavior
    while 1:
        yield obj
With this, a current definition like:
def foo(cache = {}):
    ...
would need to be replaced by:
def foo(cache = nocopyarg({})):
    ...
If one wants to use deep copy instead of shallow copy, it might be done
like this:
@DefaultArgWrapper
def deepcopyarg(obj):
    from copy import deepcopy
    while 1:
        yield deepcopy(obj)

def foo(x = deepcopyarg(<some expression>)):
    ...
P.S. sorry for my bad English
--
The black power hidden in darkness
controls weak hearts
On 1/30/07, Eduardo EdCrypt O. Padoan <eopadoan(a)altavix.com> wrote:
> > If I have time and figure out the right regexes I'll try and come up with
> > some more numbers on the entire stdlib, and the amount of uses of =None.
> >
>
> Some uses of the spam=[] and ham=None in Python projects, including
> Python itself:
>
> http://www.google.com/codesearch?q=def.*%5C(.*%3D%5C%5B%5C%5D.*%5C)%3A%20la…
> http://www.google.com/codesearch?hl=en&lr=&q=def.*%5C%28.*%3DNone.*%5C%29%3…
>
> In this second search, I need a way to search that, in the body of the
> function, we have something like "if foo is None: foo = []" (and
> foo = {} too)
>
--
EduardoOPadoan (eopadoan->altavix::com)
Bookmarks: http://del.icio.us/edcrypt
Blog: http://edcrypt.blogspot.com
Jabber: edcrypt at jabber dot org
ICQ: 161480283
GTalk: eduardo dot padoan at gmail dot com
MSN: eopadoan at altavix dot com
there really is no need to have something like that built into the language.
most default arguments are immutable anyhow, and i'd guess most of the
mutable defaults can be addressed with the suggested @copydefaults
decorator.
as for uncopyable or stateful defaults (dev = sys.stdout), which require
reevaluation,
you can just use this modified version of my first suggested decorator:
>>> def reeval(**kwdefs):
...     def deco(func):
...         def wrapper(*args, **kwargs):
...             for k, v in kwdefs.iteritems():
...                 if k not in kwargs:
...                     kwargs[k] = v()   # <--- this is the big change
...             return func(*args, **kwargs)
...         return wrapper
...     return deco
...
the defaults are now provided as *functions*, which are evaluated at the
time of calling the function.
>>> @reeval(device = lambda: sys.stdout)
... def say(text, device):
...     device.write("%s: %s\n" % (device.name, text))
...     device.flush()
this means you can do things like --
>>> say("hello1")
<stdout>: hello1
>>>
>>> say("hello2", device = sys.stderr)
<stderr>: hello2
>>>
>>> sys.stdout = sys.stderr
>>> say("hello3")
<stderr>: hello3
>>>
>>> sys.stdout = sys.__stdout__
>>> say("hello4")
<stdout>: hello4
decorators are powerful enough for this not-so-common case.
it would be nice to have a collection of useful decorators as part of the
stdlib (such as these ones, but including memoize and many others)...
but that's a different issue.
maybe we should really start a list of useful decorators to be included
as a stdlib module.
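for example, a memoize along the same lines might look like this (just a
sketch, and only good for hashable positional arguments):

def memoize(func):
    cache = {}
    def wrapper(*args):
        # compute each distinct argument tuple only once
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper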
-tomer
this text needs more thought and rephrasing, but i think that
it covers all the details.
--------------------------------------------------------------------------
Abstract
=========
The simplification API provides a mechanism to get the state
of an object, as well as to reconstruct an object with a given state.
This new API would affect pickle, copy_reg, and copy modules,
as well as the writers of serializers.
The idea is to separate the state-gathering from the encoding:
* State-gathering is getting the state of the object
* Encoding is converting that state (which is itself an object) into
a sequence of bytes (textual or binary)
This is somewhat similar to the ISerializable interface of .NET,
which defines only GetObjectData(); the actual encoding is done
by a Formatter class, which has different implementations for
SOAP-formatting, binary-formatting, and other formats.
Motivation
==========
There are many generic and niche serializers out there, including
pickle, banana/jelly, Cerealizer, brine, and lots of serializers
targeting XML output. Currently, all serializers have their own
methods of retrieving the contents, or state of the object.
This API attempts to solve this issue by providing standard
means for getting object state, and creating objects with
their state restored.
Another issue is making the serialization process "proxy-friendly".
Many frameworks use object proxies to indirectly refer to another
object (for instance, RPC proxies, FFI, etc.). In this case, it's desirable
to simplify the referenced object rather than the proxy, and this API
addresses this issue too.
Simplification
==============
Simplification is the process of converting a 'complex' object
into its "atomic" components. You may think of these atomic
components as the *contents* of the object.
This proposal does not state what "atomic" means -- this is
open to the decision of the class. The only restriction imposed is that
collections of any kind must be simplified as tuples.
Moreover, the simplification process may be recursive: object X
may simplify itself in terms of object Y, which in turn may undergo
further simplification.
Simplification Protocol
=======================
This proposal introduces two new special methods:
def __simplify__(self):
    return type, state

@classmethod
def __rebuild__(cls, state):
    return new_instance_of_cls_with_given_state
__simplify__ takes no arguments (bar 'self'), and returns a tuple
of '(type, state)', representing the contents of 'self':
* 'type' is expected to be a class or a builtin type, although
it can be any object that has a '__rebuild__' method.
This 'type' will be used later to reconstruct the object
(using 'type.__rebuild__(state)').
* 'state' is any other object that represents the inner state of 'self',
in a simplified form. This can be an atomic value, or yet another
complex object that may undergo further simplification.
__rebuild__ is expected to be a classmethod that takes the
state returned by __simplify__, and returns a new instance of
'cls' with the given state.
If a type does not wish to be simplified, it may throw a TypeError
in its __simplify__ method; however, this is not recommended.
Types that want to be treated as atomic elements, such as file,
should just return themselves, and let the serializer handle them.
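For example, a type that wishes to be treated as an atomic element might do
no more than the following (an illustrative sketch, not part of the proposal):

class Handle(object):
    def __simplify__(self):
        # atomic: hand back the instance itself and let the
        # serializer decide what to do with it
        return Handle, self
    @classmethod
    def __rebuild__(cls, state):
        return state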
Default Simplification
=====================
All the built-in types would grow __simplify__ and __rebuild__
methods, which would follow these guidelines:
Primitive types (int, str, float, ...) are considered atomic.
Composite types (I think 'complex' is the only one) are
broken down into their components. For the complex type,
that would be a tuple of (real, imaginary).
Container types (tuples, lists, sets, dicts) represent themselves
as tuples of items. For example, dicts would be simplified
according to this pseudocode:
def PyDict_Simplify(PyObject * self):
    return PyDictType, tuple(self.items())
Built-in types would be considered atomic. User-defined classes
can be simplified into their metaclass, __bases__, and __dict__.
The type 'object' would simplify instances by returning their
__dict__ and any __slots__ the instance may have. This is
the default behavior; classes that desire a different behavior
would override __simplify__ and __rebuild__.
Example of default behavior:
>>> class Foo(object):
...     def __init__(self):
...         self.a = 5
...         self.b = "spam"
...
>>> f = Foo()
>>> cls, state = f.__simplify__()
>>> cls
<class '__main__.Foo'>
>>> state
{"a" : 5, "b" : "spam"}
>>> shallow_copy = cls.__rebuild__(state)
>>> state.__simplify__()
(<type 'dict'>, (("a", 5), ("b", "spam")))
Example of customized behavior:
>>> class Bar(object):
...     def __init__(self):
...         self.a = 5
...         self.b = "spam"
...     def __simplify__(self):
...         return Bar, 17.5
...     @classmethod
...     def __rebuild__(cls, state):
...         self = cls.__new__(cls)
...         if state == 17.5:
...             self.a = 5
...             self.b = "spam"
...         return self
...
>>> b = Bar()
>>> b.__simplify__()
(<class '__main__.Bar'>, 17.5)
Code objects
=============
I wish that modules, classes and functions would also be simplifiable,
however, there are some issues with that:
* How to serialize code objects? These can be simplified as a tuple of
their co_* attributes, but these attributes are very implementation-
specific.
* How to serialize cell variables, or other globals?
It would be nice if .pyc files were generated like so:
import foo
pickle.dump(foo)
It would also allow sending of code between machines, just
like any other object.
Copying
========
Shallow copy, as well as deep copy, can be implemented using
the semantics of this new API. The copy module should be rewritten
accordingly:
def copy(obj):
    cls, state = obj.__simplify__()
    return cls.__rebuild__(state)
deepcopy() can be implemented similarly.
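For instance, a rough sketch of deepcopy() under this API (assuming, per the
guidelines above, that atomic objects return themselves as their state):

def deepcopy(obj):
    cls, state = obj.__simplify__()
    if state is obj:
        # atomic object -- nothing more to copy
        return obj
    if type(state) is tuple:
        # deep-copy each element of a collection's state
        state = tuple(deepcopy(item) for item in state)
    else:
        # non-atomic, non-tuple state: copy it recursively
        state = deepcopy(state)
    return cls.__rebuild__(state)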
Deprecation
===========
With the proposed API, copy_reg, __reduce__, __reduce_ex__,
and possibly other mechanisms become deprecated.
Apart from that, the pickle and copy modules need to be updated
accordingly.
C API
======
The proposal introduces two new C-API functions:
PyObject * PyObject_Simplify(PyObject * self);
PyObject * PyObject_Rebuild(PyObject * type, PyObject * state);
This is only a suggestion, though; I'd like to hear ideas from someone
with more experience in the core.
I do not see a need for a convenience routine such as
simplify(obj) <--> obj.__simplify__(), since this API is not designed
for everyday usage. This is the case with __reduce__ today.
Object Proxying
===============
Because __simplify__ returns '(type, state)', it may choose to "lie"
about its actual type. This means that when the state is reconstructed,
another type is used. Object proxies will use this mechanism to serialize
the referred object, rather than the proxy.
class Proxy(object):
    [...]
    def __simplify__(self):
        # this returns a tuple with the *real* type and state
        return self.value.__simplify__()
Serialization
============
Serialization is the process of converting fully-simplified objects into
byte sequences (strings). Fully simplified objects are created by a
recursive simplifier, that simplifies the entire object graph into atomic
components. Then, the serializer would convert the atomic components
into strings.
Note that this proposal does not define how atomic objects are to be
converted to strings, or how a 'recursive simplifier' should work. These
issues are to be resolved by the implementation of the serializer.
For instance, file objects are atomic; one serializer may be able to
handle them, by storing them as (filename, file-mode, file-position),
while another may not be, so it would raise an exception.
Recursive Simplifier
===================
This code demonstrates the general idea of how recursive simplifiers
may be implemented:
def recursive_simplifier(obj):
    cls, state = obj.__simplify__()
    # simplify all the elements inside tuples
    if type(state) is tuple:
        nested_state = []
        for item in state:
            nested_state.append(recursive_simplifier(item))
        # collections stay tuples, per the restriction above
        return cls, tuple(nested_state)
    # see if the object is atomic; if not, dig deeper
    if (cls, state) == state.__simplify__():
        # 'state' is an atomic object, no need to go further
        return cls, state
    else:
        # this object is not atomic, so dig deeper
        return cls, recursive_simplifier(state)
-tomer
i thought this could be solved nicely with a decorator... it needs some
more work, but it would make a good cookbook recipe:
>>> from copy import deepcopy
>>>
>>> def defaults(**kwdefs):
...     def deco(func):
...         def wrapper(*args, **kwargs):
...             for k, v in kwdefs.iteritems():
...                 if k not in kwargs:
...                     kwargs[k] = deepcopy(v)
...             return func(*args, **kwargs)
...         return wrapper
...     return deco
...
>>> @defaults(x = [])
... def foo(a, x):
...     x.append(a)
...     print x
...
>>> foo(5)
[5]
>>> foo(5)
[5]
>>> foo(5)
[5]
maybe it should be done by copying func_defaults... then it could
be written as
@copydefaults
def f(a, b = 5, c = []):
    ...
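a possible way to write that @copydefaults (just a sketch -- it swaps deep
copies of func_defaults in around each call, so it's not thread-safe):

from copy import deepcopy

def copydefaults(func):
    # remember the defaults evaluated at definition time
    originals = func.func_defaults
    def wrapper(*args, **kwargs):
        # give this call its own deep-copied defaults
        func.func_defaults = deepcopy(originals)
        try:
            return func(*args, **kwargs)
        finally:
            func.func_defaults = originals
    return wrapper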
-tomer
---------- Forwarded message ----------
From: tomer filiba <tomerfiliba(a)gmail.com>
Date: Jan 24, 2007 10:45 PM
Subject: new pickle semantics/API
To: Python-3000(a)python.org
i'm having great trouble in RPyC with pickling object proxies.
several users have asked for this feature, but no matter how hard
i try to "bend the truth", pickle always complains. it uses
type(obj) for the dispatching, which "uncovers" that the object
is actually a proxy rather than a real object.
recap: RPyC uses local proxies that refer to objects of a
remote interpreter (another process/machine).
if you'd noticed, every RPC framework has its own serializer.
for example banana/jelly in twisted, and a bunch of other
XML serializers, and what not.
for RPyC i wrote yet another serializer, but for different purposes,
so it's not relevant for the issue at hand.
what i want is a standard serialization *API*. the idea is that
any framework could make use of this API, and that it would
be generic enough to eliminate copy_reg and other misfortunes.
this also means the built in types should be familiarized with
this API.
- - - - - - - -
for example, currently the builtin types don't support __reduce__,
and require pickle to use its own internal registry. moreover,
__reduce__ is very pickle-specific (i.e., it takes the protocol number).
what i'm after is an API for "simplifying" complex objects into
simpler parts.
here's the API i'm suggesting:
def __getstate__(self):
    # return a tuple of (type(self), obj), where obj is a simplified
    # version of self

@classmethod
def __setstate__(cls, state):
    # return an instance of cls, with the given state
well, you may already know these two, although their
semantics are different. but wait, there's more!
the idea is of having the following simple building blocks:
* integers (int/long)
* strings (str)
* arrays (tuples)
all picklable objects should be able to express themselves
as a collection of these building blocks. of course this will be
recursive, i.e., object X could simplify itself as object Y,
where object Y might undergo further simplification, until we are
left with building blocks only.
for example:
* int - return self
* float - string in the format "[+-]X.YYYe[+-]EEE"
* complex - two floats
* tuple - tuple of its simplified elements
* list - tuple of its simplified elements
* dict - a tuple of (key, value) tuples
* set - a tuple of its items
* file - raises TypeError("can't be simplified")
all in all, i choose to call that *simplification* rather than
*serialization*,
as serialization is more about converting the simplified objects into a
sequence of bytes. my suggestion leaves that out for the
implementers of specific serializers.
so this is how a typical serializer (e.g., pickle) would be implemented:
* define its version of a "recursive simplifier"
* optionally use a "memo" to remember objects that were already
visited (so they would be serialized by reference rather than by value)
* define its flavor of converting ints, strings, and arrays to bytes
(binary, textual, etc. etc.)
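as a toy illustration of that recipe (no memo, not pickle-compatible, and it
assumes the __getstate__ API sketched above):

def dumps(obj):
    # the three building blocks are encoded directly...
    if isinstance(obj, (int, long)):
        return "I%d;" % obj
    if isinstance(obj, str):
        return "S%d:%s" % (len(obj), obj)
    if isinstance(obj, tuple):
        return "T%d:%s" % (len(obj), "".join(dumps(item) for item in obj))
    # ...anything else is asked to simplify itself first
    cls, state = obj.__getstate__()
    return "O" + dumps((cls.__name__, state))

a real serializer would also keep a memo of objects it has already seen, so
that shared and cyclic references are stored only once.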
- - - - - - - -
the default implementation of __getstate__, in object.__getstate__,
will simply return self.__dict__ and any self.__slots__
this removes the need for __reduce__, __reduce_ex__, and copy_reg,
and simplifies pickle greatly. it does require, however, adding
support for simplification for all builtin types... but this doesn't
call for much code:
def PyList_GetState(self):
    state = tuple(PyObject_GetState(item) for item in self)
    return PyListType, state
also note that it makes the copy module much simpler:
def copy(obj):
    cls, state = obj.__getstate__()
    return cls.__setstate__(state)
- - - - - - - -
executive summary:
simplifying object serialization and copying by revising
__getstate__ and __setstate__, so that they return a
"simplified" version of the object.
this new mechanism should become an official API to
getting or setting the "contents" of objects (either builtin or
user-defined).
having this unified mechanism, pickling proxy objects would
work as expected.
if there's interest, i'll write a pep-like document to explain
all the semantics.
-tomer