webmaster has already heard from 4 people who cannot install it.
I sent them to the bug tracker or to python-list, but they seem
not to have gone to either place. Is there some guide I should be
sending them to, along the lines of 'how to debug installation problems'?
Laura
Hi,
I'd like to submit this PEP for discussion. It is quite specialized
and the main target audience of the proposed changes is
users and authors of applications/libraries transferring large amounts
of data (read: the scientific computing & data science ecosystems).
https://www.python.org/dev/peps/pep-0574/
The PEP text is also inlined below.
Regards
Antoine.
PEP: 574
Title: Pickle protocol 5 with out-of-band data
Version: $Revision$
Last-Modified: $Date$
Author: Antoine Pitrou <solipsis(a)pitrou.net>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 23-Mar-2018
Post-History:
Resolution:
Abstract
========
This PEP proposes to standardize a new pickle protocol version, and
accompanying APIs to take full advantage of it:
1. A new pickle protocol version (5) to cover the extra metadata needed
for out-of-band data buffers.
2. A new ``PickleBuffer`` type for ``__reduce_ex__`` implementations
to return out-of-band data buffers.
3. A new ``buffer_callback`` parameter when pickling, to handle out-of-band
data buffers.
4. A new ``buffers`` parameter when unpickling to provide out-of-band data
buffers.
The PEP guarantees unchanged behaviour for anyone not using the new APIs.
Rationale
=========
The pickle protocol was originally designed in 1995 for on-disk persistency
of arbitrary Python objects. The performance of a 1995-era storage medium
probably made it irrelevant to focus on performance metrics such as
use of RAM bandwidth when copying temporary data before writing it to disk.
Nowadays the pickle protocol sees a growing use in applications where most
of the data isn't ever persisted to disk (or, when it is, it uses a portable
format instead of a Python-specific one). Instead, pickle is being used to transmit
data and commands from one process to another, either on the same machine
or on multiple machines. Those applications will sometimes deal with very
large data (such as Numpy arrays or Pandas dataframes) that need to be
transferred around. For those applications, pickle is currently
wasteful as it imposes spurious memory copies of the data being serialized.
As a matter of fact, the standard ``multiprocessing`` module uses pickle
for serialization, and therefore also suffers from this problem when
sending large data to another process.
Third-party Python libraries, such as Dask [#dask]_, PyArrow [#pyarrow]_
and IPyParallel [#ipyparallel]_, have started implementing alternative
serialization schemes with the explicit goal of avoiding copies on large
data. Implementing a new serialization scheme is difficult and often
leads to reduced generality (since many Python objects support pickle
but not the new serialization scheme). Falling back on pickle for
unsupported types is an option, but then you get back the spurious
memory copies you wanted to avoid in the first place. For example,
``dask`` is able to avoid memory copies for Numpy arrays and
built-in containers thereof (such as lists or dicts containing Numpy
arrays), but if a large Numpy array is an attribute of a user-defined
object, ``dask`` will serialize the user-defined object as a pickle
stream, leading to memory copies.
The common theme of these third-party serialization efforts is to generate
a stream of object metadata (which contains pickle-like information about
the objects being serialized) and a separate stream of zero-copy buffer
objects for the payloads of large objects. Note that, in this scheme,
small objects such as ints, etc. can be dumped together with the metadata
stream. Refinements can include opportunistic compression of large data
depending on its type and layout, like ``dask`` does.
This PEP aims to make ``pickle`` usable in a way where large data is handled
as a separate stream of zero-copy buffers, letting the application handle
those buffers optimally.
Example
=======
To keep the example simple and avoid requiring knowledge of third-party
libraries, we will focus here on a bytearray object (but the issue is
conceptually the same with more sophisticated objects such as Numpy arrays).
Like most objects, the bytearray object isn't immediately understood by
the pickle module and must therefore specify its decomposition scheme.
Here is how a bytearray object currently decomposes for pickling::
>>> b.__reduce_ex__(4)
(<class 'bytearray'>, (b'abc',), None)
This is because the ``bytearray.__reduce_ex__`` implementation reads
morally as follows::
class bytearray:
def __reduce_ex__(self, protocol):
if protocol == 4:
return type(self), bytes(self), None
# Legacy code for earlier protocols omitted
In turn it produces the following pickle code::
>>> pickletools.dis(pickletools.optimize(pickle.dumps(b, protocol=4)))
0: \x80 PROTO 4
2: \x95 FRAME 30
11: \x8c SHORT_BINUNICODE 'builtins'
21: \x8c SHORT_BINUNICODE 'bytearray'
32: \x93 STACK_GLOBAL
33: C SHORT_BINBYTES b'abc'
38: \x85 TUPLE1
39: R REDUCE
40: . STOP
(the call to ``pickletools.optimize`` above is only meant to make the
pickle stream more readable by removing the MEMOIZE opcodes)
We can notice several things about the bytearray's payload (the sequence
of bytes ``b'abc'``):
* ``bytearray.__reduce_ex__`` produces a first copy by instantiating a
new bytes object from the bytearray's data.
* ``pickle.dumps`` produces a second copy when inserting the contents of
that bytes object into the pickle stream, after the SHORT_BINBYTES opcode.
* Furthermore, when deserializing the pickle stream, a temporary bytes
object is created when the SHORT_BINBYTES opcode is encountered (inducing
a data copy).
What we really want is something like the following:
* ``bytearray.__reduce_ex__`` produces a *view* of the bytearray's data.
* ``pickle.dumps`` doesn't try to copy that data into the pickle stream
but instead passes the buffer view to its caller (which can decide on the
most efficient handling of that buffer).
* When deserializing, ``pickle.loads`` takes the pickle stream and the
buffer view separately, and passes the buffer view directly to the
bytearray constructor.
We see that several conditions are required for the above to work:
* ``__reduce__`` or ``__reduce_ex__`` must be able to return *something*
that indicates a serializable no-copy buffer view.
* The pickle protocol must be able to represent references to such buffer
views, instructing the unpickler that it may have to get the actual buffer
out of band.
* The ``pickle.Pickler`` API must provide its caller with a way
to receive such buffer views while serializing.
* The ``pickle.Unpickler`` API must similarly allow its caller to provide
the buffer views required for deserialization.
* For compatibility, the pickle protocol must also be able to contain direct
serializations of such buffer views, such that current uses of the ``pickle``
API don't have to be modified if they are not concerned with memory copies.
Producer API
============
We are introducing a new type ``pickle.PickleBuffer`` which can be
instantiated from any buffer-supporting object, and is specifically meant
to be returned from ``__reduce__`` implementations::
class bytearray:
def __reduce_ex__(self, protocol):
if protocol == 5:
return type(self), PickleBuffer(self), None
# Legacy code for earlier protocols omitted
``PickleBuffer`` is a simple wrapper that doesn't have all the memoryview
semantics and functionality, but is specifically recognized by the ``pickle``
module if protocol 5 or higher is enabled. It is an error to try to
serialize a ``PickleBuffer`` with pickle protocol version 4 or earlier.
Only the raw *data* of the ``PickleBuffer`` will be considered by the
``pickle`` module. Any type-specific *metadata* (such as shapes or
datatype) must be returned separately by the type's ``__reduce__``
implementation, as is already the case.
PickleBuffer objects
--------------------
The ``PickleBuffer`` class supports a very simple Python API. Its constructor
takes a single PEP 3118-compatible object [#pep-3118]_. ``PickleBuffer``
objects themselves support the buffer protocol, so consumers can
call ``memoryview(...)`` on them to get additional information
about the underlying buffer (such as the original type, shape, etc.).
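As a minimal sketch of the proposed Python-level behaviour (the names follow
the proposal above; the exact values shown are an illustration, assuming a
writable bytearray is wrapped)::

    >>> from pickle import PickleBuffer
    >>> buf = PickleBuffer(bytearray(b'abc'))
    >>> m = memoryview(buf)
    >>> m.nbytes, m.readonly
    (3, False)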
On the C side, a simple API will be provided to create and inspect
PickleBuffer objects:
``PyObject *PyPickleBuffer_FromObject(PyObject *obj)``
Create a ``PickleBuffer`` object holding a view over the PEP 3118-compatible
*obj*.
``PyPickleBuffer_Check(PyObject *obj)``
Return whether *obj* is a ``PickleBuffer`` instance.
``const Py_buffer *PyPickleBuffer_GetBuffer(PyObject *picklebuf)``
Return a pointer to the internal ``Py_buffer`` owned by the ``PickleBuffer``
instance.
``PickleBuffer`` can wrap any kind of buffer, including non-contiguous
buffers. It's up to consumers to decide how best to handle different kinds
of buffers (for example, some consumers may find it acceptable to make a
contiguous copy of non-contiguous buffers).
Consumer API
============
``pickle.Pickler.__init__`` and ``pickle.dumps`` are augmented with an additional
``buffer_callback`` parameter::
class Pickler:
def __init__(self, file, protocol=None, ..., buffer_callback=None):
"""
If *buffer_callback* is not None, then it is called with a list
of out-of-band buffer views when deemed necessary (this could be
once every buffer, or only after a certain size is reached,
or once at the end, depending on implementation details). The
callback should arrange to store or transmit those buffers without
changing their order.
If *buffer_callback* is None (the default), buffer views are
serialized into *file* as part of the pickle stream.
It is an error if *buffer_callback* is not None and *protocol* is
None or smaller than 5.
"""
def pickle.dumps(obj, protocol=None, *, ..., buffer_callback=None):
"""
See above for *buffer_callback*.
"""
``pickle.Unpickler.__init__`` and ``pickle.loads`` are augmented with an
additional ``buffers`` parameter::
class Unpickler:
def __init__(self, file, *, ..., buffers=None):
"""
If *buffers* is not None, it should be an iterable of buffer-enabled
objects that is consumed each time the pickle stream references
an out-of-band buffer view. Such buffers have been given in order
to the *buffer_callback* of a Pickler object.
If *buffers* is None (the default), then the buffers are taken
from the pickle stream, assuming they are serialized there.
It is an error for *buffers* to be None if the pickle stream
was produced with a non-None *buffer_callback*.
"""
def pickle.loads(data, *, ..., buffers=None):
"""
See above for *buffers*.
"""
Protocol changes
================
Three new opcodes are introduced:
* ``BYTEARRAY`` creates a bytearray from the data following it in the pickle
stream and pushes it on the stack (just like ``BINBYTES8`` does for bytes
objects);
* ``NEXT_BUFFER`` fetches a buffer from the ``buffers`` iterable and pushes
it on the stack.
* ``READONLY_BUFFER`` makes a readonly view of the top of the stack.
When pickling encounters a ``PickleBuffer``, there can be four cases:
* If a ``buffer_callback`` is given and the ``PickleBuffer`` is writable,
the ``PickleBuffer`` is given to the callback and a ``NEXT_BUFFER`` opcode
is appended to the pickle stream.
* If a ``buffer_callback`` is given and the ``PickleBuffer`` is readonly,
the ``PickleBuffer`` is given to the callback and a ``NEXT_BUFFER`` opcode
is appended to the pickle stream, followed by a ``READONLY_BUFFER`` opcode.
* If no ``buffer_callback`` is given and the ``PickleBuffer`` is writable,
it is serialized into the pickle stream as if it were a ``bytearray`` object.
* If no ``buffer_callback`` is given and the ``PickleBuffer`` is readonly,
it is serialized into the pickle stream as if it were a ``bytes`` object.
The distinction between readonly and writable buffers is explained below
(see "Mutability").
Caveats
=======
Mutability
----------
PEP 3118 buffers [#pep-3118]_ can be readonly or writable. Some objects,
such as Numpy arrays, need to be backed by a mutable buffer for full
operation. Pickle consumers that use the ``buffer_callback`` and ``buffers``
arguments will have to be careful to recreate mutable buffers. When doing
I/O, this implies using buffer-passing API variants such as ``readinto``
(which are also often preferable for performance).
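For example, a consumer receiving out-of-band buffers from a stream might
allocate mutable ``bytearray`` objects and fill them with ``readinto`` (a
hypothetical sketch; ``stream`` and ``sizes`` stand for whatever transport
and framing scheme the application uses, and short reads are ignored for
brevity)::

    def recv_buffers(stream, sizes):
        # Receive each out-of-band buffer into a freshly allocated, mutable
        # bytearray, so that unpickled objects can be backed by writable memory.
        buffers = []
        for size in sizes:
            buf = bytearray(size)
            stream.readinto(buf)   # fill in place, without an extra copy
            buffers.append(buf)
        return buffers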
Data sharing
------------
If you pickle and then unpickle an object in the same process, passing
out-of-band buffer views, then the unpickled object may be backed by the
same buffer as the original pickled object.
For example, it might be reasonable to implement reduction of a Numpy array
as follows (crucial metadata such as shapes is omitted for simplicity)::
class ndarray:
def __reduce_ex__(self, protocol):
if protocol == 5:
return numpy.frombuffer, (PickleBuffer(self), self.dtype)
# Legacy code for earlier protocols omitted
Then simply passing the PickleBuffer around from ``dumps`` to ``loads``
will produce a new Numpy array sharing the same underlying memory as the
original Numpy object (and, incidentally, keeping it alive)::
>>> import numpy as np
>>> a = np.zeros(10)
>>> a[0]
0.0
>>> buffers = []
>>> data = pickle.dumps(a, protocol=5, buffer_callback=buffers.extend)
>>> b = pickle.loads(data, buffers=buffers)
>>> b[0] = 42
>>> a[0]
42.0
This won't happen with the traditional ``pickle`` API (i.e. without passing
``buffers`` and ``buffer_callback`` parameters), because then the buffer view
is serialized inside the pickle stream with a copy.
Alternatives
============
The ``pickle`` persistence interface is a way of storing references to
designated objects in the pickle stream while handling their actual
serialization out of band. For example, one might consider the following
for zero-copy serialization of bytearrays::
class MyPickle(pickle.Pickler):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.buffers = []
def persistent_id(self, obj):
if type(obj) is not bytearray:
return None
else:
index = len(self.buffers)
self.buffers.append(obj)
return ('bytearray', index)
class MyUnpickle(pickle.Unpickler):
def __init__(self, *args, buffers, **kwargs):
super().__init__(*args, **kwargs)
self.buffers = buffers
def persistent_load(self, pid):
type_tag, index = pid
if type_tag == 'bytearray':
return self.buffers[index]
else:
assert 0 # unexpected type
This mechanism has two drawbacks:
* Each ``pickle`` consumer must reimplement ``Pickler`` and ``Unpickler``
subclasses, with custom code for each type of interest. Essentially,
N pickle consumers end up each implementing custom code for M producers.
This is difficult (especially for sophisticated types such as Numpy
arrays) and poorly scalable.
* Each object encountered by the pickle module (even simple built-in objects
such as ints and strings) triggers a call to the user's ``persistent_id()``
method, leading to a possible performance drop compared to nominal.
Open questions
==============
Should ``buffer_callback`` take a single buffer or a sequence of buffers?
* Taking a single buffer would allow returning a boolean indicating whether
the given buffer is serialized in-band or out-of-band.
* Taking a sequence of buffers is potentially more efficient by reducing
function call overhead.
Related work
============
Dask.distributed implements a custom zero-copy serialization with fallback
to pickle [#dask-serialization]_.
PyArrow implements zero-copy component-based serialization for a few
selected types [#pyarrow-serialization]_.
PEP 554 proposes hosting multiple interpreters in a single process, with
provisions for transferring buffers between interpreters as a communication
scheme [#pep-554]_.
Acknowledgements
================
Thanks to the following people for early feedback: Nick Coghlan, Olivier
Grisel, Stefan Krah, MinRK, Matt Rocklin, Eric Snow.
References
==========
.. [#dask] Dask.distributed -- A lightweight library for distributed computing
in Python
https://distributed.readthedocs.io/
.. [#dask-serialization] Dask.distributed custom serialization
https://distributed.readthedocs.io/en/latest/serialization.html
.. [#ipyparallel] IPyParallel -- Using IPython for parallel computing
https://ipyparallel.readthedocs.io/
.. [#pyarrow] PyArrow -- A cross-language development platform for in-memory data
https://arrow.apache.org/docs/python/
.. [#pyarrow-serialization] PyArrow IPC and component-based serialization
https://arrow.apache.org/docs/python/ipc.html#component-based-serialization
.. [#pep-3118] PEP 3118 -- Revising the buffer protocol
https://www.python.org/dev/peps/pep-3118/
.. [#pep-554] PEP 554 -- Multiple Interpreters in the Stdlib
https://www.python.org/dev/peps/pep-0554/
Copyright
=========
This document has been placed into the public domain.
Hi,
On Twitter, Raymond Hettinger wrote:
"The decision making process on Python-dev is an anti-pattern,
governed by anecdotal data and ambiguity over what problem is solved."
https://twitter.com/raymondh/status/887069454693158912
About "anecdotal data", I would like to discuss the Python startup time.
== Python 3.7 compared to 2.7 ==
First of all, on speed.python.org, we have:
* Python 2.7: 6.4 ms with site, 3.0 ms without site (-S)
* master (3.7): 14.5 ms with site, 8.4 ms without site (-S)
Python 3.7 startup time is 2.3x slower with site (default mode), or
2.8x slower without site (-S command line option).
(I will skip Python 3.4, 3.5 and 3.6 which are much worse than Python 3.7...)
So if a user complained about Python 2.7 startup time, be prepared
for a user who is 2x - 3x angrier when "forced" to upgrade to Python 3!
== Mercurial vs Git, Python vs C, startup time ==
Startup time matters a lot for Mercurial since Mercurial is compared
to Git. Git and Mercurial have similar features, but Git is written in
C whereas Mercurial is written in Python. Quick benchmark on the
speed.python.org server:
* hg version: 44.6 ms +- 0.2 ms
* git --version: 974 us +- 7 us
Mercurial startup time is already 45.8x slower than Git's, even though the
tested Mercurial runs on Python 2.7.12. Now try to sell Python 3 to Mercurial
developers, with a startup time 2x - 3x slower...
I tested Mercurial 3.7.3 and Git 2.7.4 on Ubuntu 16.04.1 using "python3
-m perf command -- ...".
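(i.e. roughly:

    python3 -m perf command -- hg version
    python3 -m perf command -- git --version

modulo the exact perf options.)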
== CPython core developers don't care? no, they do care ==
Christian Heimes, Naoki INADA, Serhiy Storchaka, Yury Selivanov, me
(Victor Stinner) and other core developers have made multiple changes over
the last few years to reduce the number of imports at startup, optimize
importlib, etc.
IMHO all these core developers are well aware of the competition between
programming languages, and honestly, Python startup time isn't "good".
So let's compare it to other programming languages similar to Python.
== PHP, Ruby, Perl ==
I measured the startup time of other programming languages which are
similar to Python, still on the speed.python.org server using "python3
-m perf command -- ...":
* perl -e ' ': 1.18 ms +- 0.01 ms
* php -r ' ': 8.57 ms +- 0.05 ms
* ruby -e ' ': 32.8 ms +- 0.1 ms
Wow, Perl is quite good! PHP seems as good as Python 2 (but Python 3
is worse). Ruby startup time seems less optimized than other
languages.
Tested versions:
* perl 5, version 22, subversion 1 (v5.22.1)
* PHP 7.0.18-0ubuntu0.16.04.1 (cli) ( NTS )
* ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]
== Quick Google search ==
I also searched for "python startup time" and "python slow startup
time" on Google and found many articles. Some examples:
"Reducing the Python startup time"
http://www.draketo.de/book/export/html/498
=> "The python startup time always nagged me (17-30ms) and I just
searched again for a way to reduce it, when I found this: The
Python-Launcher caches GTK imports and forks new processes to reduce
the startup time of python GUI programs."
https://nelsonslog.wordpress.com/2013/04/08/python-startup-time/
=> "Wow, Python startup time is worse than I thought."
"How to speed up python starting up and/or reduce file search while
loading libraries?"
https://stackoverflow.com/questions/15474160/how-to-speed-up-python-startin…
=> "The first time I log to the system and start one command it takes
6 seconds just to show a few line of help. If I immediately issue the
same command again it takes 0.1s. After a couple of minutes it gets
back to 6s. (proof of short-lived cache)"
"How does one optimise the startup of a Python script/program?"
https://www.quora.com/How-does-one-optimise-the-startup-of-a-Python-script-…
=> "I wrote a Python program that would be used very often (imagine
'cd' or 'ls') for very short runtimes, how would I make it start up as
fast as possible?"
"Python Interpreter Startup time"
https://bytes.com/topic/python/answers/34469-pyhton-interpreter-startup-time
"Python is very slow to start on Windows 7"
https://stackoverflow.com/questions/29997274/python-is-very-slow-to-start-o…
=> "Python takes 17 times longer to load on my Windows 7 machine than
Ubuntu 14.04 running on a VM"
=> "returns in 0.614s on Windows and 0.036s on Linux"
"How to make a fast command line tool in Python" (old article Python 2.5.2)
https://files.bemusement.org/talks/OSDC2008-FastPython/
=> "(...) some techniques Bazaar uses to start quickly, such as lazy imports."
--
So please continue the efforts to make Python startup even faster, to beat
all other programming languages and finally convince Mercurial to
upgrade ;-)
Victor
Hi folks,
As some people here know I've been working off and on for a while to
improve CPython's support of Cygwin. I'm motivated in part by a need
to have software working on Python 3.x on Cygwin for the foreseeable
future, preferably with minimal graft. (As an incidental side-effect
Python's test suite--especially of system-level functionality--serves
as an interesting test suite for Cygwin itself too.)
This is partly what motivated PEP 539 [1], although that PEP had the
advantage of benefiting other POSIX-compatible platforms as well (and
in fact was fixing an aspect of CPython that made it unfriendly to
supporting other platforms).
As far as I can tell, the first commit to Python to add any kind of
support for Cygwin was made by Guido (committing a contributed patch)
back in 1999 [2]. Since then, bits and pieces have been added for
Cygwin's benefit over time, with varying degrees of impact in terms of
#ifdefs and the like (for the most part Cygwin does not require *much*
in the way of special support, but it does have some differences from
a "normal" POSIX-compliant platform, such as the possibility for
case-insensitive filesystems and executables that end in .exe). I
don't know whether it's ever been "officially supported" but someone
with a longer memory of the project can comment on that. I'm not sure
if it was discussed at all or not in the context of PEP 11.
I have personally put in a fair amount of effort already in either
fixing issues on Cygwin (many of these issues also impact MinGW), or
more often than not fixing issues in the CPython test suite on
Cygwin--these are mostly tests that are broken due to invalid
assumptions about the platform (for example, that there is always a
"root" user with uid=0; this is not the case on Cygwin). In other
cases some tests need to be skipped or worked around due to
platform-specific bugs, and Cygwin is hardly the only case of this in
the test suite.
I also have an experimental AppVeyor configuration for running the
tests on Cygwin [3], as well as an experimental buildbot (not
available on the internet, but working). These currently rely on a
custom branch that includes fixes needed for the test suite to run to
completion without crashing or hanging (e.g.
https://bugs.python.org/issue31885). It would be nice to add this as
an official buildbot, but I'm not sure if it makes sense to do that
until it's "green", or at least not crashing. I have several other
patches to the tests toward this goal, and am currently down to ~22
tests failing.
Before I do any more work on this, however, it would be best to once
and for all clarify the support for Cygwin in CPython, as it has never
been "officially supported" nor unsupported--this way we can avoid
having this discussion every time a patch related to Cygwin comes up.
I could provide some arguments for why I believe Cygwin should be
supported, but before this gets too long I'd just like to float the
idea of having the discussion in the first place. It's also not
exactly clear to me how to meet the standards in PEP 11 for supporting
a platform--in particular it's not clear when a buildbot is considered
"stable", or how to achieve that without getting necessary fixes
merged into the main branch in the first place.
Thanks,
Erik
[1] https://www.python.org/dev/peps/pep-0539/
[2] https://github.com/python/cpython/commit/717d1fdf2acbef5e6b47d9b4dcf48ef182…
[3] https://ci.appveyor.com/project/embray/cpython
Hi,
I have been asked to express myself on PEP 572. I'm not sure that
it's useful, but here is my personal opinion on the proposed
"assignment expressions".
PEP 572 -- Assignment Expressions:
https://www.python.org/dev/peps/pep-0572/
First of all, I concur with others: Chris Angelico did a great job
designing a good and complete PEP, and a working implementation which is
also useful for playing with the feature!
WARNING! I was (strongly) opposed to PEP 448 Unpacking Generalizations
(ex: [1, 2, *list]) and PEP 498 f-string (f"Hello {name}"), whereas I
am now a happy user of these new syntaxes. So I'm not sure that I have
good taste :-)
Tim Peters gave the following example. The "LONG" version:
diff = x - x_base
if diff:
g = gcd(diff, n)
if g > 1:
return g
versus the "SHORT" version:
if (diff := x - x_base) and (g := gcd(diff, n)) > 1:
return g
== Write ==
If your job is to write code: the SHORT version can be preferred since
it's closer to what you have in mind and the code is shorter. When you
read your own code, it seems straightforward and you like to see
everything on the same line.
The LONG version looks like your expressiveness is limited by the
computer. It's like having to use simple words when you talk to a
child, because a child is unable to understand more subtle and
advanced sentences. You want to write beautiful code for adults,
right?
== Read and Understand ==
In my professional experience, I spent most of my time on reading
code, rather than writing code. By reading, I mean: try to understand
why this specific bug that cannot occur... is always reproduced by the
customer, whereas we fail to reproduce it in our test lab :-) This bug
is impossible, you know it, right?
So let's say that you never read the example before, and it has a bug.
By "reading the code", I really mean understanding here. In your
opinion, which version is easier to *understand*, without actually
running the code?
IMHO the LONG version is simpler to understand, since the code is
straightforward, it's easy to "guess" the *control flow* (guess in
which order instructions will be executed).
Print the code on paper and try to draw lines to follow the control
flow. It may then be easier to see how SHORT is more complex to
follow than LONG.
== Debug ==
Now let's imagine that you can run the code (someone succeeded in
reproducing the bug in the test lab!). Since it has a bug, you now
likely want to try to understand why the bug occurs using a debugger.
Sadly, most debuggers are designed as if a single line of code can only
execute a single instruction. I tried pdb: you cannot run only (diff
:= x - x_base) and then get the "diff" value before running the second
assignment; you can only execute the *full line* at once.
I would say that the LONG version is easier to debug, at least using pdb.
I regularly use gdb, which implements the "step" command as I
expect (don't execute the full line; execute sub-expressions one by
one), but it's still harder to follow the control flow when a single
line contains multiple instructions, than debugging lines with a
single instruction.
You can see it as a limitation of pdb, but many tools only have the
granularity of a whole line. Think about tracebacks. If you get an
exception at "line 1" in the SHORT example (the long "if" expression),
what can you deduce from the line number? What happened?
If you get an exception in the LONG example, the line number gives you
a little bit more information... maybe just enough to understand the
bug?
Example showing the pdb limitation:
>>> def f():
... breakpoint()
... if (x:=1) and (y:=2): pass
...
>>> f()
> <stdin>(3)f()
(Pdb) p x
*** NameError: name 'x' is not defined
(Pdb) p y
*** NameError: name 'y' is not defined
(Pdb) step
--Return--
> <stdin>(3)f()->None
(Pdb) p x
1
(Pdb) p y
2
... oh, pdb went too far. I expected a break after "x := 1" and before
"y := 2" :-(
== Write code for babies! ==
Please don't write code for yourself, but write code for babies! :-)
These babies are going to maintain your code for the next 5 years,
while you move to a different team or project in the meantime. Be
kind to your coworkers and juniors!
I'm trying to write a single instruction per line whenever possible,
even if the language I use allows much more complex expressions.
Even if the C language allows assignments in if, I avoid them, because
I regularly have to debug my own code in gdb ;-)
Now the question is which Python features are allowed for babies. I recall that
a colleague was surprised and confused by context managers. Does it
mean that try/finally should be preferred? What about f'Hello
{name.title()}' which calls a method inside a "string" (formatting)? Or
metaclasses? I guess that the limit should depend on your team, and
may be explained in the coding style designed by your whole team?
Victor
[Victor Stinner]
...
> Tim Peters gave the following example. The "LONG" version:
>
> diff = x - x_base
> if diff:
> g = gcd(diff, n)
> if g > 1:
> return g
>
> versus the "SHORT" version:
>
> if (diff := x - x_base) and (g := gcd(diff, n)) > 1:
> return g
>
> == Write ==
>
> If your job is to write code: the SHORT version can be preferred since
> it's closer to what you have in mind and the code is shorter. When you
> read your own code, it seems straightforward and you like to see
> everything on the same line.
All so, but a bit more: in context, this is just one block in a
complex algorithm. The amount of _vertical_ screen space it consumes
directly affects how much of what comes before and after it can be
seen without scrolling. Understanding this one block in isolation is
approximately useless unless you can also see how it fits into the
whole. Saving 3 lines of 5 is substantial, but it's more often saving
1 of 5 or 6. Regardless, they add up.
> The LONG version looks like your expressiveness is limited by the
> computer. It's like having to use simple words when you talk to a
> child, because a child is unable to understand more subtle and
> advanced sentences. You want to write beautiful code for adults,
> right?
I want _the whole_ to be as transparent as possible. That's a
complicated balancing act in practice.
> == Read and Understand ==
>
> In my professional experience, I spent most of my time on reading
> code, rather than writing code. By reading, I mean: try to understand
> why this specific bug that cannot occur... is always reproduced by the
> customer, whereas we fail to reproduce it in our test lab :-) This bug
> is impossible, you know it, right?
>
> So let's say that you never read the example before, and it has a bug.
Then you're screwed - pay me to fix it ;-) Seriously, as above, this
block on its own is senseless without understanding both the
mathematics behind what it's doing, and on how all the code before it
picked `x` and `x_base` to begin with.
> By "reading the code", I really mean understanding here. In your
> opinion, which version is easier to *understand*, without actually
> running the code?
Honestly, I find the shorter version a bit easier to understand:
fewer indentation levels, and less semantically empty repetition of
names.
> IMHO the LONG version is simpler to understand, since the code is
> straightforward, it's easy to "guess" the *control flow* (guess in
> which order instructions will be executed).
You're saying you don't know that in "x and y" Python evaluates x
first, and only evaluates y if x "is truthy"? Sorry, but this seems
trivial to me in either spelling.
> Print the code on paper and try to draw lines to follow the control
> flow. It may then be easier to see how SHORT is more complex to
> follow than LONG.
Since they're semantically identical, there's _something_ suspect
about a conclusion that one is _necessarily_ harder to understand than
the other ;-) I don't have a problem with you finding the longer
version easier to understand, but I do have a problem if you have a
problem with me finding the shorter easier.
> == Debug ==
>
> Now let's imagine that you can run the code (someone succeeded in
> reproducing the bug in the test lab!). Since it has a bug, you now
> likely want to try to understand why the bug occurs using a debugger.
>
> Sadly, most debuggers are designed as if a single line of code can only
> execute a single instruction. I tried pdb: you cannot run only (diff
> := x - x_base) and then get the "diff" value before running the second
> assignment; you can only execute the *full line* at once.
>
> I would say that the LONG version is easier to debug, at least using pdb.
That might be a good reason to avoid, say, list comprehensions (highly
complex expressions of just about any kind), but I think this
overlooks the primary _point_ of "binding expressions": to give names
to intermediate results. I couldn't care less if pdb executes the
whole "if" statement in one gulp, because I get exactly the same info
either way: the names `diff` and `g` bound to the results of the
expressions they named. What actual difference does it make whether
pdb binds the names one at a time, or both, before it returns to the
prompt?
Binding expressions are debugger-friendly in that they _don't_ just
vanish without a trace. It's their purpose to _capture_ the values of
the expressions they name. Indeed, you may want to add them all over
the place inside expressions, never intending to use the names, just
so that you can see otherwise-ephemeral intra-expression results in
your debugger ;-)
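For instance (a minimal sketch reusing the block from this thread), with
binding expressions the intermediate results keep names a debugger can
inspect, even though everything happens on one line:

    from math import gcd

    def check(x, x_base, n):
        if (diff := x - x_base) and (g := gcd(diff, n)) > 1:
            return g
        # A debugger stopped after the "if" can still print diff (and g,
        # whenever the second clause ran), with no separate assignment
        # statements in sight.
        return None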
> ... Think about tracebacks. If you get an exception at "line 1" in the
> SHORT example (the long "if" expression), what can you deduce
> from the line number? What happened?
>
> If you get an exception in the LONG example, the line number gives you
> a little bit more information... maybe just enough to understand the
> bug?
This one I wholly agree with, in general. In the specific example at
hand, it's weak, because there's so little that _could_ raise an
exception. For example, if the variables weren't bound to integers,
in context the code would have blown up long before reaching this
block. Python ints are unbounded, so overflow in "-" or "gcd" aren't
possible either. MemoryError is theoretically possible, and in that
case it would be good to know whether it happened during "-" or during
"gcd()". Good to know, but not really helpful, because either way you
ran out of memory :-(
> == Write code for babies! ==
>
> Please don't write code for yourself, but write code for babies! :-)
>
> These babies are going to maintain your code for the next 5 years,
> while you move to a different team or project in the meantime. Be
> kind to your coworkers and juniors!
>
> I'm trying to write a single instruction per line whenever possible,
> even if the language I use allows much more complex expressions.
> Even if the C language allows assignments in if, I avoid them, because
> I regularly have to debug my own code in gdb ;-)
>
> Now the question is which Python features are allowed for babies. I recall that
> a colleague was surprised and confused by context managers. Does it
> mean that try/finally should be preferred? What about f'Hello
> {name.title()}' which calls a method inside a "string" (formatting)? Or
> metaclasses? I guess that the limit should depend on your team, and
> may be explained in the coding style designed by your whole team?
It's the kind of thing I prefer to leave to team style guides, because
consensus will never be reached. In a different recent thread,
someone complained about using functions at all, because their names
are never wholly accurate, and in any case they hide what's "really"
going on. To my eyes, that was an unreasonably extreme "write code
for babies" position.
If a style guide banned using "and" or "or" in Python "if" or "while"
tests, I'd find that less extreme, but also unreasonable.
But if a style guide banned functions with more than 50 formal
arguments, I'd find that unreasonably tolerant.
Luckily, I only have to write code for me now, so am free to pick the
perfect compromise in every case ;-)
In pondering our approach to future Python major releases, I found
myself considering the experience we've had with Python 3. The whole
Py3k effort predates my involvement in the community so I missed a
bunch of context about the motivations, decisions, and challenges.
While I've pieced some of that together over the years now since I've
been around, I've certainly seen much of the aftermath. For me, at
least, it would be helpful to have a bit more insight into the
history. :)
With that in mind, it would be worth having an informational PEP with
an authoritative retrospective on the lessons learned from the Python
3 effort (and transition). Consider it a sort of autobiography,
"memoirs on the python-dev change to Python 3". :) At this point the
transition has settled in enough that we should be able to present a
relatively objective (and consistent) view, while we're not so far
removed that we've forgotten anything important. :) If such a
document already exists then I'd love a pointer to it.
The document would benefit (among others):
* python-dev (by giving us a clear viewpoint to inform decisions about
future releases)
* new-comers to Python that want more insight into the language
* folks transitioning from 2 to 3
* communities that have (or think they have) problems similar to those
we faced in Python 2
The PEP doesn't even have to be done all at once, nor by one person.
In fact, there are many viewpoints that would add value to the
document. Hence it would probably make sense to encourage broad
participation and then have a single editor to effect a single voice
in the document.
The contents of the retrospective document should probably cover a
broad range of topics, since there's so much to learn from the move to
Python 3. To give an indication of what I mean, I've included a rough
outline at the bottom of this message.
So...I typically strongly avoid making proposals that I'm not willing
to execute. However, in this case I simply do not have enough
experience in the history to feel comfortable doing a good job of it
in a reasonable amount of time (which matters due to the tendency of
valuable info to fade away). :/ I have no expectation that someone
will pick this up, though I do hope since the benefit would be
significant. My apologies in advance if this wasted anyone's time.
-eric
++++++++++++++++++++++++++++++++
I'd hope to see something along the lines of (at least) the following,
in rough order:
* a concise summary of the document at the top (very meta, I know :) )
+ what were we solving?
+ what was the solution?
+ why do it that way?
+ what went right?
+ what went wrong?
+ impact on the community
+ impact on core dev contribution
* timeline
* key players (and level of involvement)
+ old guard core devs
+ new guard
+ folks brought on for Py3k (e.g. IIRC a swarm of Googlers dove in)
+ non-core-devs
* motivations
* expectations (e.g. time frames, community reaction)
* corresponding results
* a summary of what we did
* alternative approaches
* what went right (and was it on purpose :) )
* what went wrong (e.g. io) and why
* how the Py3k project differed from normal python-dev workflow (e.g.
pace, decision-making, communications)
* lasting impact on python-dev
* key things that would have been better if done differently
* key decisions/planning (mostly a priori to the release work)
+ scope of backward compatibility
+ process (using PEPs with PEPs 30xx guiding)
+ schedule
+ specific changes (i.e. PEPs 31xx)
+ what was left out (and why)
+ plans to help library and app authors transition (e.g. 2to3)
+ feature/schedule overlap with Python 2 (i.e. 2.6 and 2.7)
+ the language moratorium
* things that got missed and why
+ unicode/bytes in some stdlib modules (and builtins?)
* things that were overdone (and how that got missed)
+ unicode/bytes in some stdlib modules (and builtins?)
* (last but not least) challenges faced by folks working to transition
their existing code to Python 3
Hi All,
We’re planning to finish up the bugs.python.org migration to Red Hat
OpenShift by May 14th (US PyCon Sprints). For the most part,
everything will stay the same, with the exception of cleaning up some old
URLs and redirects from the previous hosting provider: Upfront
Software.
We will post a more concrete timeline here by May 1st, but wanted to
share this exciting news to move bugs.python.org into a more stable
and optimal state.
Thank you all for your patience and feedback. A special thanks to
Maciej Szulik and Red Hat for helping the PSF with this project.
Best regards,
Mark
--
Mark Mangoba | PSF IT Manager | Python Software Foundation |
mmangoba(a)python.org | python.org | Infrastructure Staff:
infrastructure-staff(a)python.org | GPG: 2DE4 D92B 739C 649B EBB8 CCF6
DC05 E024 5F4C A0D1
PEP 572 caused a strong emotional reaction in me. I wanted to first understand
my intuitive objection to the idea before posting anything.
I feel that (name := expression) doesn't fit the narrative of PEP 20. It
doesn't remove complexity, it only moves it. What was previously its own
assignment now becomes part of the logic test. This saves on vertical whitespace but makes
parsing and understanding logic tests harder. This is a bad bargain: logic
tests already contain a lot of complexity that human readers have to cope with.
Proponents of := argue it makes several patterns flatter (= better than nested)
to express. Serial regular expression matching is a popular example. However,
(name := expression) itself is making logic tests more nested, not flatter. It
makes information in the logic test denser (= worse than sparse). Since it also
requires an additional pair of parentheses, it forces the reader to decompose
the expression in their head.
:= also goes against having one obvious way to do it. Since it's an expression,
it can also be placed on its own line or in otherwise weird places like
function call arguments. I anticipate PEP 8 would have to be extended to
explicitly discourage such abuse. Linters would grow rules against it. This is
noise.
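Two hypothetical illustrations of such placements, for concreteness:

    import os

    (pid := os.getpid())             # expression statement instead of "pid = os.getpid()"
    print(names := os.listdir('.'))  # a binding tucked inside a call argument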
I'm -1 on PEP 572, I think it's very similar in spirit to the rejected PEP 463.
-- Ł
On 2018-04-13 21:30, Raymond Hettinger wrote:
> It would be nice to have a section that specifically discusses the implications with respect to other existing function-like tooling: classmethod, staticmethod, partial, itemgetter, attrgetter, methodgetter, etc.
My hope is that there are no such implications. An important design goal
of this PEP (which I believe I achieved) is that as long as you're doing
duck typing, you should be safe. I believe that the tools in your list
do exactly that.
It's only when you use inspect or when you do type checks that you will
see the difference with this PEP.
After implementing the C code part of my PEP, there were only a
relatively small number of test failures. You can look at this commit,
which contains all the Python code changes of my implementation; it doesn't
look so bad:
https://github.com/jdemeyer/cpython/commit/c404a8f1b7d9525dd2842712fe183a05…
> For example, I would need to update the code in random._randbelow().
For the record, there are no test failures related to this, but maybe
that's just because tests for this are missing.