webmaster has already heard from 4 people who cannot install it.
I sent them to the bug tracker or to python-list, but they seem
not to have gone to either place. Is there some guide I should be
sending them to, 'how to debug installation problems'?
Laura
Hi,
I'd like to submit this PEP for discussion. It is quite specialized
and the main target audience of the proposed changes is
users and authors of applications/libraries transferring large amounts
of data (read: the scientific computing & data science ecosystems).
https://www.python.org/dev/peps/pep-0574/
The PEP text is also inlined below.
Regards
Antoine.
PEP: 574
Title: Pickle protocol 5 with out-of-band data
Version: $Revision$
Last-Modified: $Date$
Author: Antoine Pitrou <solipsis(a)pitrou.net>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 23-Mar-2018
Post-History:
Resolution:
Abstract
========
This PEP proposes to standardize a new pickle protocol version, and
accompanying APIs to take full advantage of it:
1. A new pickle protocol version (5) to cover the extra metadata needed
for out-of-band data buffers.
2. A new ``PickleBuffer`` type for ``__reduce_ex__`` implementations
to return out-of-band data buffers.
3. A new ``buffer_callback`` parameter when pickling, to handle out-of-band
data buffers.
4. A new ``buffers`` parameter when unpickling to provide out-of-band data
buffers.
The PEP guarantees unchanged behaviour for anyone not using the new APIs.
Rationale
=========
The pickle protocol was originally designed in 1995 for on-disk persistence
of arbitrary Python objects. The performance of a 1995-era storage medium
probably made it irrelevant to focus on performance metrics such as
use of RAM bandwidth when copying temporary data before writing it to disk.
Nowadays the pickle protocol sees growing use in applications where most
of the data isn't ever persisted to disk (or, when it is, it uses a portable
format instead of a Python-specific one). Instead, pickle is being used to transmit
data and commands from one process to another, either on the same machine
or on multiple machines. Those applications will sometimes deal with very
large data (such as Numpy arrays or Pandas dataframes) that need to be
transferred around. For those applications, pickle is currently
wasteful as it imposes spurious memory copies of the data being serialized.
As a matter of fact, the standard ``multiprocessing`` module uses pickle
for serialization, and therefore also suffers from this problem when
sending large data to another process.
Third-party Python libraries, such as Dask [#dask]_, PyArrow [#pyarrow]_
and IPyParallel [#ipyparallel]_, have started implementing alternative
serialization schemes with the explicit goal of avoiding copies on large
data. Implementing a new serialization scheme is difficult and often
leads to reduced generality (since many Python objects support pickle
but not the new serialization scheme). Falling back on pickle for
unsupported types is an option, but then you get back the spurious
memory copies you wanted to avoid in the first place. For example,
``dask`` is able to avoid memory copies for Numpy arrays and
built-in containers thereof (such as lists or dicts containing Numpy
arrays), but if a large Numpy array is an attribute of a user-defined
object, ``dask`` will serialize the user-defined object as a pickle
stream, leading to memory copies.
The common theme of these third-party serialization efforts is to generate
a stream of object metadata (which contains pickle-like information about
the objects being serialized) and a separate stream of zero-copy buffer
objects for the payloads of large objects. Note that, in this scheme,
small objects such as ints, etc. can be dumped together with the metadata
stream. Refinements can include opportunistic compression of large data
depending on its type and layout, like ``dask`` does.
This PEP aims to make ``pickle`` usable in a way where large data is handled
as a separate stream of zero-copy buffers, letting the application handle
those buffers optimally.
Example
=======
To keep the example simple and avoid requiring knowledge of third-party
libraries, we will focus here on a bytearray object (but the issue is
conceptually the same with more sophisticated objects such as Numpy arrays).
Like most objects, the bytearray object isn't immediately understood by
the pickle module and must therefore specify its decomposition scheme.
Here is how a bytearray object currently decomposes for pickling::
>>> b.__reduce_ex__(4)
(<class 'bytearray'>, (b'abc',), None)
This is because the ``bytearray.__reduce_ex__`` implementation conceptually
reads as follows::

    class bytearray:

        def __reduce_ex__(self, protocol):
            if protocol == 4:
                return type(self), (bytes(self),), None
            # Legacy code for earlier protocols omitted
In turn it produces the following pickle code::
>>> pickletools.dis(pickletools.optimize(pickle.dumps(b, protocol=4)))
0: \x80 PROTO 4
2: \x95 FRAME 30
11: \x8c SHORT_BINUNICODE 'builtins'
21: \x8c SHORT_BINUNICODE 'bytearray'
32: \x93 STACK_GLOBAL
33: C SHORT_BINBYTES b'abc'
38: \x85 TUPLE1
39: R REDUCE
40: . STOP
(the call to ``pickletools.optimize`` above is only meant to make the
pickle stream more readable by removing the MEMOIZE opcodes)
We can notice several things about the bytearray's payload (the sequence
of bytes ``b'abc'``):
* ``bytearray.__reduce_ex__`` produces a first copy by instantiating a
new bytes object from the bytearray's data.
* ``pickle.dumps`` produces a second copy when inserting the contents of
that bytes object into the pickle stream, after the SHORT_BINBYTES opcode.
* Furthermore, when deserializing the pickle stream, a temporary bytes
object is created when the SHORT_BINBYTES opcode is encountered (inducing
a data copy).
What we really want is something like the following:
* ``bytearray.__reduce_ex__`` produces a *view* of the bytearray's data.
* ``pickle.dumps`` doesn't try to copy that data into the pickle stream
but instead passes the buffer view to its caller (which can decide on the
most efficient handling of that buffer).
* When deserializing, ``pickle.loads`` takes the pickle stream and the
buffer view separately, and passes the buffer view directly to the
bytearray constructor.
We see that several conditions are required for the above to work:
* ``__reduce__`` or ``__reduce_ex__`` must be able to return *something*
that indicates a serializable no-copy buffer view.
* The pickle protocol must be able to represent references to such buffer
views, instructing the unpickler that it may have to get the actual buffer
out of band.
* The ``pickle.Pickler`` API must provide its caller with a way
to receive such buffer views while serializing.
* The ``pickle.Unpickler`` API must similarly allow its caller to provide
the buffer views required for deserialization.
* For compatibility, the pickle protocol must also be able to contain direct
serializations of such buffer views, such that current uses of the ``pickle``
API don't have to be modified if they are not concerned with memory copies.
Producer API
============
We are introducing a new type ``pickle.PickleBuffer`` which can be
instantiated from any buffer-supporting object, and is specifically meant
to be returned from ``__reduce__`` implementations::
    class bytearray:

        def __reduce_ex__(self, protocol):
            if protocol == 5:
                return type(self), (PickleBuffer(self),), None
            # Legacy code for earlier protocols omitted
``PickleBuffer`` is a simple wrapper that doesn't have all the memoryview
semantics and functionality, but is specifically recognized by the ``pickle``
module if protocol 5 or higher is enabled. It is an error to try to
serialize a ``PickleBuffer`` with pickle protocol version 4 or earlier.
Only the raw *data* of the ``PickleBuffer`` will be considered by the
``pickle`` module. Any type-specific *metadata* (such as shapes or
datatype) must be returned separately by the type's ``__reduce__``
implementation, as is already the case.
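For illustration, a hypothetical array-like type could wrap only its raw
payload in a ``PickleBuffer`` while passing its shape and datatype through
ordinary reduce arguments (a minimal sketch assuming the proposed
``pickle.PickleBuffer`` type; ``MyArray`` and ``_rebuild_myarray`` are
invented for this example)::

    import pickle

    def _rebuild_myarray(data, shape, dtype):
        # *data* is whatever buffer the unpickler hands us: a bytes-like
        # object when serialized in-band, or an out-of-band buffer.
        return MyArray(data, shape, dtype)

    class MyArray:
        def __init__(self, data, shape, dtype):
            self.data, self.shape, self.dtype = data, shape, dtype

        def __reduce_ex__(self, protocol):
            if protocol == 5:
                # Only the raw payload goes through the PickleBuffer;
                # shape and dtype travel with the metadata stream.
                return _rebuild_myarray, (pickle.PickleBuffer(self.data),
                                          self.shape, self.dtype)
            return super().__reduce_ex__(protocol)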
PickleBuffer objects
--------------------
The ``PickleBuffer`` class supports a very simple Python API. Its constructor
takes a single PEP 3118-compatible object [#pep-3118]_. ``PickleBuffer``
objects themselves support the buffer protocol, so consumers can
call ``memoryview(...)`` on them to get additional information
about the underlying buffer (such as the original type, shape, etc.).
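As a small illustration of this Python-level API (a sketch assuming the
proposed type; details may differ in the final implementation)::

    import pickle

    data = bytearray(b"abc")
    buf = pickle.PickleBuffer(data)    # wraps the bytearray without copying

    view = memoryview(buf)             # PickleBuffer supports the buffer protocol
    print(view.nbytes, view.readonly)  # 3 False
    print(bytes(view))                 # b'abc' (this conversion does copy)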
On the C side, a simple API will be provided to create and inspect
PickleBuffer objects:
``PyObject *PyPickleBuffer_FromObject(PyObject *obj)``
Create a ``PickleBuffer`` object holding a view over the PEP 3118-compatible
*obj*.
``PyPickleBuffer_Check(PyObject *obj)``
Return whether *obj* is a ``PickleBuffer`` instance.
``const Py_buffer *PyPickleBuffer_GetBuffer(PyObject *picklebuf)``
Return a pointer to the internal ``Py_buffer`` owned by the ``PickleBuffer``
instance.
``PickleBuffer`` can wrap any kind of buffer, including non-contiguous
buffers. It's up to consumers to decide how best to handle different kinds
of buffers (for example, some consumers may find it acceptable to make a
contiguous copy of non-contiguous buffers).
Consumer API
============
``pickle.Pickler.__init__`` and ``pickle.dumps`` are augmented with an additional
``buffer_callback`` parameter::
    class Pickler:
        def __init__(self, file, protocol=None, ..., buffer_callback=None):
            """
            If *buffer_callback* is not None, then it is called with a list
            of out-of-band buffer views when deemed necessary (this could be
            once per buffer, or only after a certain size is reached,
            or once at the end, depending on implementation details).  The
            callback should arrange to store or transmit those buffers without
            changing their order.

            If *buffer_callback* is None (the default), buffer views are
            serialized into *file* as part of the pickle stream.

            It is an error if *buffer_callback* is not None and *protocol* is
            None or smaller than 5.
            """

    def pickle.dumps(obj, protocol=None, *, ..., buffer_callback=None):
        """
        See above for *buffer_callback*.
        """
``pickle.Unpickler.__init__`` and ``pickle.loads`` are augmented with an
additional ``buffers`` parameter::
    class Unpickler:
        def __init__(self, file, *, ..., buffers=None):
            """
            If *buffers* is not None, it should be an iterable of buffer-enabled
            objects that is consumed each time the pickle stream references
            an out-of-band buffer view.  Such buffers have been given in order
            to the *buffer_callback* of a Pickler object.

            If *buffers* is None (the default), then the buffers are taken
            from the pickle stream, assuming they are serialized there.

            It is an error for *buffers* to be None if the pickle stream
            was produced with a non-None *buffer_callback*.
            """

    def pickle.loads(data, *, ..., buffers=None):
        """
        See above for *buffers*.
        """
Protocol changes
================
Three new opcodes are introduced:
* ``BYTEARRAY`` creates a bytearray from the data following it in the pickle
stream and pushes it on the stack (just like ``BINBYTES8`` does for bytes
objects);
* ``NEXT_BUFFER`` fetches a buffer from the ``buffers`` iterable and pushes
it on the stack.
* ``READONLY_BUFFER`` makes a readonly view of the top of the stack.
When pickling encounters a ``PickleBuffer``, there can be four cases:
* If a ``buffer_callback`` is given and the ``PickleBuffer`` is writable,
the ``PickleBuffer`` is given to the callback and a ``NEXT_BUFFER`` opcode
is appended to the pickle stream.
* If a ``buffer_callback`` is given and the ``PickleBuffer`` is readonly,
the ``PickleBuffer`` is given to the callback and a ``NEXT_BUFFER`` opcode
is appended to the pickle stream, followed by a ``READONLY_BUFFER`` opcode.
* If no ``buffer_callback`` is given and the ``PickleBuffer`` is writable,
it is serialized into the pickle stream as if it were a ``bytearray`` object.
* If no ``buffer_callback`` is given and the ``PickleBuffer`` is readonly,
it is serialized into the pickle stream as if it were a ``bytes`` object.
The distinction between readonly and writable buffers is explained below
(see "Mutability").
Caveats
=======
Mutability
----------
PEP 3118 buffers [#pep-3118]_ can be readonly or writable. Some objects,
such as Numpy arrays, need to be backed by a mutable buffer for full
operation. Pickle consumers that use the ``buffer_callback`` and ``buffers``
arguments will have to be careful to recreate mutable buffers. When doing
I/O, this implies using buffer-passing API variants such as ``readinto``
(which are also often preferable for performance).
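For instance, a consumer could preallocate writable ``bytearray`` objects and
fill them in place with ``readinto``, so that reconstructed objects end up
backed by mutable buffers (a simplified sketch; the length-prefixed framing is
invented for this example, and short reads are not handled)::

    import pickle
    import struct

    def send(stream, obj):
        buffers = []
        data = pickle.dumps(obj, protocol=5, buffer_callback=buffers.extend)
        for payload in (data, *buffers):
            view = memoryview(payload)          # works for bytes and PickleBuffer
            stream.write(struct.pack("<Q", view.nbytes))
            stream.write(view)
        stream.write(struct.pack("<Q", 0))      # end-of-message marker

    def receive(stream):
        frames = []
        while True:
            (size,) = struct.unpack("<Q", stream.read(8))
            if size == 0:
                break
            frame = bytearray(size)             # writable backing storage
            stream.readinto(frame)              # fill it in place, no extra copy
            frames.append(frame)
        data, *buffers = frames
        return pickle.loads(data, buffers=buffers)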
Data sharing
------------
If you pickle and then unpickle an object in the same process, passing
out-of-band buffer views, then the unpickled object may be backed by the
same buffer as the original pickled object.
For example, it might be reasonable to implement reduction of a Numpy array
as follows (crucial metadata such as shapes is omitted for simplicity)::
    class ndarray:

        def __reduce_ex__(self, protocol):
            if protocol == 5:
                return numpy.frombuffer, (PickleBuffer(self), self.dtype)
            # Legacy code for earlier protocols omitted
Then simply passing the PickleBuffer around from ``dumps`` to ``loads``
will produce a new Numpy array sharing the same underlying memory as the
original Numpy object (and, incidentally, keeping it alive)::
>>> import numpy as np
>>> a = np.zeros(10)
>>> a[0]
0.0
>>> buffers = []
>>> data = pickle.dumps(a, protocol=5, buffer_callback=buffers.extend)
>>> b = pickle.loads(data, buffers=buffers)
>>> b[0] = 42
>>> a[0]
42.0
This won't happen with the traditional ``pickle`` API (i.e. without passing
``buffers`` and ``buffer_callback`` parameters), because then the buffer view
is serialized inside the pickle stream with a copy.
Alternatives
============
The ``pickle`` persistence interface is a way of storing references to
designated objects in the pickle stream while handling their actual
serialization out of band. For example, one might consider the following
for zero-copy serialization of bytearrays::
    class MyPickle(pickle.Pickler):

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.buffers = []

        def persistent_id(self, obj):
            if type(obj) is not bytearray:
                return None
            else:
                index = len(self.buffers)
                self.buffers.append(obj)
                return ('bytearray', index)


    class MyUnpickle(pickle.Unpickler):

        def __init__(self, *args, buffers, **kwargs):
            super().__init__(*args, **kwargs)
            self.buffers = buffers

        def persistent_load(self, pid):
            type_tag, index = pid
            if type_tag == 'bytearray':
                return self.buffers[index]
            else:
                assert 0  # unexpected type
This mechanism has two drawbacks:
* Each ``pickle`` consumer must reimplement ``Pickler`` and ``Unpickler``
  subclasses, with custom code for each type of interest. Essentially,
  N pickle consumers end up each implementing custom code for M producers.
  This is difficult (especially for sophisticated types such as Numpy
  arrays) and scales poorly.
* Each object encountered by the pickle module (even simple built-in objects
  such as ints and strings) triggers a call to the user's ``persistent_id()``
  method, leading to a possible performance drop compared to the nominal case.
Open questions
==============
Should ``buffer_callback`` take a single buffer or a sequence of buffers?
* Taking a single buffer would allow returning a boolean indicating whether
the given buffer is serialized in-band or out-of-band.
* Taking a sequence of buffers is potentially more efficient by reducing
function call overhead.
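To make the two options concrete, the callback could take either of the
following shapes (purely illustrative; neither signature is part of the
proposal yet)::

    # Option 1: called once per buffer; the return value could indicate
    # whether the buffer should be serialized in-band instead.
    def buffer_callback(buffer):
        out_of_band.append(buffer)
        return True

    # Option 2: called with a sequence of buffers, reducing call overhead.
    def buffer_callback(buffers):
        out_of_band.extend(buffers)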
Related work
============
Dask.distributed implements a custom zero-copy serialization with fallback
to pickle [#dask-serialization]_.
PyArrow implements zero-copy component-based serialization for a few
selected types [#pyarrow-serialization]_.
PEP 554 proposes hosting multiple interpreters in a single process, with
provisions for transferring buffers between interpreters as a communication
scheme [#pep-554]_.
Acknowledgements
================
Thanks to the following people for early feedback: Nick Coghlan, Olivier
Grisel, Stefan Krah, MinRK, Matt Rocklin, Eric Snow.
References
==========
.. [#dask] Dask.distributed -- A lightweight library for distributed computing
in Python
https://distributed.readthedocs.io/
.. [#dask-serialization] Dask.distributed custom serialization
https://distributed.readthedocs.io/en/latest/serialization.html
.. [#ipyparallel] IPyParallel -- Using IPython for parallel computing
https://ipyparallel.readthedocs.io/
.. [#pyarrow] PyArrow -- A cross-language development platform for in-memory data
https://arrow.apache.org/docs/python/
.. [#pyarrow-serialization] PyArrow IPC and component-based serialization
https://arrow.apache.org/docs/python/ipc.html#component-based-serialization
.. [#pep-3118] PEP 3118 -- Revising the buffer protocol
https://www.python.org/dev/peps/pep-3118/
.. [#pep-554] PEP 554 -- Multiple Interpreters in the Stdlib
https://www.python.org/dev/peps/pep-0554/
Copyright
=========
This document has been placed into the public domain.
Hi,
On Twitter, Raymond Hettinger wrote:
"The decision making process on Python-dev is an anti-pattern,
governed by anecdotal data and ambiguity over what problem is solved."
https://twitter.com/raymondh/status/887069454693158912
About "anecdotal data", I would like to discuss the Python startup time.
== Python 3.7 compared to 2.7 ==
First of all, on speed.python.org, we have:
* Python 2.7: 6.4 ms with site, 3.0 ms without site (-S)
* master (3.7): 14.5 ms with site, 8.4 ms without site (-S)
Python 3.7 startup time is 2.3x slower with site (default mode), or
2.8x slower without site (-S command line option).
(I will skip Python 3.4, 3.5 and 3.6 which are much worse than Python 3.7...)
So if a user complained about Python 2.7 startup time, be prepared
for that user to be 2x - 3x angrier when "forced" to upgrade to Python 3!
== Mercurial vs Git, Python vs C, startup time ==
Startup time matters a lot for Mercurial since Mercurial is compared
to Git. Git and Mercurial have similar features, but Git is written in
C whereas Mercurial is written in Python. Quick benchmark on the
speed.python.org server:
* hg version: 44.6 ms +- 0.2 ms
* git --version: 974 us +- 7 us
Mercurial startup time is already 45.8x slower than Git, even though the
tested Mercurial runs on Python 2.7.12. Now try to sell Python 3 to Mercurial
developers, with a startup time 2x - 3x slower...
I tested Mercurial 3.7.3 and Git 2.7.4 on Ubuntu 16.04.1 using "python3
-m perf command -- ...".
== CPython core developers don't care? No, they do care ==
Christian Heimes, Naoki INADA, Serhiy Storchaka, Yury Selivanov, me
(Victor Stinner) and other core developers have made multiple changes over
the last few years to reduce the number of imports at startup, optimize
importlib, etc.
IMHO all these core developers are well aware of the competition between
programming languages, and honestly, Python startup time isn't "good".
So let's compare it to other programming languages similar to Python.
== PHP, Ruby, Perl ==
I measured the startup time of other programming languages which are
similar to Python, still on the speed.python.org server using "python3
-m perf command -- ...":
* perl -e ' ': 1.18 ms +- 0.01 ms
* php -r ' ': 8.57 ms +- 0.05 ms
* ruby -e ' ': 32.8 ms +- 0.1 ms
Wow, Perl is quite good! PHP seems as good as Python 2 (but Python 3
is worse). Ruby startup time seems less optimized than that of the other
languages.
Tested versions:
* perl 5, version 22, subversion 1 (v5.22.1)
* PHP 7.0.18-0ubuntu0.16.04.1 (cli) ( NTS )
* ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]
== Quick Google search ==
I also searched for "python startup time" and "python slow startup
time" on Google and found many articles. Some examples:
"Reducing the Python startup time"
http://www.draketo.de/book/export/html/498
=> "The python startup time always nagged me (17-30ms) and I just
searched again for a way to reduce it, when I found this: The
Python-Launcher caches GTK imports and forks new processes to reduce
the startup time of python GUI programs."
https://nelsonslog.wordpress.com/2013/04/08/python-startup-time/
=> "Wow, Python startup time is worse than I thought."
"How to speed up python starting up and/or reduce file search while
loading libraries?"
https://stackoverflow.com/questions/15474160/how-to-speed-up-python-startin…
=> "The first time I log to the system and start one command it takes
6 seconds just to show a few line of help. If I immediately issue the
same command again it takes 0.1s. After a couple of minutes it gets
back to 6s. (proof of short-lived cache)"
"How does one optimise the startup of a Python script/program?"
https://www.quora.com/How-does-one-optimise-the-startup-of-a-Python-script-…
=> "I wrote a Python program that would be used very often (imagine
'cd' or 'ls') for very short runtimes, how would I make it start up as
fast as possible?"
"Python Interpreter Startup time"
https://bytes.com/topic/python/answers/34469-pyhton-interpreter-startup-time
"Python is very slow to start on Windows 7"
https://stackoverflow.com/questions/29997274/python-is-very-slow-to-start-o…
=> "Python takes 17 times longer to load on my Windows 7 machine than
Ubuntu 14.04 running on a VM"
=> "returns in 0.614s on Windows and 0.036s on Linux"
"How to make a fast command line tool in Python" (old article Python 2.5.2)
https://files.bemusement.org/talks/OSDC2008-FastPython/
=> "(...) some techniques Bazaar uses to start quickly, such as lazy imports."
--
So please continue the efforts to make Python startup even faster, beat
all other programming languages, and finally convince Mercurial to
upgrade ;-)
Victor
Hi folks,
As some people here know I've been working off and on for a while to
improve CPython's support of Cygwin. I'm motivated in part by a need
to have software working on Python 3.x on Cygwin for the foreseeable
future, preferably with minimal graft. (As an incidental side-effect
Python's test suite--especially of system-level functionality--serves
as an interesting test suite for Cygwin itself too.)
This is partly what motivated PEP 539 [1], although that PEP had the
advantage of benefiting other POSIX-compatible platforms as well (and
in fact was fixing an aspect of CPython that made it unfriendly to
supporting other platforms).
As far as I can tell, the first commit to Python to add any kind of
support for Cygwin was made by Guido (committing a contributed patch)
back in 1999 [2]. Since then, bits and pieces have been added for
Cygwin's benefit over time, with varying degrees of impact in terms of
#ifdefs and the like (for the most part Cygwin does not require *much*
in the way of special support, but it does have some differences from
a "normal" POSIX-compliant platform, such as the possibility for
case-insensitive filesystems and executables that end in .exe). I
don't know whether it's ever been "officially supported", but perhaps someone
with a longer memory of the project can comment on that. I'm not sure
whether it was discussed at all in the context of PEP 11.
I have personally put in a fair amount of effort already in either
fixing issues on Cygwin (many of these issues also impact MinGW), or
more often than not fixing issues in the CPython test suite on
Cygwin--these are mostly tests that are broken due to invalid
assumptions about the platform (for example, that there is always a
"root" user with uid=0; this is not the case on Cygwin). In other
cases some tests need to be skipped or worked around due to
platform-specific bugs, and Cygwin is hardly the only case of this in
the test suite.
I also have an experimental AppVeyor configuration for running the
tests on Cygwin [3], as well as an experimental buildbot (not
available on the internet, but working). These currently rely on a
custom branch that includes fixes needed for the test suite to run to
completion without crashing or hanging (e.g.
https://bugs.python.org/issue31885). It would be nice to add this as
an official buildbot, but I'm not sure if it makes sense to do that
until it's "green", or at least not crashing. I have several other
patches to the tests toward this goal, and am currently down to ~22
tests failing.
Before I do any more work on this, however, it would be best to once
and for all clarify the support for Cygwin in CPython, as it has never
been "officially supported" nor unsupported--this way we can avoid
having this discussion every time a patch related to Cygwin comes up.
I could provide some arguments for why I believe Cygwin should be
supported, but before this gets too long I'd just like to float the
idea of having the discussion in the first place. It's also not
exactly clear to me how to meet the standards in PEP 11 for supporting
a platform--in particular it's not clear when a buildbot is considered
"stable", or how to achieve that without getting necessary fixes
merged into the main branch in the first place.
Thanks,
Erik
[1] https://www.python.org/dev/peps/pep-0539/
[2] https://github.com/python/cpython/commit/717d1fdf2acbef5e6b47d9b4dcf48ef182…
[3] https://ci.appveyor.com/project/embray/cpython
Hi,
I have been asked to express myself on the PEP 572. I'm not sure that
it's useful, but here is my personal opinion on the proposed
"assignment expressions".
PEP 572 -- Assignment Expressions:
https://www.python.org/dev/peps/pep-0572/
First of all, I concur with others: Chris Angelico did a great job
designing a good and complete PEP, and a working implementation which is
also useful to play with!
WARNING! I was (strongly) opposed to PEP 448 Unpacking Generalizations
(ex: [1, 2, *list]) and PEP 498 f-string (f"Hello {name}"), whereas I
am now a happy user of these new syntaxes. So I'm not sure that I have
good tastes :-)
Tim Peters gave the following example. "LONG" version:

    diff = x - x_base
    if diff:
        g = gcd(diff, n)
        if g > 1:
            return g
versus the "SHORT" version:
    if (diff := x - x_base) and (g := gcd(diff, n)) > 1:
        return g
== Write ==
If your job is to write code: the SHORT version can be preferred since
it's closer to what you have in mind and the code is shorter. When you
read your own code, it seems straightforward and you like to see
everything on the same line.
The LONG version looks like your expressiveness is limited by the
computer. It's like having to use simple words when you talk to a
child, because a child is unable to understand more subtle and
advanced sentences. You want to write beautiful code for adults,
right?
== Read and Understand ==
In my professional experience, I spent most of my time on reading
code, rather than writing code. By reading, I mean: try to understand
why this specific bug that cannot occur... is always reproduced by the
customer, whereas we fail to reproduce it in our test lab :-) This bug
is impossible, you know it, right?
So let's say that you never read the example before, and it has a bug.
By "reading the code", I really mean understanding here. In your
opinion, which version is easier to *understand*, without actually
running the code?
IMHO the LONG version is simpler to understand, since the code is
straightforward, it's easy to "guess" the *control flow* (guess in
which order instructions will be executed).
Print the code on paper and try to draw lines to follow the control
flow. It may make it easier to see why SHORT is more complex to follow
than LONG.
== Debug ==
Now let's imagine that you can run the code (someone succeeded to
reproduce the bug in the test lab!). Since it has a bug, you now
likely want to try to understand why the bug occurs using a debugger.
Sadly, most debuggers are designed as if a single line of code could only
execute a single instruction. I tried pdb: you cannot run only (diff
:= x - x_base) and then inspect the value of "diff" before running the second
assignment; you can only execute the *full line* at once.
I would say that the LONG version is easier to debug, at least using pdb.
I regularly use gdb, which implements the "step" command as I
expect (don't execute the full line, execute sub-expressions one by
one), but it's still harder to follow the control flow when a single
line contains multiple instructions than when debugging lines with a
single instruction.
You can see it as a limitation of pdb, but many tools only have the
granularity of a whole line. Think about tracebacks. If you get an
exception at "line 1" in the SHORT example (the long "if" expression),
what can you deduce from the line number? What happened?
If you get an exception in the LONG example, the line number gives you
a little bit more information... maybe just enough to understand the
bug?
Example showing the pdb limitation:
>>> def f():
...     breakpoint()
...     if (x:=1) and (y:=2): pass
...
>>> f()
> <stdin>(3)f()
(Pdb) p x
*** NameError: name 'x' is not defined
(Pdb) p y
*** NameError: name 'y' is not defined
(Pdb) step
--Return--
> <stdin>(3)f()->None
(Pdb) p x
1
(Pdb) p y
2
... oh, pdb went too far. I expected a break after "x := 1" and before
"y := 2" :-(
== Write code for babies! ==
Please don't write code for yourself, but write code for babies! :-)
These babies are going to maintain your code for the next 5 years,
while you have moved to a different team or project in the meantime. Be
kind to your coworkers and juniors!
I'm trying to write a single instruction per line whenever possible,
even if the language I'm using allows much more complex expressions.
Even though the C language allows assignments in if, I avoid them, because
I regularly have to debug my own code in gdb ;-)
Now the question is which Python features are allowed for babies. I recall
that a colleague was surprised and confused by context managers. Does that
mean that try/finally should be preferred? What about f'Hello
{name.title()}', which calls a method inside a "string" (formatting)? Or
metaclasses? I guess that the limit should depend on your team, and
may be spelled out in the coding style agreed on by your whole team?
Victor
[Victor Stinner]
...
> Tim Peters gave the following example. "LONG" version:
>
>     diff = x - x_base
>     if diff:
>         g = gcd(diff, n)
>         if g > 1:
>             return g
>
> versus the "SHORT" version:
>
>     if (diff := x - x_base) and (g := gcd(diff, n)) > 1:
>         return g
>
> == Write ==
>
> If your job is to write code: the SHORT version can be preferred since
> it's closer to what you have in mind and the code is shorter. When you
> read your own code, it seems straightforward and you like to see
> everything on the same line.
All so, but a bit more: in context, this is just one block in a
complex algorithm. The amount of _vertical_ screen space it consumes
directly affects how much of what comes before and after it can be
seen without scrolling. Understanding this one block in isolation is
approximately useless unless you can also see how it fits into the
whole. Saving 3 lines of 5 is substantial, but it's more often saving
1 of 5 or 6. Regardless, they add up.
> The LONG version looks like your expressiveness is limited by the
> computer. It's like having to use simple words when you talk to a
> child, because a child is unable to understand more subtle and
> advanced sentences. You want to write beautiful code for adults,
> right?
I want _the whole_ to be as transparent as possible. That's a
complicated balancing act in practice.
> == Read and Understand ==
>
> In my professional experience, I spent most of my time on reading
> code, rather than writing code. By reading, I mean: try to understand
> why this specific bug that cannot occur... is always reproduced by the
> customer, whereas we fail to reproduce it in our test lab :-) This bug
> is impossible, you know it, right?
>
> So let's say that you never read the example before, and it has a bug.
Then you're screwed - pay me to fix it ;-) Seriously, as above, this
block on its own is senseless without understanding both the
mathematics behind what it's doing and how all the code before it
picked `x` and `x_base` to begin with.
> By "reading the code", I really mean understanding here. In your
> opinion, which version is easier to *understand*, without actually
> running the code?
Honestly, I find the shorter version a bit easier to understand:
fewer indentation levels, and less semantically empty repetition of
names.
> IMHO the LONG version is simpler to understand, since the code is
> straightforward, it's easy to "guess" the *control flow* (guess in
> which order instructions will be executed).
You're saying you don't know that in "x and y" Python evaluates x
first, and only evaluates y if x "is truthy"? Sorry, but this seems
trivial to me in either spelling.
> Print the code on paper and try to draw lines to follow the control
> flow. It may be easier to understand how SHORT is more complex to
> understand than LONG.
Since they're semantically identical, there's _something_ suspect
about a conclusion that one is _necessarily_ harder to understand than
the other ;-) I don't have a problem with you finding the longer
version easier to understand, but I do have a problem if you have a
problem with me finding the shorter easier.
> == Debug ==
>
> Now let's imagine that you can run the code (someone succeeded to
> reproduce the bug in the test lab!). Since it has a bug, you now
> likely want to try to understand why the bug occurs using a debugger.
>
> Sadly, most debuggers are designed as if a single line of code could only
> execute a single instruction. I tried pdb: you cannot run only (diff
> := x - x_base) and then inspect the value of "diff" before running the second
> assignment; you can only execute the *full line* at once.
>
> I would say that the LONG version is easier to debug, at least using pdb.
That might be a good reason to avoid, say, list comprehensions (highly
complex expressions of just about any kind), but I think this
overlooks the primary _point_ of "binding expressions": to give names
to intermediate results. I couldn't care less if pdb executes the
whole "if" statement in one gulp, because I get exactly the same info
either way: the names `diff` and `g` bound to the results of the
expressions they named. What actual difference does it make whether
pdb binds the names one at a time, or both, before it returns to the
prompt?
Binding expressions are debugger-friendly in that they _don't_ just
vanish without a trace. It's their purpose to _capture_ the values of
the expressions they name. Indeed, you may want to add them all over
the place inside expressions, never intending to use the names, just
so that you can see otherwise-ephemeral intra-expression results in
your debugger ;-)
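For example (a toy illustration, not taken from the thread; all names are
made up), the binding below exists purely so the intermediate value can be
inspected later in pdb or in a traceback:

    # Without ":=", sum(data) / len(data) would be ephemeral and
    # invisible to the debugger once the "if" has been evaluated.
    if (mean := sum(data) / len(data)) > threshold:
        handle_outlier(mean)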
> ... Think about tracebacks. If you get an exception at "line 1" in the
> SHORT example (the long "if" expression), what can you deduce
> from the line number? What happened?
>
> If you get an exception in the LONG example, the line number gives you
> a little bit more information... maybe just enough to understand the
> bug?
This one I wholly agree with, in general. In the specific example at
hand, it's weak, because there's so little that _could_ raise an
exception. For example, if the variables weren't bound to integers,
in context the code would have blown up long before reaching this
block. Python ints are unbounded, so overflow in "-" or "gcd" aren't
possible either. MemoryError is theoretically possible, and in that
case it would be good to know whether it happened during "-" or during
"gcd()". Good to know, but not really helpful, because either way you
ran out of memory :-(
> == Write code for babies! ==
>
> Please don't write code for yourself, but write code for babies! :-)
>
> These babies are going to maintain your code for the next 5 years,
> while you moved to a different team or project in the meanwhile. Be
> kind with your coworkers and juniors!
>
> I'm trying to write a single instruction per line whenever possible,
> even if the used language allows me much more complex expressions.
> Even if the C language allows assignments in if, I avoid them, because
> I regularly have to debug my own code in gdb ;-)
>
> Now the question is which Python are allowed for babies. I recall that
> a colleague was surprised and confused by context managers. Does it
> mean that try/finally should be preferred? What about f'Hello
> {name.title()}' which calls a method into a "string" (formatting)? Or
> metaclasses? I guess that the limit should depend on your team, and
> may be explained in the coding style designed by your whole team?
It's the kind of thing I prefer to leave to team style guides, because
consensus will never be reached. In a different recent thread,
someone complained about using functions at all, because their names
are never wholly accurate, and in any case they hide what's "really"
going on. To my eyes, that was an unreasonably extreme "write code
for babies" position.
If a style guide banned using "and" or "or" in Python "if" or "while"
tests, I'd find that less extreme, but also unreasonable.
But if a style guide banned functions with more than 50 formal
arguments, I'd find that unreasonably tolerant.
Luckily, I only have to write code for me now, so am free to pick the
perfect compromise in every case ;-)
On 2018-05-19 15:29, Nick Coghlan wrote:
> That's not how code reviews work, as their complexity is governed by the
> number of lines changed (added/removed/modified), not just the number of
> lines that are left at the end.
Of course, you are right. I didn't mean literally that only the end
result matters. But it should certainly be considered.
If you only do small incremental changes, complexity tends to build up
because choices which are locally optimal are not always globally
optimal. Sometimes you need to do some refactoring to revisit some of
that complexity. This is part of what PEP 575 does.
> That said, "deletes more lines than it
> adds" is typically a point strongly in favour of a particular change.
This certainly won't be true for my patch, because there is a lot of
code that I need to support for backwards compatibility (all the old
code for method_descriptor in particular).
Going back to the review of PEP 575, I see the following possible outcomes:
(A) Accept it as is (possibly with minor changes).
(B) Accept the general idea but split the details up in several PEPs
which can still be discussed individually.
(C) Accept a minimal variant of PEP 575, only changing existing classes
but not changing the class hierarchy.
(D) Accept some yet-to-be-written variant of PEP 575.
(E) Don't fix the use case that PEP 575 wants to address.
Petr Viktorin suggests (C). I am personally quite hesitant because that
only adds complexity and it wouldn't be the best choice for the future
maintainability of CPython. I also fear that this hypothetical PEP
variant would be rejected for that very reason. Of course, if there is
some general agreement that (C) is the way to go, then that is fine with me.
If people feel that PEP 575 is currently too complex, I think that (B)
is a very good compromise. The end result would be the same as what PEP
575 proposes. Instead of changing many things at once, we could handle
each class in a separate PEP. But the motivation of those mini-PEPs will
still be PEP 575. So, in order for this to make sense, the general idea
of PEP 575 needs to be accepted: adding a base_function base class and
making various existing classes subclasses of that.
Jeroen.
Hi,
I fixed a few tests which failed randomly. There are still a few, but
the most annoying have been fixed. I will continue to keep an eye on
our CIs: Travis CI, AppVeyor and buildbots. Sometimes, even when I
report a regression, the author doesn't fix the bug. But when a test
fails on Travis CI or AppVeyor, we cannot merge a pull request, so it
blocks our whole workflow.
I will restart the policy that I proposed last year: if a change
introduces a regression and I'm unable to fix it quickly (say in less
than 2 hours and the author isn't available), I will simply revert the
change. Please don't take it personally, the purpose is just to
unblock our workflow and be able to detect other regressions. It's
just an opportunity for the change author to fix the change without
the pressure of the broken CI.
Buildbots only send email notifications to buildbot-status(a)python.org
when the state changes from success (green) to failure (red). It's
much simpler for me to spot a regression when most buildbots are
green.
By the way, while miss-islington is really an amazing tool (thanks
Mariatta!!!), most core developers (including me!) forgot that 2 years
ago, the policy was to check that a change doesn't break the buildbots
before backporting it (well, 2 years ago we forward-ported
changes, but it's the same idea ;-)). Please remember that we only run
tests on Linux and Windows as pre-commit checks on GitHub pull
requests, whereas it's very common to spot bugs on buildbots thanks to
the diversity of platforms and architectures (and different
performance).
All buildbot builders can be watched at:
http://buildbot.python.org/all/#/builders
But I prefer to watch notifications on IRC (#python-dev on Freenode)
and the buildbot-status mailing list.
https://mail.python.org/mm3/mailman3/lists/buildbot-status.python.org/
--
Night gathers, AND NOW MY WATCH BEGINS.
IT SHALL NOT END UNTIL MY DEATH.
I shall take no wife, hold no lands, father no children.
I shall wear no crowns and win no glory.
I shall live and die at my post.
I am the sword in the darkness.
I am the watcher on the walls.
I am the fire that burns against cold, the light that brings the dawn,
the horn that wakes the sleepers, the shield that guards the realms of
men.
I pledge my life and honor to the Night's Watch, for this night and
all the nights to come.
Victor
Hi,
tl;dr: I will withdraw PEP 546 in one week if nobody shows up to
finish the implementation.
Last year, I wrote PEP 546 with Cory Benfield:
"Backport ssl.MemoryBIO and ssl.SSLObject to Python 2.7"
https://www.python.org/dev/peps/pep-0546/
The plan was to get a Python 2.7 implementation of Cory's PEP 543:
"A Unified TLS API for Python"
https://www.python.org/dev/peps/pep-0543/
Sadly, it seems like Cory is no longer available to work on the project
(PEP 543 is still a draft).
The PEP 546 is implemented:
https://github.com/python/cpython/pull/2133
Well, I closed it, but you can still get it as a patch with:
https://patch-diff.githubusercontent.com/raw/python/cpython/pull/2133.patch
But tests fail on Travis CI, while I'm unable to reproduce the issue
on my laptop (on Fedora). The failure seems to depend on the version
of OpenSSL. Christian Heimes has a "multissl" tool which automates
tests on multiple OpenSSL versions, but I failed to find time to try
this tool.
Time flies, and one year later the PR for PEP 546 is still not
merged and tests are still failing.
One month ago, when 2.7.15 was released, Benjamin Peterson,
Python 2.7 release manager, simply proposed:
"The lack of movement for a year makes me wonder if PEP 546 should be
moved to Withdrawn status."
Since, again, I failed to find time to look at the test_ssl failure, I
plan to withdraw the PEP next week if nobody shows up :-( Sorry, Python
2.7!
Would anyone benefit from MemoryBIO in Python 2.7? Twisted,
asyncio, trio, urllib3, anyone else? If yes, who volunteers to
finish the MemoryBIO backport (and maintain it)?
Victor
Hi,
since dict keys have kept their insertion order since Python 3.6, and this
has been part of the Python language specification since 3.7, a proposal has
been made in bpo-33462 to add the __reversed__ method to dict and dict views.
Concerns have been raised in the comments that this feature may add too much
bloat in the core interpreter and be harmful for other Python implementations.
Given the different issues this change creates, I see three possibilities:
1. Accept the proposal as it is for dict and dict views; this would add about
300 lines and three new types in dictobject.c.
2. Accept the proposal only for dict; this would add about 80 lines and one
new type in dictobject.c while still being useful for some use cases.
3. Drop the proposal as a whole: while it has some use, reversed(dict(a=1, b=2))
may not be very common and can already be done using OrderedDict instead
(see the sketch below).
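To illustrate, under the proposal the commented line below would work
directly, whereas today one has to go through OrderedDict or materialize a
list first (a small sketch, assuming the semantics discussed in bpo-33462):

    from collections import OrderedDict

    d = dict(a=1, b=2)

    # Proposed: iterate keys in reverse insertion order.
    # list(reversed(d))                # ['b', 'a']

    # Possible today without the proposal:
    list(reversed(OrderedDict(d)))     # ['b', 'a']
    list(reversed(list(d)))            # ['b', 'a']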
What’s your stance on the issue?
Best regards,
Rémi Lapeyre