[Python-ideas] namedtuple literals [Was: RE a new namedtuple]

Wed Jul 19 20:14:21 EDT 2017

On Tue, Jul 18, 2017 at 6:31 AM, Guido van Rossum <guido at python.org> wrote:

> On Mon, Jul 17, 2017 at 6:25 PM, Eric Snow <ericsnowcurrently at gmail.com>
>  wrote:
>
>> On Mon, Jul 17, 2017 at 6:01 PM, Ethan Furman <ethan at stoneleaf.us> wrote:
>> > Guido has decreed that namedtuple shall be reimplemented with speed in
>> mind.
>>
>> FWIW, I'm sure that any changes to namedtuple will be kept as minimal
>> as possible.  Changes would be limited to the underlying
>> implementation, and would not include the namedtuple() signature, or
>> using metaclasses, etc.  However, I don't presume to speak for Guido
>> or Raymond. :)
>>
>
> Indeed. I referred people here for discussion of ideas like this:
>
> >>> a = (x=1, y=0)
>

Thanks for bringing this up. I'm going to summarize my idea in the form of
a PEP-like draft, hoping to collect some feedback.

Proposal
========

Introduction of a new syntax and builtin function to create lightweight
namedtuples "on the fly" as in:

    >>> (x=10, y=20)
    (x=10, y=20)

    >>> ntuple(x=10, y=20)
    (x=10, y=20)
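
To make the intended behaviour concrete, here is a rough pure-Python sketch
of what ntuple() could do. It is only an approximation built on top of
collections.namedtuple: the helper name _ntuple_class and the caching
strategy are my own assumptions, and the repr differs from the proposed
"(x=10, y=20)" form.

```python
from collections import namedtuple
from functools import lru_cache


@lru_cache(maxsize=None)
def _ntuple_class(field_names):
    # Hypothetical helper: one class per distinct tuple of field names,
    # created lazily on first use and cached afterwards.
    return namedtuple("ntuple", field_names)


def ntuple(**fields):
    # Keyword argument order is preserved (Python 3.6+), so it can be
    # used directly as the field order.
    return _ntuple_class(tuple(fields))(*fields.values())


p = ntuple(x=10, y=20)
print(p)          # ntuple(x=10, y=20)
print(p.x, p[1])  # 10 20
x, y = p          # unpacks like a plain tuple
```

Note that in this sketch ntuple(x=1, y=2) and ntuple(y=2, x=1) produce
instances of two different classes, because the field order is part of the
cache key.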

Motivations
===========

Avoid declaration
-----------------

Other than the startup time cost:
https://mail.python.org/pipermail/python-dev/2017-July/148592.html
...the fact that namedtuples need to be declared upfront means they
mostly end up being used only in public, end-user APIs / functions. For
generic functions returning more than one value it would be nice to just
do:

    def get_coordinates():
        return (x=10, y=20)

...instead of:

    from collections import namedtuple

    Coordinates = namedtuple('Coordinates', ['x', 'y'])

    def get_coordinates():
        return Coordinates(10, 20)

Declaration also has the drawback of unnecessarily polluting the module API
with an object (Coordinates) which is rarely needed. AFAIU namedtuple was
designed this way for efficiency of the pure-python implementation
currently in place and for serialization purposes (e.g. pickle), but I may
be missing something else. Generally namedtuples are declared in a private
module and imported from elsewhere, and they are never exposed in the main
namespace, which is kind of annoying. In single-module scripts it's not
uncommon to add a leading underscore to the name, which makes the repr
uglier. To me, this suggests that the factory function should have been a
first-class function instead.

Speed
------

Other than the startup declaration overhead, a namedtuple is slower than a
tuple or a C structseq in almost every respect:

- Declaration (50x slower than cnamedtuple):

    $ python3.7 -m timeit -s "from collections import namedtuple" \
        "namedtuple('Point', ('x', 'y'))"
    1000 loops, best of 5: 264 usec per loop

    $ python3.7 -m timeit -s "from cnamedtuple import namedtuple" \
        "namedtuple('Point', ('x', 'y'))"
    50000 loops, best of 5: 5.27 usec per loop

- Instantiation (3.5x slower than tuple):

    $ python3.7 -m timeit -s "import collections; \
        Point = collections.namedtuple('Point', ('x', 'y')); x = [1, 2]" \
        "Point(*x)"
    1000000 loops, best of 5: 310 nsec per loop

    $ python3.7 -m timeit -s "x = [1, 2]" "tuple(x)"
    5000000 loops, best of 5: 88 nsec per loop

- Unpacking (2.8x slower than tuple):

    $ python3.7 -m timeit -s "import collections; \
        p = collections.namedtuple('Point', ('x', 'y'))(5, 11)" "x, y = p"
    5000000 loops, best of 5: 41.9 nsec per loop

    $ python3.7 -m timeit -s "p = (5, 11)" "x, y = p"
    20000000 loops, best of 5: 14.8 nsec per loop

- Field access by name (1.9x slower than structseq and cnamedtuple):

    $ python3.7 -m timeit -s "from collections import namedtuple as nt; \
        p = nt('Point', ('x', 'y'))(5, 11)" "p.x"
    5000000 loops, best of 5: 42.7 nsec per loop

    $ python3.7 -m timeit -s "from cnamedtuple import namedtuple as nt; \
        p = nt('Point', ('x', 'y'))(5, 11)" "p.x"
    10000000 loops, best of 5: 22.5 nsec per loop

    $ python3.7 -m timeit -s "import os; p = os.times()" "p.user"
    10000000 loops, best of 5: 22.6 nsec per loop

- Field access by index is the same as tuple:

    $ python3.7 -m timeit -s "from collections import namedtuple as nt; \
        p = nt('Point', ('x', 'y'))(5, 11)" "p[0]"
    10000000 loops, best of 5: 20.3 nsec per loop

    $ python3.7 -m timeit -s "p = (5, 11)" "p[0]"
    10000000 loops, best of 5: 20.5 nsec per loop

It has been suggested that most of these speed complaints aren't an
issue, but in certain circumstances such as busy loops, attribute access
(getattr) being 1.9x slower could make a difference, e.g.:
https://github.com/python/cpython/blob/3e2ad8ec61a322370a6fbdfb2209cf74546f5e08/Lib/asyncio/selector_events.py#L523
The same goes for value unpacking.
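
For anyone who wants to repeat the field access and unpacking measurements
without the shell one-liners, a small script along these lines should do.
It is just a sketch: the best() helper and the loop count are my own
choices, it only covers the stdlib namedtuple vs. a plain tuple (not
cnamedtuple), and the absolute numbers will vary by machine and Python
version.

```python
import timeit
from collections import namedtuple

Point = namedtuple("Point", ("x", "y"))
p = Point(5, 11)
t = (5, 11)


def best(stmt, names):
    # Hypothetical helper: best of 5 runs, in nanoseconds per loop.
    number = 100_000
    runs = timeit.repeat(stmt, globals=names, repeat=5, number=number)
    return min(runs) / number * 1e9


results = {
    "namedtuple p.x": best("p.x", {"p": p}),
    "namedtuple p[0]": best("p[0]", {"p": p}),
    "namedtuple x, y = p": best("x, y = p", {"p": p}),
    "tuple x, y = t": best("x, y = t", {"t": t}),
}

for label, nsec in results.items():
    print(f"{label:22s} {nsec:8.1f} nsec per loop")
```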

isinstance()
------------

Probably a minor complaint; I only bring this up because I recently had to
do this in psutil's unit tests. Anyway, checking whether an object is a
namedtuple instance isn't exactly straightforward:
https://stackoverflow.com/a/2166841
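
Since there is no common base class to pass to isinstance(), the check ends
up being a duck-typing heuristic. A minimal sketch of the kind of helper
one has to write (the name is_namedtuple is mine, and this is not
necessarily the exact code from the linked answer):

```python
from collections import namedtuple


def is_namedtuple(obj):
    # Heuristic only: a tuple instance whose class defines a _fields
    # tuple made of strings. Any tuple subclass that happens to define
    # such an attribute will also pass.
    if not isinstance(obj, tuple):
        return False
    fields = getattr(type(obj), "_fields", None)
    return isinstance(fields, tuple) and all(
        isinstance(name, str) for name in fields
    )


Point = namedtuple("Point", ("x", "y"))
print(is_namedtuple(Point(5, 11)))  # True
print(is_namedtuple((5, 11)))       # False
```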

Backward compatibility
======================

This is probably the biggest barrier other than the "a C implementation is
less maintainable" argument. In order to avoid duplication of functionality
it would be great if collections.namedtuple() could remain a (deprecated)
factory function using ntuple() internally. FWIW I tried running the
stdlib's unit tests against https://github.com/llllllllll/cnamedtuple:
after removing the ones about the "_source" attribute and the "verbose" and
"module" arguments, I get a couple of errors about __doc__. I'm not sure
about more advanced use cases (subclassing, others...?) but overall it
appears pretty doable.

The collections.namedtuple() Python wrapper can include the necessary logic to
implement "verbose" and "rename" parameters when they're used. I'm not
entirely sure about the implications of the "module" parameter though
(Raymond?).

_make(), _asdict(), _replace() and the _fields attribute should also be
exposed; as for "_source", it appears it can easily be turned into a
property, which would also save some memory.

The biggest annoyance is probably fields' __doc__ assignment:
https://github.com/python/cpython/blob/ced36a993fcfd1c76637119d31c03156a8772e11/Lib/selectors.py#L53-L58

...which would require returning a clever class object, slowing down the
namedtuple declaration even when no parameters are passed, but
considering that the long-term plan is to replace collections.namedtuple()
with ntuple() I consider this acceptable.
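
For reference, this is the kind of per-field docstring assignment I'm
talking about, shown here with the current collections.namedtuple (Python
3.5+). The Point class and the docstrings are made up for illustration;
the point is that each field is a class-level descriptor with a writable
__doc__, which any replacement would have to keep supporting.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

# Class docstring and per-field docstrings are assigned after creation.
Point.__doc__ = "A 2D point."
Point.x.__doc__ = "Horizontal coordinate."
Point.y.__doc__ = "Vertical coordinate."

p = Point(10, 20)
print(Point.x.__doc__)  # Horizontal coordinate.
print(p.x)              # 10
```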

Thoughts?

--
Giampaolo - http://grodola.blogspot.com