[Python-Dev] Proposal: dict.with_values(iterable)

Tue Apr 23 12:29:39 EDT 2019

On Wed, Apr 24, 2019 at 12:34 AM Steve Dower <steve.dower at python.org> wrote:
>
> >> If it's a key-sharing dict, then all the keys are strings. We know that
> >> when we go to do the update, so we can intern all the strings (going to
> >> do that anyway) and then it's a quick check if it already exists. If
> >> it's a regular dict, then we calculate hashes as normal. Updating the
> >> value is just a decref, incref and assignment.
> >
> > There are some problem.
> >
> > 1. Searching hash table is not zero-cost, comparing to appending to sequence.
> >     This cost is very near to building new hash tables.
>
> If we know that you're sharing keys with the new items then we can skip
> the search. This was my point about the d2 = copy(d1); d2.update(zip(d2,
> values)) idea:
>

OK, I got it.
But note that zip object doesn't expose items, neither Python level or C level.

> > 2. In my proposal, main user is csv.DictReader or sql.DictCursor.
> >     They parse only values on each rows.   So they need to use map.
>
> In that case, use a private helper. _csv already has a native module. We
> don't need to add new public APIs for internal optimisations, provided
> there is a semantically equivalent way to do it without the internal API.

csv is stdlib.  But there are some third party extensions similar to csv.

>
> > 3. (CPython only) dict.copy(), dict(dict), and dict.update() are general purpose
> >     methods.  There is no obvious place to start using key-sharing dict.
>
> See my reply to Glenn, but potentially fromkeys() could start with the
> key-sharing dict and then copy()/dict() could continue sharing it
> (hopefully they already do?).

Key-sharing dict is used only for instance dict at the moment.

2nd argument of dict.fromkeys() is value, not values.
How about adding dict.fromkeyvalues(keys, values)?
When keys is dict, it's behavior is same to my first proposal
(`dict.with_values(d1, values)`).

> >
> > If *CPython* specialized dict(zip(dict, values)),  it still be CPython
> > implementation detail.
> > Do you want recommend using such CPython hacky optimization?
> > Should we use such optimization in stdlib, even if it will be slower
> > than dict(zip(keys_tuple, values)) on some other Python implementations?
>
> We do "hacky" optimisations everywhere :) The point of the runtime is to
> let users write code that works and we do the effort behind the scenes
> to make it efficient. We're not C - we're here to help our users.

But we avoid CPython-only hack which will make stdlib slower on other
Python implementations as possible.
For example, we optimize `s1 += s` loop.  But we use `''.join(list_of_str)`
instead of it.

>
> The point is that it will work on other implementations - including
> previous versions of CPython - and those are free to optimise it however
> they like.
>
> > Or do you propose making dict(zip(dict, values)) optimization as
> > language specification?
>
> Definitely not! It's just a pattern that we have the ability to
> recognize and optimize at runtime, so why not do it?

Why we need to recommend patterns fast only in CPython?

  d2 = dict.fromkeys(keys_dict)   # make key sharing dict, only in CPython 3.8+
  d2.update(zip(d2, row))  # update values without key lookup, only in
CPython 3.8+

Obviously, this may be much slower than `d2 = dict(zip(keys_tuple, row))` on
current CPython and other Python implementations.

Note that this pattern will be used when dict creation is bottleneck.

If we has specialized API, libraries can use it if the API is available,
and use dict(zip(keys, row)) otherwise.

>
> > One obvious advantage of having DictBuilder is it is for specific
> > purpose.  It has at least same performance to dict(zip(keys, values))
> > on all Python implementations.
> > Libraries like csv parser can use it without worrying about its performance
> > on Python other than CPython.
>
> A singular purpose isn't necessarily an obvious advantage. We're better
> off with generic building blocks that our users can compose in ways that
> were originally non-obvious (and then as patterns emerge we can look at
> ways to simplify or formalise them).

In generic building blocks, we can not know user will create massive dicts
with same keys or just creating one copy.  We need to guess, and the guess
may be wrong.

Regards,

-- 
Inada Naoki  <songofacandy at gmail.com>