On Wed, Mar 6, 2019 at 12:08 AM Guido van Rossum <guido@python.org> wrote:
On Tue, Mar 5, 2019 at 3:50 PM Josh Rosenberg <shadowranger+pythonideas@gmail.com> wrote:

On Tue, Mar 5, 2019 at 11:16 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Mar 03, 2019 at 09:28:30PM -0500, James Lu wrote:

> I propose that the + sign merge two python dictionaries such that if
> there are conflicting keys, a KeyError is thrown.

This proposal is for a simple, operator-based equivalent to
dict.update() which returns a new dict. dict.update has existed since
Python 1.5 (something like a quarter of a century!) and has never grown a
"unique keys" version.

I don't recall even seeing a request for such a feature. If such a
unique keys version is useful, I don't expect it will be useful often.

I have one argument in favor of such a feature: It preserves concatenation semantics. + means one of two things in all code I've ever seen (Python or otherwise):

1. Numeric addition (including element-wise numeric addition as in Counter and numpy arrays)
2. Concatenation (where the result preserves all elements, in order, including, among other guarantees, that len(seq1) + len(seq2) == len(seq1 + seq2))

dict addition that didn't reject non-unique keys wouldn't fit *either* pattern; under the main proposal (making it equivalent to left.copy(), followed by .update(right)), the left-hand side would win on ordering and the right-hand side on values, and the length invariant of concatenation wouldn't be preserved. At least when repeated keys are rejected, most concatenation invariants are preserved; order is all of the left elements followed by all of the right, and no elements are lost.
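To make the two behaviors concrete, here is a rough sketch. The helper names merge_update and merge_strict are made up for illustration and are not part of any proposal; they only mimic "copy then update" and "reject duplicate keys" respectively.

```python
def merge_update(left, right):
    """Hypothetical helper: left.copy() followed by .update(right)."""
    result = left.copy()
    result.update(right)
    return result


def merge_strict(left, right):
    """Hypothetical helper: raise KeyError if any key appears in both."""
    dupes = left.keys() & right.keys()
    if dupes:
        raise KeyError(f"conflicting keys: {dupes!r}")
    result = left.copy()
    result.update(right)
    return result


left = {"a": 1, "b": 2}
right = {"b": 20, "c": 30}

merged = merge_update(left, right)
# "b" keeps its position from the left operand but takes its value
# from the right operand, and one entry is lost from the total count.
print(list(merged.items()))                   # [('a', 1), ('b', 20), ('c', 30)]
print(len(left) + len(right) == len(merged))  # False

# With disjoint keys the strict version keeps the concatenation invariants.
disjoint = merge_strict({"a": 1}, {"c": 3})
print(len(disjoint) == 2)                     # True

# With overlapping keys it refuses instead of picking a winner.
try:
    merge_strict(left, right)
    strict_raised = False
except KeyError:
    strict_raised = True
```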

I must by now have seen dozens of posts complaining about this aspect of the proposal. I think this is just making up rules (e.g. "+ never loses information") to deal with an aspect of the design where a *choice* must be made. This may reflect the Zen of Python's "In the face of ambiguity, refuse the temptation to guess." But really, that's a pretty silly rule (truly, they aren't all winners). Good interface design constantly makes choices in ambiguous situations, because the alternative is constantly asking, and that's just annoying.

We have a plethora of examples (in fact, almost all alternatives considered) of situations related to dict merging where a choice is made between conflicting values for a key, and it's always the value further to the right that wins: from d[k] = v (which overrides the value when k is already in the dict) to d1.update(d2) (which lets the values in d2 win), including the much lauded {**d1, **d2} and even plain {'a': 1, 'a': 2} has a well-defined meaning where the latter value wins.
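A quick illustration of the cases listed above, each resolving a conflict in favor of the rightmost value. The variable names are invented for the example.

```python
# d[k] = v overrides the value when k is already in the dict.
d = {"a": 1}
d["a"] = 2

d1 = {"a": 1, "b": 2}
d2 = {"a": 10}

# d1.update(d2) lets the values in d2 win (done on a copy to keep d1 intact).
updated = d1.copy()
updated.update(d2)

# {**d1, **d2}: later unpackings override earlier ones.
unpacked = {**d1, **d2}

# A literal with a repeated key keeps the latter value.
literal = {"a": 1, "a": 2}

print(d, updated, unpacked, literal)
# {'a': 2} {'a': 10, 'b': 2} {'a': 10, 'b': 2} {'a': 2}
```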

Yeah. And I'm fine with the behavior for update because the name itself is descriptive; we're spelling out, in English, that we're update-ing the thing it's called on, so it makes sense to have the thing we're sourcing for updates take precedence.

Similarly, for dict literals (and by extension, unpacking), it's following an existing Python convention which doesn't contradict anything else.

Overloading + lacks the clear descriptive aspect of update that describes the goal of the operation, and contradicts conventions (in Python and elsewhere) about how + works (addition or concatenation, and a lot of people don't even like it doing the latter, though I'm not that pedantic).

A couple "rules" from C++ on overloading are "Whenever the meaning of an operator is not obviously clear and undisputed, it should not be overloaded. Instead, provide a function with a well-chosen name." and "Always stick to the operator’s well-known semantics". (Source: https://stackoverflow.com/a/4421708/364696 , though the principle is restated in many other places). Obviously the C++ community isn't perfect on this (see iostream and <</>> operators), but they're otherwise pretty consistent. + means addition, and in many languages including C++ strings, concatenation, but I don't know of any languages outside the "esoteric" category that use it for things that are neither addition nor concatenation. You've said you don't want the whole plethora of set-like behaviors on dicts, but dicts are syntactically and semantically much more like sets than sequences, and if you add + (with semantics differing from both sets and sequences), the language becomes less consistent.

I'm not against making it easier to merge dictionaries. But people seem to be arguing that {**d1, **d2} is bad because of magic punctuation that obscures meaning, when IMO:

     d3 = d1 + d2

is obscuring meaning by adding yet a third rule for what + means, inconsistent with both existing rules (from both Python and the majority of languages I've had cause to use). A named method (class or instance) or top-level function (a la sorted) is more explicit and easier to look up (after all, the major complaint about ** syntax is the difficulty of finding the documentation on it). It's also easier to make it do the right thing; d1 + d2 + d3 + ... + dN is inefficient (makes many unnecessary temporaries), {**d1, **d2, **d3, ..., **dN} is efficient but obscure (and not subclass friendly), but a varargs method like dict.combine(d1, d2, d3, ..., dN) (or merge, or whatever; I'm not trying to bikeshed) is correct, efficient, and most importantly, easy to look up documentation for.
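For what it's worth, a varargs merge along those lines could look roughly like the sketch below. Neither the class MergeableDict nor the method name combine exists in Python; both are placeholders, and the sketch assumes the subclass can be constructed with no arguments and that later mappings win on conflicts.

```python
class MergeableDict(dict):
    """Hypothetical dict subclass, used only to illustrate a varargs merge."""

    @classmethod
    def combine(cls, *mappings):
        # One result object is built and updated in place, so no
        # intermediate dicts are created however many inputs there are.
        result = cls()
        for mapping in mappings:
            result.update(mapping)
        return result


combined = MergeableDict.combine({"a": 1}, {"b": 2}, {"a": 3})
print(combined)                  # {'a': 3, 'b': 2}
print(type(combined).__name__)   # MergeableDict
```

Because the result is created via cls(), a subclass gets back an instance of itself, which {**d1, **d2} cannot offer.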

I occasionally find it frustrating that concatenation exists given the wealth of Schlemiel the Painter's algorithms it encourages, and the "correct" solution for combining sequences (itertools.chain for general cases, str.join/bytes.join for special cases) being less obvious means my students invariably use the "wrong" tool out of convenience (and it's not really wrong in 90% of code where the lengths are always short, but then they use it where lengths are often huge and suffer for it). If we're going to make dict merging more convenient, I'd prefer we make the obvious, convenient solution also the one that doesn't encourage non-scalable anti-patterns.
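As a small illustration of the pattern I mean (the data is invented for the example): repeated + re-copies everything accumulated so far at each step, while itertools.chain and str.join do the work in a single pass.

```python
from itertools import chain

chunks = [[1, 2], [3], [4, 5, 6]]

# Schlemiel-style: every iteration builds a brand-new list containing
# all earlier elements again, so total work grows quadratically.
total = []
for chunk in chunks:
    total = total + chunk

# Single pass over all the inputs, no intermediate lists.
flat = list(chain.from_iterable(chunks))

# The string special case: join instead of repeated +.
words = ["spam", "eggs", "ham"]
joined = " ".join(words)

print(total == flat)   # True
print(joined)          # spam eggs ham
```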

As to why raising is worse: First, none of the other situations I listed above raises for conflicts. Second, there's the experience of str+unicode in Python 2, which raises if the str argument contains any non-ASCII bytes. In fact, we disliked it so much that we changed the language incompatibly to deal with it.

Agreed, I don't like raising. It's consistent with + (the only argument in favor of it really), but it's a bad idea, for all the reasons you mention.

- Josh Rosenberg