[Python-ideas] Adding "+" and "+=" operators to dict

Fri Feb 13 00:59:24 CET 2015

On Thursday, February 12, 2015 1:08 AM, Ian Lee <ianlee1521 at gmail.com> wrote:

>Alright, I've tried to gather up all of the feedback and organize it in something approaching the alpha draft of a PEP might look like::
>
>Proposed New Methods on dict
>============================

Hold on. If you really only want these to work on dict, as opposed to dict subclasses and collections.abc.Mapping subclasses, you don't need the whole section on which type wins, and you can make everything a whole lot simpler.

On the other hand, if you _do_ want them to work on all mappings, your definition doesn't work at all.

>Adds two dicts together, returning a new object with the type of the left hand
>operand. This would be roughly equivalent to calling:
>
>
>    >>> new_dict = old_dict.copy();
>    >>> new_dict.update(other_dict)

The first problem is that (as you noticed), dict.copy returns a dict, not a type(self), and it copies directly from the underlying dict storage (if you subclass dict and override __fooitem__ but not copy).

I think you can ignore this problem. Note that set.__or__ always returns a set, not a type(self), so set subclasses don't preserve their type with operators. If this is good enough for set.__or__, and for dict.copy, it's probably good enough for dict.__add__, right? More generally, whenever you subclass a builtin type and call its (non-overridden) builtin methods, it generally ignores your subclassiness (e.g., call __add__ on a subclass of int, and you get back an int, not an instance of your subclass).

The bigger problem is that Mapping.copy doesn't exist, and of course Mapping.update doesn't exist, because that's a mutating method (and surely you wouldn't want to define a non-mutating method like + to only exist on MutableMapping?).

This one is harder to dismiss. If you want to define this on Mapping, you can't just hand-wave it and say "it's like calling copy and update if those methods existed and worked this way", because you're going to have to write the actual implementation. Which means you might as well define it as (possibly a simplified version of) whatever you end up writing, and skip the handwaving.

Fortunately, Set.__or__ already gives us a solution, but it's not quite as trivial as what you're suggesting: just construct a new dict from the key-value pairs of both dicts.

Unfortunately, that construction isn't guaranteed behavior for a Mapping, just as construction from an iterable of values isn't guaranteed for a Set; Set handles this by providing a _from_iterable* classmethod (that defaults to just calling the constructor, but can be overridden if that isn't appropriate) to handle the rare cases where it needs to be violated. So, all you need is this:

@classmethod
def _from_iterable(cls, it):
return cls(it)
def __add__(self, rhs):
return self._from_iterable(kv for m in (self, rhs) for kv in m.items())

This gives you an automatic and obvious answer to all the potentially bikesheddable questions:

1. The type is up to type(lhs) to decide, but will default to type(lhs) itself.

2. How duplicate keys are handled is up to type(lhs), but any dict-like type will keep the rightmost.

And it will just work with any reasonable Mapping, or even most unreasonable ones.** OrderedDict will keep the order, Counter could remove its __add__ override if you gave it a _from_iterable,*** etc.

This also means you probably don't need __radd__.

However, Set.__or__ had the advantage that before it was added, Set, and collection ABCs, didn't exist at all, set itself was pretty new, set.__or__ already existed, etc., so there were few if any "legacy" set classes that Set.__or__ would have to work with. Mapping.__add__ doesn't have that advantage. So, let's go over the cases.

But first, note that existing code using legacy mappng classes will be fine, because existing code will never use + on mappings. It's only if you want to write new code, that uses +, with legacy mapping classes that there's a potential problem.

1. Existing dict subclass: Will just work, although it will return a dict, not an instance of the subclass. (As mentioned at the top, I think this is fine.)

2. Existing collections.UserDict subclass: Will just work.

3. Existing collections.abc.Mapping subclass: If its constructor does the right thing with an iterable of key-value pairs (as most, but not all, such types do), it will work out of the box. If it doesn't accept such an iterable, you'll get a TypeError, which seems reasonable--you need to update your Mapping class, because it's not actually a valid Mapping for use in new code. The problem is that if it accepts such an iterable but does the wrong thing (see Counter), you may have a hard-to-debug problem on your hands. I think this is rare enough that you can just mention it in the PEP (and maybe a footnote or comment in the Added section in the docs) and dismiss it, but there's room for argument here.

4. None of the above, but follows the mapping protocol (including any code that still uses the old UserDict recipe instead of the stdlib version): Will raise a TypeError. Which is fine. The only potential problem is that someone might call collections.abc.Mapping.register(MyLegacyMapping), which would now be a lie, but I think that can be mentioned in the PEP and dismissed.

Finally, what about performance?

Unfortunately, my implementation is much slower than a copy/update solution. From a quick test of a class that delegates to a plain dict in the obvious way, it converges to about 3x slower at large N. You can test the same thing without the class:

def a1(d1, d2):
d = d1.copy()
d.update(d2)
return d

def a2(d1, d2):
return type(d1)(kv for m in (d1, d2) for kv in m.items())

d = {i:i for i in range(1000000)}
%timeit a1(d, d)
%timeit a2(d, d)

You could provide an optimized MutableMapping.__add__ to handle this (using copy.copy to deal with the fact that copy isn't guaranteed to be there****), but I don't think it's really necessary. After all, you could get about the same benefit for MutableSet and MutableSequence operators, but the ABCs don't bother.

But there is another advantage of a copy-and-update implementation: most legacy mappings are MutableMappings, and fewer of them are going to fail to be copyable than to be constructible from an iterable of key-value pairs, and those that fail are less likely to do so silently. So, even though in theory it's no better, in practice it may mean fewer problems.

Maybe it would be simpler to add copy to the Mapping API, in which case MutableMapping.__add__ could just rely on that. But that (re-)raises the problem that dict.copy() returns a dict, not a type(self), so now it's violating the Mapping ABC. And I'm not sure you want to change that. The dict.copy method (and its C API equivalent PyDict_Copy) is used all over the place by the internals of Python (e.g., by type, to deal with attribute dictionaries for classes with the default behavior). And the fact that copy isn't documented to return type(self), and isn't defined on Mapping, implies that it's there really more for uses cases that need it to be fast, rather than because it's considered inherent to the type. 

So, I think you're better off just leaving out the copy/update idea.

Meanwhile, one quick answer:

>Adding mappings of different types
>----------------------------------
>
>
>Here is what currently happens in a few cases in Python 3.4, given:
>
>
>    >>> class A(dict): pass
>    >>> class B(dict): pass
>    >>> foo = A({'a': 1, 'b': 'xyz'})
>    >>> bar = B({'a': 5.0, 'c': object()})
>
>
>Currently (this surprised me actually... I guess it gets the parent class?):
>
>
>    >>> baz = foo.copy()
>    >>> type(baz)
>    dict

As mentioned above, copy isn't part of the Mapping or MutableMapping API; it seems to be provided by dict because it's a useful optimization in many cases, that Python itself uses.

If you look at the CPython implementation in Objects/dictobject.c, dict.copy just calls the C API function PyDict_Copy, which calls PyDict_New then PyDict_Merge. This is basically the same as calling d = dict() then d.update(self), but PyDict_Merge can be significantly faster because it can go right to the raw dict storage for self, and can fast-path the case where rhs is an actual dict as well.

---

* Note that Set._from_iterable is only documented in a footnote in the collections.abc docs, and the same is probably fine for Mapping._from_iterable.

** The craziest type I can I think of is an AbsoluteOrderedDict used for caches that may have to be merged together and still preserve insertion order. I actually built one of these, by keeping a tree of TimestampedTuple(key, value) items, sorted by the tuple's timestamp. I'd have to dig up the code, but I'm pretty sure it would just work with this __add__ implementation.

*** Of course Counter.__add__ might be faster than the generic implementation would be, so you might want to keep it anyway. But the fact that they have the same semantics seems like a point in favor of this definition of Mapping.__add__.

**** Or maybe try self.copy() and then call copy.copy(self) on AttributeError. After all, self.copy() could be significantly faster (e.g., see UserDict.copy).