[Python-3000] Thoughts on dictionary views

Wed Feb 21 01:13:42 CET 2007

On 2/20/07, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Raymond Hettinger wrote:
> > The Java concept of dictionary views seems to have caught-on here while I wasn't
> > looking.  At the risk of covering some old ground, I would like to re-open the
> > question.  Here are a few thoughts on the subject to kick-off the discussion:
> >
> > * Maintaining a live (self-updating) view is a bit tricky from an implementation
> > point-of-view.  While it is clearly doable for dictionaries, it is not clear
> > that it is a good idea for a general mapping API which can be wrapped around
> > dbms, shelves, elementtrees, b-trees, and other wrascally rabbits.  I doubt that
> > the underlying structures of other mapping types support the observer pattern
> > necessary to keep views updated -- this is doubly true if the underlying data is
> > on disk and can be updated by other processes, threads, etc.
>
> FWIW, the py3k trunk is still somewhat broken from the implementation of
> this change. Without any test resources enabled, I still get more than
> half a dozen failures which appear at a glance to be related to
> unexpected mutation of dictionaries (test_anydbm, test_dumbdbm,
> test_mutants, test_compile, test_iter, test_iterlen, test_minidom,
> test_os, test_importhooks, test_unittest)

Yes, I plan to have those fixed before PyCon (so we won't have to
waste time on them at the sprint). But if someone wants to help that
would be great!

> > * One of the purported benefits is to provide set-like behavior without the
> > expense of copying to a new set object.  FWIW, I've updated the set
> > implementation to be more interoperable with dictionaries so that the conversion
> > costs are negligible (about the same as a dict resize operation -- one pass, no
> > calls to PyObject_Hash, insertion into a presized, sparse table with very few
> > collisions).
>
> The speed costs may become negligible, but I believe the main concern
> here is memory consumption (minimising memory usage is certainly the
> only reason I've ever made sure to use the dict.iter* methods).

As I said, it's still O(N) time and space, vs. O(1) for creating a view.

[...]
> One of the things that's been suggested for working with strings in Py3k
> is a stringview type - a wrapper type around a string where slicing (and
> similar operations, like partition()) creates objects with a reference &
> offset into the original string rather than actually copying data
> around.

That must've been proposed while *I* was away. :-) I'm not at all
convinced that this kind of complexity is helpful at all. But note
that it's a different case from dict views, since string views can be
turned into copies with only some cost (or savings :-) in time and
space, but without sematic changes. Dict views have semantics.

> Standard strings would be entirely unaffected. The concept of
> use being that a functions would accept a string-like object as an
> object, wrap the stringview around it, then convert the result back to a
> normal string before passing it back to the caller.
>
> Couldn't something similar serve as a replacement for the iter*()
> methods on dictionaries? That is, rather than copying the data into a
> different data structure,

Um, neither iterkeys() nor dict views do any copying. They just
reference the underlying dict in a tiny fixed-size object (literally
one pointer for dict views, a bit more for iterkeys(), in order to
detect mutations to the dict).

> instead provide a group of wrapper classes,
> each of which operate on the wrapped mapping in a different way?
>
> The necessary wrapper classes needed would be:
>    - set API exposing keys of the original mapping
>    - multiset API exposing values of the underlying mapping
>    - keyed set API exposing (key, value) pairs of the original mapping
>    - mapping API that uses the above for keys(), values() & items(), but
>      otherwise delegates operations to the original mapping
>
> All except the last already exist in the Py3k branch (as the role of the
> last suggested wrapper type is currently being handled by the dict data
> type itself)

This is a clear explanation of the implementation; but can you also
explain the benefits?

> Similar to Raymond's suggestion of a new concrete container type (rather
> than a wrapper type), this could be included in the standard library for
> Python 2.6, making it significantly easier to write forward compatible code.

I doubt that writing forward compatible code will be hard anyways. The
most compatible code simply doesn't use any of the six affected
methods (keys(), iterkeys(), etc.) and instead relies on directly
manipulating or iterating over the dict. This is backwards compatible
all the way back to 2.2. Also, the conversion tool will make it easy
to write compatible code that uses iterkeys() but not keys().

> Given such a new dict wrapper type (or standalone container type, if
> Raymond's approach is taken), then it would also be possible to change
> the basic mapping API to define different return types for the 3 methods
> that currently return lists:
>    - keys() would be changed to return a set
>    - values() would be changed to return a multiset (non-hash based)
>    - items() would be changed to return a keyed set (ala sort keys)

I'm not sure what you mean by "ala sort keys". I hope there's no
requirement that items() be sorted.

Note that the implementation of a multiset type will be quite tricky
(I'm punting on this in the rewrite of PEP 3106 that just got
refreshed on python.org).

Also, this implementation will make it hard to ensure that

list(zip(d.keys(), d.values())) == list(d.items())

as the keys() and values() return different object types.

> The difference from the current Py3k branch is that these would still
> involve copying the data from the original dictionary to a new concrete
> container object which is then returned.

But that's the main thing I'm trying to *avoid* with the new API!

> The more I've seen of these discussions (the original dict method one,
> as well as the string concatenation/slicing one), the more leery I
> become of including any view type behaviour in the basic data types.

I can't say I see much similarity between the two discussions. The
issues are all completely different -- for dicts they focus on API
semantics, while for strings they focus on performance in all sorts of
odd cases.

> One factor in this is that I've been getting back into C++ coding
> lately, and keep getting reminded of the various cases where the C++
> standard defaults to behaviours that are faster (use less memory,
> whatever) when they're valid, but silently do the wrong thing when
> they're inappropriate (default assignment and copy constructions
> operators that lead to significant memory double-free problems are a
> nice example, as is the fact that methods are non-virtual by default).
> So rather than spend the time to figure out whether or not the default
> behaviour is safe for each case, it becomes quicker and easier to just
> stick in the boilerplate to tell the compiler "don't do the default
> thing, it is probably wrong", thus completely invalidating the supposed
> performance improvement that was meant to be provided by the default
> behaviour (and requiring a programmer to put in a comment saying so when
> the default behaviour really is what they want).

Well, unless you plan to get rid of dict.__iter__(), that's a default
behavior that is wrong whenever you mutate the dict in the loop -- but
what are you going to do about it?

> Typically, Python doesn't work that way - it defaults to 'safe'
> behaviour, but provides sufficient flexibility to permit optimisation
> when it is necessary (e.g., with the size of the data sets the NumPy
> folks sling around, view-based behaviour is essential, and Python lets
> them do it that way).
>
> My apologies for rambling a bit - I can't currently give a succinct
> explanation for why the current direction feels wrong, but I felt it was
> worth supporting Raymond on this point.

Apologies accepted -- but yes, you did ramble a bit, and I still wish
you'd collected your thoughts a bit more. if there are simple clear
arguments it's easier for me to accept or reject them than with a
bunch of ramblings. Sorry to be grumpy, but given the implementation
stage this is in and how long the PEP has been sitting unchanged I'm a
bit annoyed that the criticism, valid or not, comes so late.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)