[Python-3000] Thoughts on dictionary views

Tue Feb 20 14:24:35 CET 2007

Raymond Hettinger wrote:
> The Java concept of dictionary views seems to have caught-on here while I wasn't 
> looking.  At the risk of covering some old ground, I would like to re-open the 
> question.  Here are a few thoughts on the subject to kick-off the discussion:
> 
> * Maintaining a live (self-updating) view is a bit tricky from an implementation 
> point-of-view.  While it is clearly doable for dictionaries, it is not clear 
> that it is a good idea for a general mapping API which can be wrapped around 
> dbms, shelves, elementtrees, b-trees, and other wrascally rabbits.  I doubt that 
> the underlying structures of other mapping types support the observer pattern 
> necessary to keep views updated -- this is doubly true if the underlying data is 
> on disk and can be updated by other processes, threads, etc.

FWIW, the py3k trunk is still somewhat broken from the implementation of 
this change. Without any test resources enabled, I still get more than 
half a dozen failures which appear at a glance to be related to 
unexpected mutation of dictionaries (test_anydbm, test_dumbdbm, 
test_mutants, test_compile, test_iter, test_iterlen, test_minidom, 
test_os, test_importhooks, test_unittest)

> 
> * One of the purported benefits is to provide set-like behavior without the 
> expense of copying to a new set object.  FWIW, I've updated the set 
> implementation to be more interoperable with dictionaries so that the conversion 
> costs are negligible (about the same as a dict resize operation -- one pass, no 
> calls to PyObject_Hash, insertion into a presized, sparse table with very few 
> collisions).

The speed costs may become negligible, but I believe the main concern 
here is memory consumption (minimising memory usage is certainly the 
only reason I've ever made sure to use the dict.iter* methods).

However, the string discussion has given me another view on that front, 
too... (more on that below)

> My recommendation is to take a more conservative route.  Let's make dicts as 
> simple as possible and then introduce a new collections module entry with the 
> views bells and whistles.  If the collections version proves itself as 
> enormously popular, useful, understandable, and without a good equivalent, then 
> it can ask for a promotion.  The collections module is nice place to put in 
> alternate datatypes that meet the more demanding needs of advanced users who 
> know exactly what they want/need in terms of special behaviors or performance. 
> And, if we take the collections module route, there is no reason that it cannot 
> be put into Py2.6 where people will either flock to it or ignore it, with either 
> result providing us with good guidance for Py3.0.

One of the things that's been suggested for working with strings in Py3k 
is a stringview type - a wrapper type around a string where slicing (and 
similar operations, like partition()) creates objects with a reference & 
offset into the original string rather than actually copying data 
around. Standard strings would be entirely unaffected. The concept of 
use being that a functions would accept a string-like object as an 
object, wrap the stringview around it, then convert the result back to a 
normal string before passing it back to the caller.

Couldn't something similar serve as a replacement for the iter*() 
methods on dictionaries? That is, rather than copying the data into a 
different data structure, instead provide a group of wrapper classes, 
each of which operate on the wrapped mapping in a different way?

The necessary wrapper classes needed would be:
   - set API exposing keys of the original mapping
   - multiset API exposing values of the underlying mapping
   - keyed set API exposing (key, value) pairs of the original mapping
   - mapping API that uses the above for keys(), values() & items(), but
     otherwise delegates operations to the original mapping

All except the last already exist in the Py3k branch (as the role of the 
last suggested wrapper type is currently being handled by the dict data 
type itself)

Similar to Raymond's suggestion of a new concrete container type (rather 
than a wrapper type), this could be included in the standard library for 
Python 2.6, making it significantly easier to write forward compatible code.

Given such a new dict wrapper type (or standalone container type, if 
Raymond's approach is taken), then it would also be possible to change 
the basic mapping API to define different return types for the 3 methods 
that currently return lists:
   - keys() would be changed to return a set
   - values() would be changed to return a multiset (non-hash based)
   - items() would be changed to return a keyed set (ala sort keys)

The difference from the current Py3k branch is that these would still 
involve copying the data from the original dictionary to a new concrete 
container object which is then returned.

The more I've seen of these discussions (the original dict method one, 
as well as the string concatenation/slicing one), the more leery I 
become of including any view type behaviour in the basic data types.

One factor in this is that I've been getting back into C++ coding 
lately, and keep getting reminded of the various cases where the C++ 
standard defaults to behaviours that are faster (use less memory, 
whatever) when they're valid, but silently do the wrong thing when 
they're inappropriate (default assignment and copy constructions 
operators that lead to significant memory double-free problems are a 
nice example, as is the fact that methods are non-virtual by default). 
So rather than spend the time to figure out whether or not the default 
behaviour is safe for each case, it becomes quicker and easier to just 
stick in the boilerplate to tell the compiler "don't do the default 
thing, it is probably wrong", thus completely invalidating the supposed 
performance improvement that was meant to be provided by the default 
behaviour (and requiring a programmer to put in a comment saying so when 
the default behaviour really is what they want).

Typically, Python doesn't work that way - it defaults to 'safe' 
behaviour, but provides sufficient flexibility to permit optimisation 
when it is necessary (e.g., with the size of the data sets the NumPy 
folks sling around, view-based behaviour is essential, and Python lets 
them do it that way).

My apologies for rambling a bit - I can't currently give a succinct 
explanation for why the current direction feels wrong, but I felt it was 
worth supporting Raymond on this point.

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org