[Python-Dev] PEP 455 -- TransformDict
raymond.hettinger at gmail.com
Thu May 14 16:29:55 CEST 2015
Before the Python 3.5 feature freeze, I should step up and
formally reject PEP 455 for "Adding a key-transforming
dictionary to collections".
I had completed an involved review effort a long time ago
and I apologize for the delay in making the pronouncement.
What made it an interesting choice from the outset is that the
idea of a "transformation" is an enticing concept that seems
full of possibility. I spent a good deal of time exploring
what could be done with it but found that it mostly fell short
of its promise.
There were many issues. Here are some that were at the top:
* Most use cases don't need or want the reverse lookup feature
(what is wanted is a set of one-way canonicalization functions).
Those that do would want to have a choice of what is saved
(first stored, last stored, n most recent, a set of all inputs,
a list of all inputs, nothing, etc). In database terms, it
models a many-to-one table (the canonicalization or
transformation function) with the one being a primary key into
another possibly surjective table of two columns (the
key/value store). A surjection into another surjection isn't
inherently reversible in a useful way, nor does it seem to be a
common way to model data.
* People are creative at coming up with use cases for the TD
but then find that the resulting code is less clear, slower,
less intuitive, more memory intensive, and harder to debug than
just using a plain dict with a function call before the lookup:
d[func(key)]. It was challenging to find any existing code
that would be made better by the availability of the TD.
* The TD seems to be all about combining data scrubbing
(case-folding, unicode canonicalization, type-folding, object
identity, unit-conversion, or finding a canonical member of an
equivalence class) with a mapping (looking up a value for a
given key). Those two operations are conceptually orthogonal.
The former doesn't get easier when hidden behind a mapping API
and the latter loses the flexibility of choosing your preferred
mapping (an ordereddict, a persistentdict, a chainmap, etc) and
the flexibility of establishing your own rules for whether and
how to do a reverse lookup.
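A minimal sketch of the plain-dict alternative described above, i.e.
d[func(key)] with the scrubbing step kept separate from the mapping.
The names canonical, prices, and first_seen are hypothetical and chosen
only for illustration; the "first stored" reverse-lookup policy is one
arbitrary choice among those listed.

```python
from collections import OrderedDict


def canonical(key):
    """One-way canonicalization (hypothetical helper): case-insensitive keys."""
    return key.casefold()


# The scrubbing step is separate, so any mapping type can be used.
prices = OrderedDict()
prices[canonical("Apple")] = 1.25
prices[canonical("BANANA")] = 0.50

# Lookup is a plain function call before indexing: d[func(key)].
apple_price = prices[canonical("APPLE")]

# If a reverse lookup is wanted, the caller picks the policy.
# Here: remember the first spelling seen for each canonical key.
first_seen = {}
for raw in ["Apple", "APPLE", "apple"]:
    first_seen.setdefault(canonical(raw), raw)
```

Swapping OrderedDict for a ChainMap or a plain dict changes nothing
else in the code, and the canonical form stays directly accessible.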
P.S. Besides the core conceptual issues listed above, there
are a number of smaller issues with the TD that surfaced
during design review sessions. In no particular order, here
are a few of the observations:
* It seems to require above-average skill to figure out what
can be used as a transform function. It is more
expert-friendly than beginner-friendly. It takes a little
while to get used to it. It wasn't self-evident that
transformations happen both when a key is stored and again
when it is looked up (contrast this with key-functions for
sorting which are called at most once per key).
* The name, TransformDict, suggests that it might transform the
value instead of the key or that it might transform the
dictionary into something else. The name TransformDict is so
general that it would be hard to discover when faced with a
specific problem. The name also limits perception of what
could be done with it (e.g. a function that logs accesses
but doesn't actually change the key).
* The tool doesn't describe itself well. Looking at the
help(), or the __repr__(), or the tooltips did not provide
much insight or clarity. The dir() shows many of the
_abc implementation details rather than the API itself.
* The original key is stored and if you change it, the change
isn't stored. The _original dict is private (perhaps to
reduce the risk of putting the TD in an inconsistent state)
but this limits access to the stored data.
* The TD is unsuitable for bijections because the API is
inherently biased with a rich group of operators and methods
for forward lookup but has only one method for reverse lookup.
* The reverse feature is hard to find (getitem vs __getitem__)
and its output pair is surprising and a bit awkward to use.
It provides only one accessor method rather than the full
dict API that would be given by a second dictionary. The
API hides the fact that there are two underlying dictionaries.
* It was surprising that when d[k] failed, it failed with a
transformation exception rather than a KeyError, violating
the expectations of the calling code (for example, if the
transformation function is int(), the call d["12"]
transforms to d[12] and either returns a value
or raises a KeyError, but the call d["12.0"] fails with
a ValueError). The latter issue limits its substitutability
into existing code that expects real mappings and for
exposing to end-users as if it were a normal dictionary.
* There were other issues with dict invariants as well and
these affected substitutability in a sometimes subtle way.
For example, the TD does not work with __missing__().
Also, "k in td" does not imply that "k in list(td.keys())".
* The API is at odds with wanting to access the transformations.
You pay a transformation cost both when storing and when
looking up, but you can't access the transformed value itself.
For example, if the transformation is a function that scrubs
hand entered mailing addresses and puts them into a standard
format with standard abbreviations, you have no way of getting
back to the cleaned-up address.
* One design reviewer summarized her thoughts like this:
"There is a learning curve to be climbed to figure out what
it does, how to use it, and what the applications [are].
But, the [working out the same] examples with plain dicts
requires only basic knowledge." -- Patricia
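The exception behavior noted in the list above (a transform of int())
can be reproduced with a small stand-in. This is only a sketch:
KeyTransformingDict is a hypothetical class written for illustration,
not the PEP 455 implementation, and lookup() stands for typical calling
code that expects a real mapping.

```python
class KeyTransformingDict(dict):
    """Hypothetical stand-in: applies `transform` to keys on store and lookup."""

    def __init__(self, transform):
        super().__init__()
        self._transform = transform

    def __setitem__(self, key, value):
        super().__setitem__(self._transform(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._transform(key))


def lookup(mapping, key):
    """Typical calling code: expects a missing key to raise KeyError."""
    try:
        return mapping[key]
    except KeyError:
        return None


d = KeyTransformingDict(int)
d["12"] = "twelve"           # stored under the int key 12

found = lookup(d, "12")      # transforms to d[12] and succeeds
missing = lookup(d, "13")    # transforms to d[13]; KeyError is caught

try:
    lookup(d, "12.0")        # int("12.0") fails before any lookup happens
    escaped = None
except Exception as exc:
    # The conversion error is not a KeyError, so lookup() did not catch it.
    escaped = type(exc).__name__
```

With a plain dict the caller writes d[int(key)] and can see that the
conversion step may raise its own exception.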