Mailman 3 An identity dict - Python-ideas

An identity dict

Benjamin Peterson

May 30, 2010

3:27 a.m.

In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id(). This dict is very useful for keep track of objects that should not be compared by normal comparison methods. For example, I would use an identitydict to hold weak references that would otherwise fallback to their referant object's hash. An advantage of formalizing this in collections would be to enable other Python implementations like PyPy, where id() is expensive, to provide an optimized identitydict. Regards, Benjamin

Show replies by date

Lie Ryan

May 2010

9:07 a.m.

On 05/30/10 13:27, Benjamin Peterson wrote:

...

In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id().

This dict is very useful for keep track of objects that should not be compared by normal comparison methods. For example, I would use an identitydict to hold weak references that would otherwise fallback to their referant object's hash.

what's wrong with dict[id(key)] = foo?

...

An advantage of formalizing this in collections would be to enable other Python implementations like PyPy, where id() is expensive, to provide an optimized identitydict.

that their id() is expensive is implementation details, and the developer of PyPy should solve that instead of adding a clutch to the stdlib.

Masklinn

10:35 a.m.

On 2010-05-30, at 11:07 , Lie Ryan wrote:

...

...
An advantage of formalizing this in collections would be to enable other Python implementations like PyPy, where id() is expensive, to provide an optimized identitydict.

that their id() is expensive is implementation details, and the developer of PyPy should solve that instead of adding a clutch to the stdlib.

Actually, I'd say the exact opposite: CPython's identity being cheap is an implementation detail, not the other way around, and that is what shouldn't be relied on. In that light, a special-purpose identity dictionary independent from implementation-specific and artificially low id() computation costs is a pretty good idea.

Lie Ryan

11:14 a.m.

On 05/30/10 20:35, Masklinn wrote:

...

On 2010-05-30, at 11:07 , Lie Ryan wrote:

...
...
An advantage of formalizing this in collections would be to enable other Python implementations like PyPy, where id() is expensive, to provide an optimized identitydict.

that their id() is expensive is implementation details, and the developer of PyPy should solve that instead of adding a clutch to the stdlib.

Actually, I'd say the exact opposite: CPython's identity being cheap is an implementation detail, not the other way around, and that is what shouldn't be relied on. In that light, a special-purpose identity dictionary independent from implementation-specific and artificially low id() computation costs is a pretty good idea.

the performance of any Python statement/function is implementation detail. Whether any statement/function is fast or cheap shouldn't be relied on. IMHO, adding a brand new collection module just because a Python implementation has a fast/cheap operations isn't a good reason.

Benjamin Peterson

2:23 p.m.

Lie Ryan <lie.1296@...> writes:

...

what's wrong with dict[id(key)] = foo?

For one, you can't get the value of the key out of the dict.

...

that their id() is expensive is implementation details, and the developer of PyPy should solve that instead of adding a clutch to the stdlib.

A "clutch"? Even ignoring PyPy, an identitydict is still a useful primitive.

Lie Ryan

2:56 p.m.

On 05/31/10 00:23, Benjamin Peterson wrote:

...

Lie Ryan <lie.1296@...> writes:

...
what's wrong with dict[id(key)] = foo?

For one, you can't get the value of the key out of the dict.

mydict[id(key)] = (key, value)

...

...
that their id() is expensive is implementation details, and the developer of PyPy should solve that instead of adding a clutch to the stdlib.

A "clutch"? Even ignoring PyPy, an identitydict is still a useful primitive.

it may be useful, but if you can emulate it relatively easily with regular dict, why bother with a new type?

Benjamin Peterson

3:20 p.m.

Lie Ryan <lie.1296@...> writes:

...

it may be useful, but if you can emulate it relatively easily with regular dict, why bother with a new type?

Why defaultdict then? It can be implemented just fine with dict.setdefault. identitydict avoids the unnatural method of having the key and value as value of the dictionary, resulting in contortions everytime the dictionary is manipulated.

Raymond Hettinger

4:11 a.m.

On May 30, 2010, at 7:56 AM, Lie Ryan wrote:

...

On 05/31/10 00:23, Benjamin Peterson wrote:

...
Lie Ryan <lie.1296@...> writes:

...
what's wrong with dict[id(key)] = foo?

. . .

...

...
it may be useful, but if you can emulate it relatively easily with regular dict, why bother with a new type?

I agree that this would be a waste. Also, I haven't seen much of a discussion of use cases. It's been possible to make identity dictionaries for a very long time, yet I can't remember the last time I saw one used. The Python language makes very few guarantees about object identity so it is mainly only usable in a few circumstances where an object is continually referenced by something outside the identity dict. I would like to see an identity dict posted an ASPN recipe to see if anyone actually uses it. Raymond

Terry Reedy

6:05 p.m.

It appears that different people are talking about two different ideas of identity dict, with different properties and possible use cased. So the statement of some do not apply to the idea held by others. A. key object is not stored in dict. d[k] = o ==> _d[id(k)] = o Faster but limited in that cannot extract keys and user code must keep keys alive. B. key object stored in dict in one of at least two ways. 1. d[k] = o ==> _d[id(k)] = (k,o) 2. d[i] = o ==> i=id(k); _dk[i] = k; _do[i] = o Slower, but (I believe) can fully emulate dicts. Either type can be generalized to f instead of id as the key transform. current vote: -.3 I am also not yet convinced, but perhaps could be, that either type, with or without generalization should be in the stdlib. Instances of user class without custom equality are already compared by identity. The use cases for keying immutables by identify is pretty sparse. That pretty much leave mutables with custom equality (by value rather than identity). Terry Jan Reedy

Jack Diederich

11:31 p.m.

On Mon, May 31, 2010 at 2:05 PM, Terry Reedy <tjreedy@udel.edu> wrote:

...

current vote: -.3 I am also not yet convinced, but perhaps could be, that either type, with or without generalization should be in the stdlib. Instances of user class without custom equality are already compared by identity. The use cases for keying immutables by identify is pretty sparse. That pretty much leave mutables with custom equality (by value rather than identity).

I'm -1 on the idea without a strong use case. I vaguely recall implementing one of these before but I think I was using it as a hacky weakrefdict. Looking in my libmisc.py for dict-alikes I see an OrderedDict (obsoleted), a ForgivingDict (obsoleted by defaultdict), a ProxyDict, and a DecorateDict. The ProxyDict can push/pop dicts and does lookups across all of them, most recent first, and performs sets in the most recent. The DecorateDict calls a function on the value before returning it. Django has classes with almost the exact same code (not contributed by me). Django: http://code.djangoproject.com/svn/django/trunk/django/utils/datastructures.p... Me: http://bazaar.launchpad.net/~odbrazen/leanlyn/trunk/annotate/head:/libmisc.p... -Jack

Benjamin Peterson

June 2010

12:44 a.m.

Raymond Hettinger <raymond.hettinger@...> writes:

...

Also, I haven't seen much of a discussion of use cases.

Here's a selection of use cases from PyPy's source (You can search for "identity_dict" to see its use): In a algorithm for breaking cycles in graphs: http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py Keeping track of all the allocated objects in a model of a low level runtime: http://codespeak.net/svn/pypy/trunk/pypy/rpython/lltypesystem/lltype.py Tracing the source of a certain kind of type as our type checker annotate RPython: http://codespeak.net/svn/pypy/trunk/pypy/annotation/bookkeeper.py Traversing the blocks of a function's graph: http://codespeak.net/svn/pypy/trunk/pypy/objspace/flow/model.py Essentially these are places where defined equality should not matter. I could also use it here: http://code.activestate.com/recipes/577242-calling-c-level-finalizers-withou...

Greg Ewing

May 2010

12:30 a.m.

Benjamin Peterson wrote:

...

Lie Ryan <lie.1296@...> writes:

...
what's wrong with dict[id(key)] = foo?

For one, you can't get the value of the key out of the dict.

Another thing is that it doesn't keep the key object alive, so if the app doesn't do anything else to ensure this, it's possible for the key to disappear and be replaced by another object having the same id. -- Greg

Raymond Hettinger

5:25 p.m.

On May 30, 2010, at 5:30 PM, Greg Ewing wrote:

...

Benjamin Peterson wrote:

...
Lie Ryan <lie.1296@...> writes:

...
what's wrong with dict[id(key)] = foo? For one, you can't get the value of the key out of the dict.

Another thing is that it doesn't keep the key object alive, so if the app doesn't do anything else to ensure this, it's possible for the key to disappear and be replaced by another object having the same id.

If you don't have an external reference to the object, how are you ever going to look it up in the dictionary? Since we can attach values directly to many objects (using instance variables, function attributes, etc) or can keep them in a tuple, there appear to be remarkably few use cases for an identity dictionary. Also, there hasn't been much discussion of implementation, but unless you're willing to copy and paste most of the code in dictobject.c, you're going to end-up with something much slower than d[id(obj)]=value. Raymond

Benjamin Peterson

June 2010

12:45 a.m.

Raymond Hettinger <raymond.hettinger@...> writes:

...

Also, there hasn't been much discussion of implementation, but unless you're willing to copy and paste most of the code in dictobject.c, you're going to end-up with something much slower than d[id(obj)]=value.

It can be implemented simply in Python: http://codespeak.net/svn/pypy/trunk/pypy/lib/identity_dict.py

Raymond Hettinger

1:23 a.m.

On May 31, 2010, at 5:45 PM, Benjamin Peterson wrote:

...

Raymond Hettinger <raymond.hettinger@...> writes:

...
Also, there hasn't been much discussion of implementation, but unless you're willing to copy and paste most of the code in dictobject.c, you're going to end-up with something much slower than d[id(obj)]=value.

It can be implemented simply in Python: http://codespeak.net/svn/pypy/trunk/pypy/lib/identity_dict.py

That code is pretty much what I expected. In CPython, it is dramatically slower than using a regular dictionary with d[id(obj)]=value. In PyPy, it makes sense because the code gets optimized as if it were hand coded in C. IOW, identity_dict.py doesn't make much sense for other implementations.

...

Here's a selection of use cases from PyPy's source (You can search for "identity_dict" to see its use):

In a algorithm for breaking cycles in graphs: http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py

This is code that doesn't require or benefit from using an identity dictionary. Regular dicts work just fine here. And since, identity-implies-equality for regular CPython dicts, you already get excellent performance (i.e. the __eq__ methods never get called when the object identities already match).

...

Keeping track of all the allocated objects in a model of a low level runtime: http://codespeak.net/svn/pypy/trunk/pypy/rpython/lltypesystem/lltype.py

This is a ton of code and I can't easily tell what it is doing or comment on it.

...

Tracing the source of a certain kind of type as our type checker annotate RPython: http://codespeak.net/svn/pypy/trunk/pypy/annotation/bookkeeper.py

Looks to be another case where a regular dict works just fine.

...

Traversing the blocks of a function's graph: http://codespeak.net/svn/pypy/trunk/pypy/objspace/flow/model.py

This code also works fine with a regular dictionary or a regular python set. If you used the identity_dict.py code mentioned above, it would just slow down the code. This isn't really even a dictionary use case, a set would be a better choice.

...

Essentially these are places where defined equality should not matter.

Essentially, these are cases where an identity dictionary isn't necessary and would in-fact be worse performance-wise in every implementation except for PyPy which can compile the pure python code for indentity_dict.py. Since instances have a default hash equal to the id and since identity-implies-equality for dictionary keys, we already have a dictionary that handles these cases. You don't even have to type: d[id(k)]=value, it would suffice to write: d[k]=value. Sorry, but I think this idea is a total waste. Perhaps post it as a recipe, but it doesn't make sense to try to inject it into the standard library. Raymond

Jacob Holm

7:52 a.m.

On 2010-06-01 03:23, Raymond Hettinger wrote:

...

On May 31, 2010, at 5:45 PM, Benjamin Peterson wrote:

...
Essentially these are places where defined equality should not matter.

"should not matter" is the important part here. It might have been clearer to say "should be ignored" instead. I think Raymond is misunderstanding it.

...

Essentially, these are cases where an identity dictionary isn't necessary and would in-fact be worse performance-wise in every implementation except for PyPy which can compile the pure python code for indentity_dict.py.

It is necessary, because the objects involved might define their own __hash__ and __cmp__/__eq__, and these should *not* be used.

...

Sorry, but I think this idea is a total waste. Perhaps post it as a recipe, but it doesn't make sense to try to inject it into the standard library.

I don't think it is a total waste, but I have seen two ideas in this thread that I find more generally useful. One is "collections.keyfuncdict", which could be trivially used as an identitydict. The other is a WeakIdentityDict, which is a WeakKeyDict that uses only the identity of the keys for hashing/equality. These two are independent, one cannot be used to implement the other (unless collections.keyfuncdict grows an option to not keep strong refs to the keys, perhaps by providing the inverse keyfunc instead). Anyway, +0.1 on identitydict and +1 on each of collection.keyfuncdict and WeakIdentityDict. - Jacob

Benjamin Peterson

5:42 p.m.

Raymond Hettinger <raymond.hettinger@...> writes:

...

Also, there hasn't been much discussion of implementation, but unless you're willing to copy and paste most of the code in dictobject.c, you're going to end-up with something much slower than d[id(obj)]=value.

A slightly hacky way to implement this on CPython without copying much would be to implement __getitem__, and __setitem__ in C and subclass it in Python to call those methods in the rest of dictionary implementation.

Imri Goldberg

May 2010

10:34 p.m.

On Sun, May 30, 2010 at 12:07 PM, Lie Ryan <lie.1296@gmail.com> wrote:

...

...
In the spirit of collections.OrderedDict and collections.defaultdict, I'd

On 05/30/10 13:27, Benjamin Peterson wrote: like

...
to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id().

This dict is very useful for keep track of objects that should not be compared by normal comparison methods. For example, I would use an identitydict to hold weak references that would otherwise fallback to their referant object's hash.

My 2 bits: 1. I implemented and used such a dict a few years ago, as part of a graph/tree algorithm. While I don't have access to the source anymore, I mainly wanted to bring up an example for the usage of such a data structure. Still, I haven't needed it since, so take from it what you will.

...

what's wrong with dict[id(key)] = foo?

2. Mostly that you want other operators to work as well. In your example, "key in dic" will return False, as __contains__ is the standard implementation. Cheers, Imri -- Imri Goldberg -------------------------------------- http://plnnr.com/ - automatic trip planning http://www.algorithm.co.il/blogs/ -------------------------------------- -- insert signature here ----

Benjamin Peterson

June 2010

2:31 a.m.

Lie Ryan <lie.1296@...> writes:

...

that their id() is expensive is implementation details, and the developer of PyPy should solve that instead of adding a clutch to the stdlib.

The stdlib isn't just about CPython. We already have optimized primitives for CPython, so I don't see why helping other implementations isn't a good cause.

Raymond Hettinger

4:37 a.m.

On May 31, 2010, at 7:31 PM, Benjamin Peterson wrote:

...

Lie Ryan <lie.1296@...> writes:

...
that their id() is expensive is implementation details, and the developer of PyPy should solve that instead of adding a clutch to the stdlib.

The stdlib isn't just about CPython. We already have optimized primitives for CPython, so I don't see why helping other implementations isn't a good cause.

Benjamin, could you elaborate of several points that are unclear: * If id() is expensive in PyPy, then how are they helped by the code in http://codespeak.net/svn/pypy/trunk/pypy/lib/identity_dict.py which uses id() for the gets and sets and contains? * In the examples you posted (such as http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py ), it appears that PyPy already has an identity dict, so how are they helped by adding one to the collections module? * Most of the posted examples already work with regular dicts (which check identity before they check equality) -- don't the other implementations already implement regular dicts which need to have identity-implied-equality in order to pass the test suite? I would expect the following snippet to work under all versions and implementations of Python: >>> class A: ... pass >>> a = A() >>> d = {a: 10} >>> assert d[a] == 10 # uses a's identity for lookup * Is the proposal something needed for all implementations or is it just an optimization for a particular, non-CPython implementation? Raymond

Benjamin Peterson

9:17 p.m.

Raymond Hettinger <raymond.hettinger@...> writes:

...

Benjamin, could you elaborate of several points that are unclear:

* If id() is expensive in PyPy, then how are they helped by the code in http://codespeak.net/svn/pypy/trunk/pypy/lib/identity_dict.py which uses id() for the gets and sets and contains?

At the top of that file, it imports from the special module __pypy__ which contains an optimized version of the dict.

...

* In the examples you posted (such

as http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py ),

...

it appears that PyPy already has an identity dict, so how are they helped by adding one to the collections module?

My purpose with those examples was to prove it as a generally useful utility.

...

* Most of the posted examples already work with regular dicts (which check

identity before they check equality) -- don't the other implementations already implement regular dicts which need to have identity-implied-equality in order to pass the test suite? I would expect the following snippet to work under all versions and implementations of Python:

...

>>> class A: ... pass >>> a = A() >>> d = {a: 10} >>> assert d[a] == 10 # uses a's identity for lookup

Yes, but that would be different if you have two "a"s with __eq__ defined to be equal and you want to hash them separately.

...

* Is the proposal something needed for all implementations or is it just an

optimization for a particular, non-CPython implementation? My contention is that an identity dictionary or at least a dictionary with custom hash and keys is a useful primitive that should be in the standard library. However, I also see its advantage in avoiding bad performance of id() based identity dicts in non-CPython implementations. It is useful to let the implementation optimize it any time there is moving GC as in Jython and IronPython where id also is expensive. (Basically a mapping has to be maintained for all objects on which id is called.)

Marcin 'Qrczak' Kowalczyk

10:05 p.m.

2010/6/1 Benjamin Peterson <benjamin@python.org>:

...

My contention is that an identity dictionary or at least a dictionary with custom hash and keys is a useful primitive that should be in the standard library. However, I also see its advantage in avoiding bad performance of id() based identity dicts in non-CPython implementations.

It is useful to let the implementation optimize it any time there is moving GC as in Jython and IronPython where id also is expensive. (Basically a mapping has to be maintained for all objects on which id is called.)

Here is how I designed this for my language: You can request the ObjectId of the given object. If an ObjectId corresponding to the given object is still alive, you always get it back again, but it can be GC'ed and later created afresh. ObjectIds are hashable and comparable (with an arbitrary ordering). Hash values and the ordering are preserved when ObjectIds are kept alive, but they may be different if ObjectIds are created afresh. An ObjectId contains an integer index which is unique among ObjectIds being alive at the same time. You can make a dictionary with a specified key function. It is internally backed by something equivalent to f(k) -> (k, v) dict. A dictionary with ObjectId constructor as the key is an identity dictionary; it works because it keeps both k and f(k) alive. An advantage of this scheme is that with a moving GC the id mapping must be maintained only for objects for which the program keeps their ObjectIds alive. A disadvantage is that the program must be careful to not use ObjectIds in a manner which does not keep them alive yet expects consistent hashing and ordering. In particular a key-mapped dict which would store only (k, v) pairs and compute f(k) on the fly would not work. Also ObjectIds cannot be used to generate printable unique identifiers which would be valid without having to keep ObjectIds alive, like in Python's default repr. -- Marcin Kowalczyk

Raymond Hettinger

4:37 p.m.

...

...
* In the examples you posted (such as http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py ), it appears that PyPy already has an identity dict, so how are they helped by adding one to the collections module?

My purpose with those examples was to prove it as a generally useful utility.

...
* Most of the posted examples already work with regular dicts (which check

identity before they check equality) -- don't the other implementations already implement regular dicts which need to have identity-implied-equality in order to pass the test suite? I would expect the following snippet to work under all versions and implementations of Python:

...
>>> class A: ... pass >>> a = A() >>> d = {a: 10} >>> assert d[a] == 10 # uses a's identity for lookup

Yes, but that would be different if you have two "a"s with __eq__ defined to be equal and you want to hash them separately.

None of the presented examples take advantage of that property. All of them work with regular dictionaries. This proposal is still use case challenged. AFAICT from code searches, the idea of needing to override an existing __eq__ with an identity-only comparison seems to never come up. It would not even be popular as an ASPN recipe. Moreover, I think that including it in the standard library would be harmful. The language makes very few guarantees about object identity. In most cases a user would far better off using a regular dictionary. If a rare case arose where __eq__ needed to be overridden with an identity-only check, it is not hard to write d[id(obj)]=value. Strong -1 on including this in the standard library. Raymond P.S. ISTM that including subtly different variations of a data type does more harm than good. Understanding how to use an identity dictionary correctly requires understanding the nuances of object identity, how to keep the object alive outside the dictionary (even if the dictionary keeps it alive, a user still needs an external reference to be able to do a lookup), and knowing that the version proposed for CPython has dramatically worse speed/space performance than a regular dictionary. The very existence of an identity dictionary in collections is likely to distract a user away from a better solution using: d[id(obj)]=value.

Benjamin Peterson

5:38 p.m.

Raymond Hettinger <raymond.hettinger@...> writes:

...

None of the presented examples take advantage of that property. All of them work with regular dictionaries. This proposal is still use case challenged.

Besides that ASPN recipe of mine I mentioned before [1], here are some more examples: - copy.deepcopy() and pickle use it for an object memo. - keeping track of protocol versions in Twisted. [2] - memo in a serialization protocol. [3] [1] http://code.activestate.com/recipes/577242-calling-c-level-finalizers- without-__del__/ [2] http://twistedmatrix.com/trac/browser/trunk/twisted/persisted/ styles.py [3] http://twistedmatrix.com/trac/browser/trunk/twisted/spread/jelly.py

...

AFAICT from code searches, the idea of needing to override an existing __eq__ with an identity-only comparison seems to never come up. It would not even be popular as an ASPN recipe.

Moreover, I think that including it in the standard library would be harmful. The language makes very few guarantees about object identity. In most cases a user would far better off using a regular dictionary. If a rare case arose where __eq__ needed to be overridden with an identity-only check, it is not hard to write d[id(obj)]=value.

On the other hand, d[id(obj)] can be dangerous and incorrect on CPython:

...

...
...
d = {} d[id([])] = 10 d[id([])] 10

...

Strong -1 on including this in the standard library.

How do you feel about Antoine's keyfuncdict proposal?

...

Raymond

P.S. ISTM that including subtly different variations of a data type does more harm than good. Understanding how to use an identity dictionary correctly requires understanding the nuances of object identity,

We're all adults here. We provide WeakKeyedDict and friends even though they rely on unpredictable and subtle garbage collection.

...

how to keep the object alive outside the dictionary (even if the dictionary keeps it alive, a user still needs an external reference to be able to do a lookup), and knowing that the version proposed for CPython has dramatically worse speed/space performance than a regular dictionary. The very existence of an identity dictionary in collections is likely to distract a user away from a better solution using: d[id(obj)]=value.

I would argue that that's not a better solution given the above example. Anyone using id(obj) would have understand the nuances of object identity perhaps more than a real identity dictionary.

Raymond Hettinger

5:58 p.m.

On Jun 2, 2010, at 10:38 AM, Benjamin Peterson wrote:

...

...
Strong -1 on including this in the standard library.

How do you feel about Antoine's keyfuncdict proposal?

His proposal is more interesting because it is more general. For example, a keyfuncdict may make it easy to implement a case-insensitive dictionary. I would like to see a concrete implementation to see what design choices it makes. Raymond

Philip Jenvey

7:02 p.m.

On Jun 2, 2010, at 9:37 AM, Raymond Hettinger wrote:

...

Moreover, I think that including it in the standard library would be harmful. The language makes very few guarantees about object identity. In most cases a user would far better off using a regular dictionary. If a rare case arose where __eq__ needed to be overridden with an identity-only check, it is not hard to write d[id(obj)]=value.

Strong -1 on including this in the standard library.

P.S. ISTM that including subtly different variations of a data type does more harm than good. Understanding how to use an identity dictionary correctly requires understanding the nuances of object identity, how to keep the object alive outside the dictionary (even if the dictionary keeps it alive, a user still needs an external reference to be able to do a lookup), and knowing that the version proposed for CPython has dramatically worse speed/space performance than a regular dictionary. The very existence of an identity dictionary in collections is likely to distract a user away from a better solution using: d[id(obj)]=value.

...

...
Essentially these are places where defined equality should not matter.

Essentially, these are cases where an identity dictionary isn't necessary and would in-fact be worse performance-wise in every implementation except for PyPy which can compile the pure python code for indentity_dict.py.

Using id() is a workaround but again, a potentially expensive one for platforms with moving GCs. Every object calling for an id() forces additional bookkeeping on their ends. This is only a better solution for CPython. Whereas abstracting this out into an identitydict type gives all platforms the chance to provide their own optimized versions.

...

Since instances have a default hash equal to the id and since identity-implies-equality for dictionary keys, we already have a dictionary that handles these cases. You don't even have to type: d[id(k)]=value, it would suffice to write: d[k]=value.

No, the default hash backed by id is a CPython implementation detail. Another use case is just the fact that Python allows you to completely change the semantics of __eq__ (and for good reason). Though this is rare, take SQLAlchemy's SQL expression DSL for example, that has it generate a where clause: table.select(table.c.id == 4) # table.c.id == 4 returns a "<sql statement> where id == 4" object I don't see how a platform like Jython can provide an optimized identitydict that avoids id() calls via keyfuncdict(key=id). The keys() of said dict would need to be actual results of id() calls. I'm +1 on an identitydict as long as its CPython implementation doesn't provide worse performance than the id workaround. -- Philip Jenvey

Raymond Hettinger

7:43 p.m.

...

...
P.S. ISTM that including subtly different variations of a data type does more harm than good. Understanding how to use an identity dictionary correctly requires understanding the nuances of object identity, how to keep the object alive outside the dictionary (even if the dictionary keeps it alive, a user still needs an external reference to be able to do a lookup), and knowing that the version proposed for CPython has dramatically worse speed/space performance than a regular dictionary. The very existence of an identity dictionary in collections is likely to distract a user away from a better solution using: d[id(obj)]=value.

...
...
Essentially these are places where defined equality should not matter.

Essentially, these are cases where an identity dictionary isn't necessary and would in-fact be worse performance-wise in every implementation except for PyPy which can compile the pure python code for indentity_dict.py.

Using id() is a workaround but again, a potentially expensive one for platforms with moving GCs. Every object calling for an id() forces additional bookkeeping on their ends. This is only a better solution for CPython.

To be clear, most the examples given so far work with regular dictionaries even without using id(). The exception was something system specific such as the pickling mechanism. So what we're talking about is the comparatively rare case when an object has an __eq__ method and you want that method to be ignored. For example, you have two tuples (3,5) and (3,5) which are equal but happen to be distinct in memory and your needs are: * to treat the two equal objects as being distinct for some purpose * to run faster than id() runs on non-CPython implementations * don't care if the code is dog slow on CPython (i.e. slower than if you had used id()) * don't care that the two tuples being distinct is memory is not a guaranteed behavior across implementations (i.e. any implementation is free to make all equal tuples share the same id via interning) FWIW, I spoke with Jim Baker about this yesterday and he believes that Jython has no need for an identity dict. Raymond P.S. If Antoine's keyfuncdict proposal gains traction, it would be possible for other implementations to create a fast special case for key=id.

Antoine Pitrou

8:04 p.m.

On Thu, 3 Jun 2010 12:02:44 -0700 Philip Jenvey <pjenvey@underboss.org> wrote:

...

Using id() is a workaround but again, a potentially expensive one for platforms with moving GCs. Every object calling for an id() forces additional bookkeeping on their ends. This is only a better solution for CPython.

Well, CPython and all other implementations with a non-moving GC.

...

Whereas abstracting this out into an identitydict type gives all platforms the chance to provide their own optimized versions.

That's really premature optimization. Regards Antoine.

Michael

May 2010

9:29 a.m.

+1 - this has been useful to me in other languages. (apologies for top-post, sent from phone.) Michael Foord -- http://www.ironpythoninaction.com On 30 May 2010, at 04:27, Benjamin Peterson <benjamin@python.org> wrote:

...

In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id().

This dict is very useful for keep track of objects that should not be compared by normal comparison methods. For example, I would use an identitydict to hold weak references that would otherwise fallback to their referant object's hash.

An advantage of formalizing this in collections would be to enable other Python implementations like PyPy, where id() is expensive, to provide an optimized identitydict.

Regards, Benjamin

_______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas

Antoine Pitrou

10:34 a.m.

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:

...

In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id().

Perhaps it would be more useful to add a generic collections.keyfuncdict, taking a function which applied to a key gives the real key value used for lookups. Your identity dict would be created as: d = collections.keyfuncdict(id) But of course, you can just use a normal dict: d = {} d[id(key)] = key, value (actually, this could be how a collections.keyfuncdict gets implemented) Regards Antoine.

Lie Ryan

11:07 a.m.

On 05/30/10 20:34, Antoine Pitrou wrote:

...

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:

...
In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id().

Perhaps it would be more useful to add a generic collections.keyfuncdict, taking a function which applied to a key gives the real key value used for lookups. Your identity dict would be created as: d = collections.keyfuncdict(id)

But of course, you can just use a normal dict: d = {} d[id(key)] = key, value

(actually, this could be how a collections.keyfuncdict gets implemented)

+1 on this

Mathias Panzenböck

1:43 p.m.

On 05/30/2010 12:34 PM, Antoine Pitrou wrote:

...

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson<benjamin@python.org> wrote:

...
In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id().

Perhaps it would be more useful to add a generic collections.keyfuncdict, taking a function which applied to a key gives the real key value used for lookups. Your identity dict would be created as: d = collections.keyfuncdict(id)

But of course, you can just use a normal dict: d = {} d[id(key)] = key, value

(actually, this could be how a collections.keyfuncdict gets implemented)

Regards

Antoine.

Yes, something like this or a dict wihch you can pass a custom hash and compare function would be a good idea, imho. +1 -panzi

Benjamin Peterson

2:27 p.m.

Antoine Pitrou <solipsis@...> writes:

...

Perhaps it would be more useful to add a generic collections.keyfuncdict, taking a function which applied to a key

Perhaps; I'm only really interested in the identity version. That could be specified by keyfuncdict(None, None).

Antoine Pitrou

2:54 p.m.

On Sun, 30 May 2010 14:27:05 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:

...

Antoine Pitrou <solipsis@...> writes:

...
Perhaps it would be more useful to add a generic collections.keyfuncdict, taking a function which applied to a key

Perhaps; I'm only really interested in the identity version. That could be specified by keyfuncdict(None, None).

Just keyfuncdict(None) or keyfuncdict(id), then. I'm not proposing that the hash() and eq() function be passed, but really a factory function which is used for computing the actual lookup key when doing inserts and lookups (which then implicitly use the natural hash and eq functions for that lookup key, as a normal dict does). Regards Antoine.

Daniel Stutzbach

7:37 p.m.

On Sun, May 30, 2010 at 6:34 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

Perhaps it would be more useful to add a generic collections.keyfuncdict, taking a function which applied to a key gives the real key value used for lookups. Your identity dict would be created as: d = collections.keyfuncdict(id)

For what it's worth, my sorteddict class (part of the blist library) already supports exactly that syntax. The full signature is: sorteddict([key[, arg]], **kw) where "arg" and "kw" do exactly what they do for dict(), and "key" specifies a key function just as in your example. (Although now that I think about it, perhaps "arg" and "key" should be reversed to more closely match the signature of dict()) Example usage:

...

...
...
from blist import sorteddict d = sorteddict(id) d[(1,2)] = True d[(1,2)] = True d.keys() [(1, 2), (1, 2)]

I'm not suggesting that Benjamin use sorteddict for his use-case, since keeping the keys sorted by id() is probably not useful. ;-) In some ways it would be nice if dict() itself accepted a "key" function as an argument, but I don't see a way to provide that functionality without adding another field to PyDictObject (yuck) or abusing the ma_lookup field (again, yuck). I favor adding the more general keyfuncdict over identitydict. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

spir ☣

11:20 a.m.

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:

...

In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id().

This is how Lua works, systematically. To make this work, all "primitive" values (non-table, since table is the single container type) are interned. Thus, for "values", equality and identity are equivalent. Tables instead are considered referenced objects, even when they are used as collections. So that Lua can quietly hash the ref instead of the value. * simplicity * performance * anything can be used as key The drawback is that any structured value, eg an x,y position stored in a table, is compared by reference. A lookup with a position used as key thus cannot return an item stored with an equal key:

...

p1 = {x=1,y=2} p2 = {x=1,y=2} t = {[p1]="abc"} print(t[p1],tp2) abc nil Conversely, collections which, conceptually, are referenced things, do not erroneaously return false positive because of value comparison. (The case does not happen in python since collections cannot be used as keys).

For the scheme to really extend to any use case, there should be a kind of mark allowing the programmer to state whether a table is, conceptually, a value (simply happens to be structured) or a referenced thing. Value tables should then be interned like other values -- but this is certainly difficult and costly. Denis ________________________________ vit esse estrany ☣ spir.wikidot.com

Bill Janssen

June 2010

5:03 p.m.

Denis, if you're going to post to python-ideas, would you mind taking that biohazard symbol out of your user name? My Emacs-based mail reader thrashes for quite a while trying to find a glyph for it before it gives up and renders it as a hollow rectangle. I'd normally just add you to my kill-file, but I hate to give up on python-ideas people that fast. I'm sure you're not really a biohazard :-). Bill

spir ☣

May 2010

11:28 a.m.

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:

...

This dict is very useful for keep track of objects that should not be compared by normal comparison methods. For example, I would use an identitydict to hold weak references that would otherwise fallback to their referant object's hash.

The main point is certainly that the limitation of "hashability" disappears... Everything can be used as key, if actually the ref is hashed instead of the value. Denis ________________________________ vit esse estrany ☣ spir.wikidot.com

Nick Coghlan

12:05 p.m.

On 30/05/10 13:27, Benjamin Peterson wrote:

...

In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id().

This dict is very useful for keep track of objects that should not be compared by normal comparison methods. For example, I would use an identitydict to hold weak references that would otherwise fallback to their referant object's hash.

An advantage of formalizing this in collections would be to enable other Python implementations like PyPy, where id() is expensive, to provide an optimized identitydict.

How would dict equality be defined? In terms of values, or in terms of value identity? Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Benjamin Peterson

2:18 p.m.

Nick Coghlan <ncoghlan@...> writes:

...

How would dict equality be defined? In terms of values, or in terms of value identity?

Completely identity. __eq__ and __hash__ should never be called.

Mike Graham

5:04 p.m.

What is the use case for such a collection? (This will help answer questions about how stuff like comparing two identity dicts or an identity dict and another mapping will work.) By default, objects __eq__ and __hash__ methods allow them to be used in dicts by identity. When I define __eq__ to mean something else, I find that I don't typically have something I can really store in a dict with the behavior I want. Should an ID Dict really be a weak ref identity dict? The one time I wanted an ID dict is when I was writing a memoization routine for methods, where I wanted to store the caches for each object. (Incidentally, I ended up refactoring not to require this anyhow.) Is this the main basic use case? Am I missing something? Best regards, Mike On Sat, May 29, 2010 at 10:27 PM, Benjamin Peterson <benjamin@python.org> wrote:

...

In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id().

This dict is very useful for keep track of objects that should not be compared by normal comparison methods. For example, I would use an identitydict to hold weak references that would otherwise fallback to their referant object's hash.

An advantage of formalizing this in collections would be to enable other Python implementations like PyPy, where id() is expensive, to provide an optimized identitydict.

Regards, Benjamin

Greg Ewing

12:37 a.m.

Mike Graham wrote:

...

Should an ID Dict really be a weak ref identity dict? ... Is this the main basic use case?

No, I don't think it's the main use case, although if an identitydict is added, it would make sense to also add a WeakIdentityDict to the weakref module. -- Greg

M.-A. Lemburg

9:16 p.m.

Benjamin Peterson wrote:

...

In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id().

This dict is very useful for keep track of objects that should not be compared by normal comparison methods. For example, I would use an identitydict to hold weak references that would otherwise fallback to their referant object's hash.

An advantage of formalizing this in collections would be to enable other Python implementations like PyPy, where id() is expensive, to provide an optimized identitydict.

Are you sure this is a good idea for the stdlib ? id(obj) gives you the current storage address of an object in CPython and these can be reused over time. Using a key object to such a dict would not secure it's persistence throughout the usage lifetime in that dict, since d[id(key)] = 1 doesn't increase the refcount of key. Such reuse is bound to happen often for some Python object types, due to the free-lists we're using in CPython and the Python allocator which is optimized for reuse of fixed-size memory chunks. For experts, such a dictionary may have some value (e.g. to work around the hash() requirement of normal dictionaries), but I feel that newbies would have a hard time understanding all the implications. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 30 2010)

...

...
...
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

2010-07-19: EuroPython 2010, Birmingham, UK 49 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

Benjamin Peterson

10:17 p.m.

M.-A. Lemburg <mal@...> writes:

...

Are you sure this is a good idea for the stdlib ?

id(obj) gives you the current storage address of an object in CPython and these can be reused over time.

But since the dictionary will hold a reference to the key, it won't be reused.

...

For experts, such a dictionary may have some value (e.g. to work around the hash() requirement of normal dictionaries), but I feel that newbies would have a hard time understanding all the implications.

I'm not proposing a dictionary that would not hold a reference to its keys.

5378

Age (days ago)

5382

Last active (days ago)

List overview

Download

43 comments

20 participants

participants (20)

Antoine Pitrou
Benjamin Peterson
Bill Janssen
Daniel Stutzbach
Greg Ewing
Imri Goldberg
Jack Diederich
Jacob Holm
Lie Ryan
M.-A. Lemburg
Marcin 'Qrczak' Kowalczyk
Masklinn
Mathias Panzenböck
Michael
Mike Graham
Nick Coghlan
Philip Jenvey
Raymond Hettinger
spir ☣
Terry Reedy

An identity dict

Jacob Holm

Imri Goldberg

Marcin 'Qrczak' Kowalczyk

Mathias Panzenböck

Daniel Stutzbach

spir ☣

spir ☣

tags

participants (20)