
In the spirit of collections.OrderedDict and collections.defaultdict, I'd like to propose collections.identitydict. It would function just like a normal dictionary, but ignore hash values and comparison operators and merely lookup keys based on the key's id(). This dict is very useful for keep track of objects that should not be compared by normal comparison methods. For example, I would use an identitydict to hold weak references that would otherwise fallback to their referant object's hash. An advantage of formalizing this in collections would be to enable other Python implementations like PyPy, where id() is expensive, to provide an optimized identitydict. Regards, Benjamin

On 2010-05-30, at 11:07 , Lie Ryan wrote:
Actually, I'd say the exact opposite: CPython's identity being cheap is an implementation detail, not the other way around, and that is what shouldn't be relied on. In that light, a special-purpose identity dictionary independent from implementation-specific and artificially low id() computation costs is a pretty good idea.

On 05/30/10 20:35, Masklinn wrote:
the performance of any Python statement/function is implementation detail. Whether any statement/function is fast or cheap shouldn't be relied on. IMHO, adding a brand new collection module just because a Python implementation has a fast/cheap operations isn't a good reason.

Lie Ryan <lie.1296@...> writes:
it may be useful, but if you can emulate it relatively easily with regular dict, why bother with a new type?
Why defaultdict then? It can be implemented just fine with dict.setdefault. identitydict avoids the unnatural method of having the key and value as value of the dictionary, resulting in contortions everytime the dictionary is manipulated.

On May 30, 2010, at 7:56 AM, Lie Ryan wrote:
. . .
it may be useful, but if you can emulate it relatively easily with regular dict, why bother with a new type?
I agree that this would be a waste. Also, I haven't seen much of a discussion of use cases. It's been possible to make identity dictionaries for a very long time, yet I can't remember the last time I saw one used. The Python language makes very few guarantees about object identity so it is mainly only usable in a few circumstances where an object is continually referenced by something outside the identity dict. I would like to see an identity dict posted an ASPN recipe to see if anyone actually uses it. Raymond

It appears that different people are talking about two different ideas of identity dict, with different properties and possible use cased. So the statement of some do not apply to the idea held by others. A. key object is not stored in dict. d[k] = o ==> _d[id(k)] = o Faster but limited in that cannot extract keys and user code must keep keys alive. B. key object stored in dict in one of at least two ways. 1. d[k] = o ==> _d[id(k)] = (k,o) 2. d[i] = o ==> i=id(k); _dk[i] = k; _do[i] = o Slower, but (I believe) can fully emulate dicts. Either type can be generalized to f instead of id as the key transform. current vote: -.3 I am also not yet convinced, but perhaps could be, that either type, with or without generalization should be in the stdlib. Instances of user class without custom equality are already compared by identity. The use cases for keying immutables by identify is pretty sparse. That pretty much leave mutables with custom equality (by value rather than identity). Terry Jan Reedy

On Mon, May 31, 2010 at 2:05 PM, Terry Reedy <tjreedy@udel.edu> wrote:
I'm -1 on the idea without a strong use case. I vaguely recall implementing one of these before but I think I was using it as a hacky weakrefdict. Looking in my libmisc.py for dict-alikes I see an OrderedDict (obsoleted), a ForgivingDict (obsoleted by defaultdict), a ProxyDict, and a DecorateDict. The ProxyDict can push/pop dicts and does lookups across all of them, most recent first, and performs sets in the most recent. The DecorateDict calls a function on the value before returning it. Django has classes with almost the exact same code (not contributed by me). Django: http://code.djangoproject.com/svn/django/trunk/django/utils/datastructures.p... Me: http://bazaar.launchpad.net/~odbrazen/leanlyn/trunk/annotate/head:/libmisc.p... -Jack

Raymond Hettinger <raymond.hettinger@...> writes:
Also, I haven't seen much of a discussion of use cases.
Here's a selection of use cases from PyPy's source (You can search for "identity_dict" to see its use): In a algorithm for breaking cycles in graphs: http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py Keeping track of all the allocated objects in a model of a low level runtime: http://codespeak.net/svn/pypy/trunk/pypy/rpython/lltypesystem/lltype.py Tracing the source of a certain kind of type as our type checker annotate RPython: http://codespeak.net/svn/pypy/trunk/pypy/annotation/bookkeeper.py Traversing the blocks of a function's graph: http://codespeak.net/svn/pypy/trunk/pypy/objspace/flow/model.py Essentially these are places where defined equality should not matter. I could also use it here: http://code.activestate.com/recipes/577242-calling-c-level-finalizers-withou...

On May 30, 2010, at 5:30 PM, Greg Ewing wrote:
If you don't have an external reference to the object, how are you ever going to look it up in the dictionary? Since we can attach values directly to many objects (using instance variables, function attributes, etc) or can keep them in a tuple, there appear to be remarkably few use cases for an identity dictionary. Also, there hasn't been much discussion of implementation, but unless you're willing to copy and paste most of the code in dictobject.c, you're going to end-up with something much slower than d[id(obj)]=value. Raymond

Raymond Hettinger <raymond.hettinger@...> writes:
It can be implemented simply in Python: http://codespeak.net/svn/pypy/trunk/pypy/lib/identity_dict.py

On May 31, 2010, at 5:45 PM, Benjamin Peterson wrote:
That code is pretty much what I expected. In CPython, it is dramatically slower than using a regular dictionary with d[id(obj)]=value. In PyPy, it makes sense because the code gets optimized as if it were hand coded in C. IOW, identity_dict.py doesn't make much sense for other implementations.
This is code that doesn't require or benefit from using an identity dictionary. Regular dicts work just fine here. And since, identity-implies-equality for regular CPython dicts, you already get excellent performance (i.e. the __eq__ methods never get called when the object identities already match).
Keeping track of all the allocated objects in a model of a low level runtime: http://codespeak.net/svn/pypy/trunk/pypy/rpython/lltypesystem/lltype.py
This is a ton of code and I can't easily tell what it is doing or comment on it.
Tracing the source of a certain kind of type as our type checker annotate RPython: http://codespeak.net/svn/pypy/trunk/pypy/annotation/bookkeeper.py
Looks to be another case where a regular dict works just fine.
Traversing the blocks of a function's graph: http://codespeak.net/svn/pypy/trunk/pypy/objspace/flow/model.py
This code also works fine with a regular dictionary or a regular python set. If you used the identity_dict.py code mentioned above, it would just slow down the code. This isn't really even a dictionary use case, a set would be a better choice.
Essentially these are places where defined equality should not matter.
Essentially, these are cases where an identity dictionary isn't necessary and would in-fact be worse performance-wise in every implementation except for PyPy which can compile the pure python code for indentity_dict.py. Since instances have a default hash equal to the id and since identity-implies-equality for dictionary keys, we already have a dictionary that handles these cases. You don't even have to type: d[id(k)]=value, it would suffice to write: d[k]=value. Sorry, but I think this idea is a total waste. Perhaps post it as a recipe, but it doesn't make sense to try to inject it into the standard library. Raymond

On 2010-06-01 03:23, Raymond Hettinger wrote:
On May 31, 2010, at 5:45 PM, Benjamin Peterson wrote:
Essentially these are places where defined equality should not matter.
"should not matter" is the important part here. It might have been clearer to say "should be ignored" instead. I think Raymond is misunderstanding it.
It is necessary, because the objects involved might define their own __hash__ and __cmp__/__eq__, and these should *not* be used.
I don't think it is a total waste, but I have seen two ideas in this thread that I find more generally useful. One is "collections.keyfuncdict", which could be trivially used as an identitydict. The other is a WeakIdentityDict, which is a WeakKeyDict that uses only the identity of the keys for hashing/equality. These two are independent, one cannot be used to implement the other (unless collections.keyfuncdict grows an option to not keep strong refs to the keys, perhaps by providing the inverse keyfunc instead). Anyway, +0.1 on identitydict and +1 on each of collection.keyfuncdict and WeakIdentityDict. - Jacob

On Sun, May 30, 2010 at 12:07 PM, Lie Ryan <lie.1296@gmail.com> wrote:
My 2 bits: 1. I implemented and used such a dict a few years ago, as part of a graph/tree algorithm. While I don't have access to the source anymore, I mainly wanted to bring up an example for the usage of such a data structure. Still, I haven't needed it since, so take from it what you will.
what's wrong with dict[id(key)] = foo?
2. Mostly that you want other operators to work as well. In your example, "key in dic" will return False, as __contains__ is the standard implementation. Cheers, Imri -- Imri Goldberg -------------------------------------- http://plnnr.com/ - automatic trip planning http://www.algorithm.co.il/blogs/ -------------------------------------- -- insert signature here ----

On May 31, 2010, at 7:31 PM, Benjamin Peterson wrote:
Benjamin, could you elaborate of several points that are unclear: * If id() is expensive in PyPy, then how are they helped by the code in http://codespeak.net/svn/pypy/trunk/pypy/lib/identity_dict.py which uses id() for the gets and sets and contains? * In the examples you posted (such as http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py ), it appears that PyPy already has an identity dict, so how are they helped by adding one to the collections module? * Most of the posted examples already work with regular dicts (which check identity before they check equality) -- don't the other implementations already implement regular dicts which need to have identity-implied-equality in order to pass the test suite? I would expect the following snippet to work under all versions and implementations of Python: >>> class A: ... pass >>> a = A() >>> d = {a: 10} >>> assert d[a] == 10 # uses a's identity for lookup * Is the proposal something needed for all implementations or is it just an optimization for a particular, non-CPython implementation? Raymond

Raymond Hettinger <raymond.hettinger@...> writes:
At the top of that file, it imports from the special module __pypy__ which contains an optimized version of the dict.
* In the examples you posted (such
as http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py ),
it appears that PyPy already has an identity dict, so how are they helped by adding one to the collections module?
My purpose with those examples was to prove it as a generally useful utility.
* Most of the posted examples already work with regular dicts (which check
identity before they check equality) -- don't the other implementations already implement regular dicts which need to have identity-implied-equality in order to pass the test suite? I would expect the following snippet to work under all versions and implementations of Python:
Yes, but that would be different if you have two "a"s with __eq__ defined to be equal and you want to hash them separately.
* Is the proposal something needed for all implementations or is it just an
optimization for a particular, non-CPython implementation? My contention is that an identity dictionary or at least a dictionary with custom hash and keys is a useful primitive that should be in the standard library. However, I also see its advantage in avoiding bad performance of id() based identity dicts in non-CPython implementations. It is useful to let the implementation optimize it any time there is moving GC as in Jython and IronPython where id also is expensive. (Basically a mapping has to be maintained for all objects on which id is called.)

2010/6/1 Benjamin Peterson <benjamin@python.org>:
Here is how I designed this for my language: You can request the ObjectId of the given object. If an ObjectId corresponding to the given object is still alive, you always get it back again, but it can be GC'ed and later created afresh. ObjectIds are hashable and comparable (with an arbitrary ordering). Hash values and the ordering are preserved when ObjectIds are kept alive, but they may be different if ObjectIds are created afresh. An ObjectId contains an integer index which is unique among ObjectIds being alive at the same time. You can make a dictionary with a specified key function. It is internally backed by something equivalent to f(k) -> (k, v) dict. A dictionary with ObjectId constructor as the key is an identity dictionary; it works because it keeps both k and f(k) alive. An advantage of this scheme is that with a moving GC the id mapping must be maintained only for objects for which the program keeps their ObjectIds alive. A disadvantage is that the program must be careful to not use ObjectIds in a manner which does not keep them alive yet expects consistent hashing and ordering. In particular a key-mapped dict which would store only (k, v) pairs and compute f(k) on the fly would not work. Also ObjectIds cannot be used to generate printable unique identifiers which would be valid without having to keep ObjectIds alive, like in Python's default repr. -- Marcin Kowalczyk

None of the presented examples take advantage of that property. All of them work with regular dictionaries. This proposal is still use case challenged. AFAICT from code searches, the idea of needing to override an existing __eq__ with an identity-only comparison seems to never come up. It would not even be popular as an ASPN recipe. Moreover, I think that including it in the standard library would be harmful. The language makes very few guarantees about object identity. In most cases a user would far better off using a regular dictionary. If a rare case arose where __eq__ needed to be overridden with an identity-only check, it is not hard to write d[id(obj)]=value. Strong -1 on including this in the standard library. Raymond P.S. ISTM that including subtly different variations of a data type does more harm than good. Understanding how to use an identity dictionary correctly requires understanding the nuances of object identity, how to keep the object alive outside the dictionary (even if the dictionary keeps it alive, a user still needs an external reference to be able to do a lookup), and knowing that the version proposed for CPython has dramatically worse speed/space performance than a regular dictionary. The very existence of an identity dictionary in collections is likely to distract a user away from a better solution using: d[id(obj)]=value.

Raymond Hettinger <raymond.hettinger@...> writes:
Besides that ASPN recipe of mine I mentioned before [1], here are some more examples: - copy.deepcopy() and pickle use it for an object memo. - keeping track of protocol versions in Twisted. [2] - memo in a serialization protocol. [3] [1] http://code.activestate.com/recipes/577242-calling-c-level-finalizers- without-__del__/ [2] http://twistedmatrix.com/trac/browser/trunk/twisted/persisted/ styles.py [3] http://twistedmatrix.com/trac/browser/trunk/twisted/spread/jelly.py
On the other hand, d[id(obj)] can be dangerous and incorrect on CPython:
Strong -1 on including this in the standard library.
How do you feel about Antoine's keyfuncdict proposal?
We're all adults here. We provide WeakKeyedDict and friends even though they rely on unpredictable and subtle garbage collection.
I would argue that that's not a better solution given the above example. Anyone using id(obj) would have understand the nuances of object identity perhaps more than a real identity dictionary.

On Jun 2, 2010, at 10:38 AM, Benjamin Peterson wrote:
Strong -1 on including this in the standard library.
How do you feel about Antoine's keyfuncdict proposal?
His proposal is more interesting because it is more general. For example, a keyfuncdict may make it easy to implement a case-insensitive dictionary. I would like to see a concrete implementation to see what design choices it makes. Raymond

On Jun 2, 2010, at 9:37 AM, Raymond Hettinger wrote:
Using id() is a workaround but again, a potentially expensive one for platforms with moving GCs. Every object calling for an id() forces additional bookkeeping on their ends. This is only a better solution for CPython. Whereas abstracting this out into an identitydict type gives all platforms the chance to provide their own optimized versions.
No, the default hash backed by id is a CPython implementation detail. Another use case is just the fact that Python allows you to completely change the semantics of __eq__ (and for good reason). Though this is rare, take SQLAlchemy's SQL expression DSL for example, that has it generate a where clause: table.select(table.c.id == 4) # table.c.id == 4 returns a "<sql statement> where id == 4" object I don't see how a platform like Jython can provide an optimized identitydict that avoids id() calls via keyfuncdict(key=id). The keys() of said dict would need to be actual results of id() calls. I'm +1 on an identitydict as long as its CPython implementation doesn't provide worse performance than the id workaround. -- Philip Jenvey

To be clear, most the examples given so far work with regular dictionaries even without using id(). The exception was something system specific such as the pickling mechanism. So what we're talking about is the comparatively rare case when an object has an __eq__ method and you want that method to be ignored. For example, you have two tuples (3,5) and (3,5) which are equal but happen to be distinct in memory and your needs are: * to treat the two equal objects as being distinct for some purpose * to run faster than id() runs on non-CPython implementations * don't care if the code is dog slow on CPython (i.e. slower than if you had used id()) * don't care that the two tuples being distinct is memory is not a guaranteed behavior across implementations (i.e. any implementation is free to make all equal tuples share the same id via interning) FWIW, I spoke with Jim Baker about this yesterday and he believes that Jython has no need for an identity dict. Raymond P.S. If Antoine's keyfuncdict proposal gains traction, it would be possible for other implementations to create a fast special case for key=id.

On Thu, 3 Jun 2010 12:02:44 -0700 Philip Jenvey <pjenvey@underboss.org> wrote:
Well, CPython and all other implementations with a non-moving GC.
Whereas abstracting this out into an identitydict type gives all platforms the chance to provide their own optimized versions.
That's really premature optimization. Regards Antoine.

+1 - this has been useful to me in other languages. (apologies for top-post, sent from phone.) Michael Foord -- http://www.ironpythoninaction.com On 30 May 2010, at 04:27, Benjamin Peterson <benjamin@python.org> wrote:

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:
Perhaps it would be more useful to add a generic collections.keyfuncdict, taking a function which applied to a key gives the real key value used for lookups. Your identity dict would be created as: d = collections.keyfuncdict(id) But of course, you can just use a normal dict: d = {} d[id(key)] = key, value (actually, this could be how a collections.keyfuncdict gets implemented) Regards Antoine.

On Sun, 30 May 2010 14:27:05 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:
Just keyfuncdict(None) or keyfuncdict(id), then. I'm not proposing that the hash() and eq() function be passed, but really a factory function which is used for computing the actual lookup key when doing inserts and lookups (which then implicitly use the natural hash and eq functions for that lookup key, as a normal dict does). Regards Antoine.

On Sun, May 30, 2010 at 6:34 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
For what it's worth, my sorteddict class (part of the blist library) already supports exactly that syntax. The full signature is: sorteddict([key[, arg]], **kw) where "arg" and "kw" do exactly what they do for dict(), and "key" specifies a key function just as in your example. (Although now that I think about it, perhaps "arg" and "key" should be reversed to more closely match the signature of dict()) Example usage:
I'm not suggesting that Benjamin use sorteddict for his use-case, since keeping the keys sorted by id() is probably not useful. ;-) In some ways it would be nice if dict() itself accepted a "key" function as an argument, but I don't see a way to provide that functionality without adding another field to PyDictObject (yuck) or abusing the ma_lookup field (again, yuck). I favor adding the more general keyfuncdict over identitydict. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:
This is how Lua works, systematically. To make this work, all "primitive" values (non-table, since table is the single container type) are interned. Thus, for "values", equality and identity are equivalent. Tables instead are considered referenced objects, even when they are used as collections. So that Lua can quietly hash the ref instead of the value. * simplicity * performance * anything can be used as key The drawback is that any structured value, eg an x,y position stored in a table, is compared by reference. A lookup with a position used as key thus cannot return an item stored with an equal key:
For the scheme to really extend to any use case, there should be a kind of mark allowing the programmer to state whether a table is, conceptually, a value (simply happens to be structured) or a referenced thing. Value tables should then be interned like other values -- but this is certainly difficult and costly. Denis ________________________________ vit esse estrany ☣ spir.wikidot.com

Denis, if you're going to post to python-ideas, would you mind taking that biohazard symbol out of your user name? My Emacs-based mail reader thrashes for quite a while trying to find a glyph for it before it gives up and renders it as a hollow rectangle. I'd normally just add you to my kill-file, but I hate to give up on python-ideas people that fast. I'm sure you're not really a biohazard :-). Bill

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:
The main point is certainly that the limitation of "hashability" disappears... Everything can be used as key, if actually the ref is hashed instead of the value. Denis ________________________________ vit esse estrany ☣ spir.wikidot.com

What is the use case for such a collection? (This will help answer questions about how stuff like comparing two identity dicts or an identity dict and another mapping will work.) By default, objects __eq__ and __hash__ methods allow them to be used in dicts by identity. When I define __eq__ to mean something else, I find that I don't typically have something I can really store in a dict with the behavior I want. Should an ID Dict really be a weak ref identity dict? The one time I wanted an ID dict is when I was writing a memoization routine for methods, where I wanted to store the caches for each object. (Incidentally, I ended up refactoring not to require this anyhow.) Is this the main basic use case? Am I missing something? Best regards, Mike On Sat, May 29, 2010 at 10:27 PM, Benjamin Peterson <benjamin@python.org> wrote:

Benjamin Peterson wrote:
Are you sure this is a good idea for the stdlib ? id(obj) gives you the current storage address of an object in CPython and these can be reused over time. Using a key object to such a dict would not secure it's persistence throughout the usage lifetime in that dict, since d[id(key)] = 1 doesn't increase the refcount of key. Such reuse is bound to happen often for some Python object types, due to the free-lists we're using in CPython and the Python allocator which is optimized for reuse of fixed-size memory chunks. For experts, such a dictionary may have some value (e.g. to work around the hash() requirement of normal dictionaries), but I feel that newbies would have a hard time understanding all the implications. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 30 2010)
2010-07-19: EuroPython 2010, Birmingham, UK 49 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On 2010-05-30, at 11:07 , Lie Ryan wrote:
Actually, I'd say the exact opposite: CPython's identity being cheap is an implementation detail, not the other way around, and that is what shouldn't be relied on. In that light, a special-purpose identity dictionary independent from implementation-specific and artificially low id() computation costs is a pretty good idea.

On 05/30/10 20:35, Masklinn wrote:
the performance of any Python statement/function is implementation detail. Whether any statement/function is fast or cheap shouldn't be relied on. IMHO, adding a brand new collection module just because a Python implementation has a fast/cheap operations isn't a good reason.

Lie Ryan <lie.1296@...> writes:
it may be useful, but if you can emulate it relatively easily with regular dict, why bother with a new type?
Why defaultdict then? It can be implemented just fine with dict.setdefault. identitydict avoids the unnatural method of having the key and value as value of the dictionary, resulting in contortions everytime the dictionary is manipulated.

On May 30, 2010, at 7:56 AM, Lie Ryan wrote:
. . .
it may be useful, but if you can emulate it relatively easily with regular dict, why bother with a new type?
I agree that this would be a waste. Also, I haven't seen much of a discussion of use cases. It's been possible to make identity dictionaries for a very long time, yet I can't remember the last time I saw one used. The Python language makes very few guarantees about object identity so it is mainly only usable in a few circumstances where an object is continually referenced by something outside the identity dict. I would like to see an identity dict posted an ASPN recipe to see if anyone actually uses it. Raymond

It appears that different people are talking about two different ideas of identity dict, with different properties and possible use cased. So the statement of some do not apply to the idea held by others. A. key object is not stored in dict. d[k] = o ==> _d[id(k)] = o Faster but limited in that cannot extract keys and user code must keep keys alive. B. key object stored in dict in one of at least two ways. 1. d[k] = o ==> _d[id(k)] = (k,o) 2. d[i] = o ==> i=id(k); _dk[i] = k; _do[i] = o Slower, but (I believe) can fully emulate dicts. Either type can be generalized to f instead of id as the key transform. current vote: -.3 I am also not yet convinced, but perhaps could be, that either type, with or without generalization should be in the stdlib. Instances of user class without custom equality are already compared by identity. The use cases for keying immutables by identify is pretty sparse. That pretty much leave mutables with custom equality (by value rather than identity). Terry Jan Reedy

On Mon, May 31, 2010 at 2:05 PM, Terry Reedy <tjreedy@udel.edu> wrote:
I'm -1 on the idea without a strong use case. I vaguely recall implementing one of these before but I think I was using it as a hacky weakrefdict. Looking in my libmisc.py for dict-alikes I see an OrderedDict (obsoleted), a ForgivingDict (obsoleted by defaultdict), a ProxyDict, and a DecorateDict. The ProxyDict can push/pop dicts and does lookups across all of them, most recent first, and performs sets in the most recent. The DecorateDict calls a function on the value before returning it. Django has classes with almost the exact same code (not contributed by me). Django: http://code.djangoproject.com/svn/django/trunk/django/utils/datastructures.p... Me: http://bazaar.launchpad.net/~odbrazen/leanlyn/trunk/annotate/head:/libmisc.p... -Jack

Raymond Hettinger <raymond.hettinger@...> writes:
Also, I haven't seen much of a discussion of use cases.
Here's a selection of use cases from PyPy's source (You can search for "identity_dict" to see its use): In a algorithm for breaking cycles in graphs: http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py Keeping track of all the allocated objects in a model of a low level runtime: http://codespeak.net/svn/pypy/trunk/pypy/rpython/lltypesystem/lltype.py Tracing the source of a certain kind of type as our type checker annotate RPython: http://codespeak.net/svn/pypy/trunk/pypy/annotation/bookkeeper.py Traversing the blocks of a function's graph: http://codespeak.net/svn/pypy/trunk/pypy/objspace/flow/model.py Essentially these are places where defined equality should not matter. I could also use it here: http://code.activestate.com/recipes/577242-calling-c-level-finalizers-withou...

On May 30, 2010, at 5:30 PM, Greg Ewing wrote:
If you don't have an external reference to the object, how are you ever going to look it up in the dictionary? Since we can attach values directly to many objects (using instance variables, function attributes, etc) or can keep them in a tuple, there appear to be remarkably few use cases for an identity dictionary. Also, there hasn't been much discussion of implementation, but unless you're willing to copy and paste most of the code in dictobject.c, you're going to end-up with something much slower than d[id(obj)]=value. Raymond

Raymond Hettinger <raymond.hettinger@...> writes:
It can be implemented simply in Python: http://codespeak.net/svn/pypy/trunk/pypy/lib/identity_dict.py

On May 31, 2010, at 5:45 PM, Benjamin Peterson wrote:
That code is pretty much what I expected. In CPython, it is dramatically slower than using a regular dictionary with d[id(obj)]=value. In PyPy, it makes sense because the code gets optimized as if it were hand coded in C. IOW, identity_dict.py doesn't make much sense for other implementations.
This is code that doesn't require or benefit from using an identity dictionary. Regular dicts work just fine here. And since, identity-implies-equality for regular CPython dicts, you already get excellent performance (i.e. the __eq__ methods never get called when the object identities already match).
Keeping track of all the allocated objects in a model of a low level runtime: http://codespeak.net/svn/pypy/trunk/pypy/rpython/lltypesystem/lltype.py
This is a ton of code and I can't easily tell what it is doing or comment on it.
Tracing the source of a certain kind of type as our type checker annotate RPython: http://codespeak.net/svn/pypy/trunk/pypy/annotation/bookkeeper.py
Looks to be another case where a regular dict works just fine.
Traversing the blocks of a function's graph: http://codespeak.net/svn/pypy/trunk/pypy/objspace/flow/model.py
This code also works fine with a regular dictionary or a regular python set. If you used the identity_dict.py code mentioned above, it would just slow down the code. This isn't really even a dictionary use case, a set would be a better choice.
Essentially these are places where defined equality should not matter.
Essentially, these are cases where an identity dictionary isn't necessary and would in-fact be worse performance-wise in every implementation except for PyPy which can compile the pure python code for indentity_dict.py. Since instances have a default hash equal to the id and since identity-implies-equality for dictionary keys, we already have a dictionary that handles these cases. You don't even have to type: d[id(k)]=value, it would suffice to write: d[k]=value. Sorry, but I think this idea is a total waste. Perhaps post it as a recipe, but it doesn't make sense to try to inject it into the standard library. Raymond

On 2010-06-01 03:23, Raymond Hettinger wrote:
On May 31, 2010, at 5:45 PM, Benjamin Peterson wrote:
Essentially these are places where defined equality should not matter.
"should not matter" is the important part here. It might have been clearer to say "should be ignored" instead. I think Raymond is misunderstanding it.
It is necessary, because the objects involved might define their own __hash__ and __cmp__/__eq__, and these should *not* be used.
I don't think it is a total waste, but I have seen two ideas in this thread that I find more generally useful. One is "collections.keyfuncdict", which could be trivially used as an identitydict. The other is a WeakIdentityDict, which is a WeakKeyDict that uses only the identity of the keys for hashing/equality. These two are independent, one cannot be used to implement the other (unless collections.keyfuncdict grows an option to not keep strong refs to the keys, perhaps by providing the inverse keyfunc instead). Anyway, +0.1 on identitydict and +1 on each of collection.keyfuncdict and WeakIdentityDict. - Jacob

On Sun, May 30, 2010 at 12:07 PM, Lie Ryan <lie.1296@gmail.com> wrote:
My 2 bits: 1. I implemented and used such a dict a few years ago, as part of a graph/tree algorithm. While I don't have access to the source anymore, I mainly wanted to bring up an example for the usage of such a data structure. Still, I haven't needed it since, so take from it what you will.
what's wrong with dict[id(key)] = foo?
2. Mostly that you want other operators to work as well. In your example, "key in dic" will return False, as __contains__ is the standard implementation. Cheers, Imri -- Imri Goldberg -------------------------------------- http://plnnr.com/ - automatic trip planning http://www.algorithm.co.il/blogs/ -------------------------------------- -- insert signature here ----

On May 31, 2010, at 7:31 PM, Benjamin Peterson wrote:
Benjamin, could you elaborate of several points that are unclear: * If id() is expensive in PyPy, then how are they helped by the code in http://codespeak.net/svn/pypy/trunk/pypy/lib/identity_dict.py which uses id() for the gets and sets and contains? * In the examples you posted (such as http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py ), it appears that PyPy already has an identity dict, so how are they helped by adding one to the collections module? * Most of the posted examples already work with regular dicts (which check identity before they check equality) -- don't the other implementations already implement regular dicts which need to have identity-implied-equality in order to pass the test suite? I would expect the following snippet to work under all versions and implementations of Python: >>> class A: ... pass >>> a = A() >>> d = {a: 10} >>> assert d[a] == 10 # uses a's identity for lookup * Is the proposal something needed for all implementations or is it just an optimization for a particular, non-CPython implementation? Raymond

Raymond Hettinger <raymond.hettinger@...> writes:
At the top of that file, it imports from the special module __pypy__ which contains an optimized version of the dict.
* In the examples you posted (such
as http://codespeak.net/svn/pypy/trunk/pypy/tool/algo/graphlib.py ),
it appears that PyPy already has an identity dict, so how are they helped by adding one to the collections module?
My purpose with those examples was to prove it as a generally useful utility.
* Most of the posted examples already work with regular dicts (which check
identity before they check equality) -- don't the other implementations already implement regular dicts which need to have identity-implied-equality in order to pass the test suite? I would expect the following snippet to work under all versions and implementations of Python:
Yes, but that would be different if you have two "a"s with __eq__ defined to be equal and you want to hash them separately.
* Is the proposal something needed for all implementations or is it just an
optimization for a particular, non-CPython implementation? My contention is that an identity dictionary or at least a dictionary with custom hash and keys is a useful primitive that should be in the standard library. However, I also see its advantage in avoiding bad performance of id() based identity dicts in non-CPython implementations. It is useful to let the implementation optimize it any time there is moving GC as in Jython and IronPython where id also is expensive. (Basically a mapping has to be maintained for all objects on which id is called.)

2010/6/1 Benjamin Peterson <benjamin@python.org>:
Here is how I designed this for my language: You can request the ObjectId of the given object. If an ObjectId corresponding to the given object is still alive, you always get it back again, but it can be GC'ed and later created afresh. ObjectIds are hashable and comparable (with an arbitrary ordering). Hash values and the ordering are preserved when ObjectIds are kept alive, but they may be different if ObjectIds are created afresh. An ObjectId contains an integer index which is unique among ObjectIds being alive at the same time. You can make a dictionary with a specified key function. It is internally backed by something equivalent to f(k) -> (k, v) dict. A dictionary with ObjectId constructor as the key is an identity dictionary; it works because it keeps both k and f(k) alive. An advantage of this scheme is that with a moving GC the id mapping must be maintained only for objects for which the program keeps their ObjectIds alive. A disadvantage is that the program must be careful to not use ObjectIds in a manner which does not keep them alive yet expects consistent hashing and ordering. In particular a key-mapped dict which would store only (k, v) pairs and compute f(k) on the fly would not work. Also ObjectIds cannot be used to generate printable unique identifiers which would be valid without having to keep ObjectIds alive, like in Python's default repr. -- Marcin Kowalczyk

None of the presented examples take advantage of that property. All of them work with regular dictionaries. This proposal is still use case challenged. AFAICT from code searches, the idea of needing to override an existing __eq__ with an identity-only comparison seems to never come up. It would not even be popular as an ASPN recipe. Moreover, I think that including it in the standard library would be harmful. The language makes very few guarantees about object identity. In most cases a user would far better off using a regular dictionary. If a rare case arose where __eq__ needed to be overridden with an identity-only check, it is not hard to write d[id(obj)]=value. Strong -1 on including this in the standard library. Raymond P.S. ISTM that including subtly different variations of a data type does more harm than good. Understanding how to use an identity dictionary correctly requires understanding the nuances of object identity, how to keep the object alive outside the dictionary (even if the dictionary keeps it alive, a user still needs an external reference to be able to do a lookup), and knowing that the version proposed for CPython has dramatically worse speed/space performance than a regular dictionary. The very existence of an identity dictionary in collections is likely to distract a user away from a better solution using: d[id(obj)]=value.

Raymond Hettinger <raymond.hettinger@...> writes:
Besides that ASPN recipe of mine I mentioned before [1], here are some more examples: - copy.deepcopy() and pickle use it for an object memo. - keeping track of protocol versions in Twisted. [2] - memo in a serialization protocol. [3] [1] http://code.activestate.com/recipes/577242-calling-c-level-finalizers- without-__del__/ [2] http://twistedmatrix.com/trac/browser/trunk/twisted/persisted/ styles.py [3] http://twistedmatrix.com/trac/browser/trunk/twisted/spread/jelly.py
On the other hand, d[id(obj)] can be dangerous and incorrect on CPython:
Strong -1 on including this in the standard library.
How do you feel about Antoine's keyfuncdict proposal?
We're all adults here. We provide WeakKeyedDict and friends even though they rely on unpredictable and subtle garbage collection.
I would argue that that's not a better solution given the above example. Anyone using id(obj) would have understand the nuances of object identity perhaps more than a real identity dictionary.

On Jun 2, 2010, at 10:38 AM, Benjamin Peterson wrote:
Strong -1 on including this in the standard library.
How do you feel about Antoine's keyfuncdict proposal?
His proposal is more interesting because it is more general. For example, a keyfuncdict may make it easy to implement a case-insensitive dictionary. I would like to see a concrete implementation to see what design choices it makes. Raymond

On Jun 2, 2010, at 9:37 AM, Raymond Hettinger wrote:
Using id() is a workaround but again, a potentially expensive one for platforms with moving GCs. Every object calling for an id() forces additional bookkeeping on their ends. This is only a better solution for CPython. Whereas abstracting this out into an identitydict type gives all platforms the chance to provide their own optimized versions.
No, the default hash backed by id is a CPython implementation detail. Another use case is just the fact that Python allows you to completely change the semantics of __eq__ (and for good reason). Though this is rare, take SQLAlchemy's SQL expression DSL for example, that has it generate a where clause: table.select(table.c.id == 4) # table.c.id == 4 returns a "<sql statement> where id == 4" object I don't see how a platform like Jython can provide an optimized identitydict that avoids id() calls via keyfuncdict(key=id). The keys() of said dict would need to be actual results of id() calls. I'm +1 on an identitydict as long as its CPython implementation doesn't provide worse performance than the id workaround. -- Philip Jenvey

To be clear, most the examples given so far work with regular dictionaries even without using id(). The exception was something system specific such as the pickling mechanism. So what we're talking about is the comparatively rare case when an object has an __eq__ method and you want that method to be ignored. For example, you have two tuples (3,5) and (3,5) which are equal but happen to be distinct in memory and your needs are: * to treat the two equal objects as being distinct for some purpose * to run faster than id() runs on non-CPython implementations * don't care if the code is dog slow on CPython (i.e. slower than if you had used id()) * don't care that the two tuples being distinct is memory is not a guaranteed behavior across implementations (i.e. any implementation is free to make all equal tuples share the same id via interning) FWIW, I spoke with Jim Baker about this yesterday and he believes that Jython has no need for an identity dict. Raymond P.S. If Antoine's keyfuncdict proposal gains traction, it would be possible for other implementations to create a fast special case for key=id.

On Thu, 3 Jun 2010 12:02:44 -0700 Philip Jenvey <pjenvey@underboss.org> wrote:
Well, CPython and all other implementations with a non-moving GC.
Whereas abstracting this out into an identitydict type gives all platforms the chance to provide their own optimized versions.
That's really premature optimization. Regards Antoine.

+1 - this has been useful to me in other languages. (apologies for top-post, sent from phone.) Michael Foord -- http://www.ironpythoninaction.com On 30 May 2010, at 04:27, Benjamin Peterson <benjamin@python.org> wrote:

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:
Perhaps it would be more useful to add a generic collections.keyfuncdict, taking a function which applied to a key gives the real key value used for lookups. Your identity dict would be created as: d = collections.keyfuncdict(id) But of course, you can just use a normal dict: d = {} d[id(key)] = key, value (actually, this could be how a collections.keyfuncdict gets implemented) Regards Antoine.

On Sun, 30 May 2010 14:27:05 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:
Just keyfuncdict(None) or keyfuncdict(id), then. I'm not proposing that the hash() and eq() function be passed, but really a factory function which is used for computing the actual lookup key when doing inserts and lookups (which then implicitly use the natural hash and eq functions for that lookup key, as a normal dict does). Regards Antoine.

On Sun, May 30, 2010 at 6:34 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
For what it's worth, my sorteddict class (part of the blist library) already supports exactly that syntax. The full signature is: sorteddict([key[, arg]], **kw) where "arg" and "kw" do exactly what they do for dict(), and "key" specifies a key function just as in your example. (Although now that I think about it, perhaps "arg" and "key" should be reversed to more closely match the signature of dict()) Example usage:
I'm not suggesting that Benjamin use sorteddict for his use-case, since keeping the keys sorted by id() is probably not useful. ;-) In some ways it would be nice if dict() itself accepted a "key" function as an argument, but I don't see a way to provide that functionality without adding another field to PyDictObject (yuck) or abusing the ma_lookup field (again, yuck). I favor adding the more general keyfuncdict over identitydict. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:
This is how Lua works, systematically. To make this work, all "primitive" values (non-table, since table is the single container type) are interned. Thus, for "values", equality and identity are equivalent. Tables instead are considered referenced objects, even when they are used as collections. So that Lua can quietly hash the ref instead of the value. * simplicity * performance * anything can be used as key The drawback is that any structured value, eg an x,y position stored in a table, is compared by reference. A lookup with a position used as key thus cannot return an item stored with an equal key:
For the scheme to really extend to any use case, there should be a kind of mark allowing the programmer to state whether a table is, conceptually, a value (simply happens to be structured) or a referenced thing. Value tables should then be interned like other values -- but this is certainly difficult and costly. Denis ________________________________ vit esse estrany ☣ spir.wikidot.com

Denis, if you're going to post to python-ideas, would you mind taking that biohazard symbol out of your user name? My Emacs-based mail reader thrashes for quite a while trying to find a glyph for it before it gives up and renders it as a hollow rectangle. I'd normally just add you to my kill-file, but I hate to give up on python-ideas people that fast. I'm sure you're not really a biohazard :-). Bill

On Sun, 30 May 2010 03:27:53 +0000 (UTC) Benjamin Peterson <benjamin@python.org> wrote:
The main point is certainly that the limitation of "hashability" disappears... Everything can be used as key, if actually the ref is hashed instead of the value. Denis ________________________________ vit esse estrany ☣ spir.wikidot.com

What is the use case for such a collection? (This will help answer questions about how stuff like comparing two identity dicts or an identity dict and another mapping will work.) By default, objects __eq__ and __hash__ methods allow them to be used in dicts by identity. When I define __eq__ to mean something else, I find that I don't typically have something I can really store in a dict with the behavior I want. Should an ID Dict really be a weak ref identity dict? The one time I wanted an ID dict is when I was writing a memoization routine for methods, where I wanted to store the caches for each object. (Incidentally, I ended up refactoring not to require this anyhow.) Is this the main basic use case? Am I missing something? Best regards, Mike On Sat, May 29, 2010 at 10:27 PM, Benjamin Peterson <benjamin@python.org> wrote:

Benjamin Peterson wrote:
Are you sure this is a good idea for the stdlib ? id(obj) gives you the current storage address of an object in CPython and these can be reused over time. Using a key object to such a dict would not secure it's persistence throughout the usage lifetime in that dict, since d[id(key)] = 1 doesn't increase the refcount of key. Such reuse is bound to happen often for some Python object types, due to the free-lists we're using in CPython and the Python allocator which is optimized for reuse of fixed-size memory chunks. For experts, such a dictionary may have some value (e.g. to work around the hash() requirement of normal dictionaries), but I feel that newbies would have a hard time understanding all the implications. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 30 2010)
2010-07-19: EuroPython 2010, Birmingham, UK 49 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
participants (20)
-
Antoine Pitrou
-
Benjamin Peterson
-
Bill Janssen
-
Daniel Stutzbach
-
Greg Ewing
-
Imri Goldberg
-
Jack Diederich
-
Jacob Holm
-
Lie Ryan
-
M.-A. Lemburg
-
Marcin 'Qrczak' Kowalczyk
-
Masklinn
-
Mathias Panzenböck
-
Michael
-
Mike Graham
-
Nick Coghlan
-
Philip Jenvey
-
Raymond Hettinger
-
spir ☣
-
Terry Reedy