Identity dicts and sets

I propose to add new standard collection types: IdentityDict and IdentitySet. They are almost same as ordinal dict and set, but uses identity check instead of equality check (and id() or hash(id()) as a hash). They will be useful for pickling, for implementing __sizeof__() for compound types, and for other graph algorithms. Of course, they can be implemented using ordinal dicts: IdentityDict: key -> value as a dict: id(key) -> (key, value) IdentitySet as a dict: id(value) -> value However implementing them directly in the core has advantages, it consumes less memory and time, and more comfortable for use from C. IdentityDict and IdentitySet implementations will share almost all code with implementations of ordinal dict and set, only lookup function and metainformation will be different. However dict and set already use a lookup function overloading.

On Wed, Jan 2, 2013 at 6:01 AM, Serhiy Storchaka <storchaka@gmail.com>wrote:
I agree that the data structures may be useful, but is there no way to some allow the customization of existing data structures instead, without losing performance? It's a shame to have another kind of dict just for this purpose. Eli

On 1/2/2013 9:01 AM, Serhiy Storchaka wrote:
I propose to add new standard collection types: IdentityDict and IdentitySet. They are almost same as ordinal dict and set, but uses
What do you mean by ordinal dict, as opposed to plain dict.
identity check instead of equality check (and id() or hash(id()) as a
By default, equality check is identity check.
hash). They will be useful for pickling, for implementing __sizeof__() for compound types, and for other graph algorithms.
I don't know anything about pickling or __sizeof__, by if one uses user-defined classes for nodes and edges, equality is identity, so I don't see what would be gained. The disadvantage of multiple minor variations on dict is confusion among users as to specific properties and use cases. -- Terry Jan Reedy

On Thu, Jan 3, 2013 at 8:48 AM, Terry Reedy <tjreedy@udel.edu> wrote:
I assumed Serhiy meant OrderedDict.
identity check instead of equality check (and id() or hash(id()) as a
By default, equality check is identity check.
The point of an IdentityDict/Set is for it to be keyed by id rather than value for *all* objects, rather than just those with the default equality comparison. This can be important in some use cases: 1. It's more correct for caching. For example, "0 + 0" should give "0", while "0.0 + 0.0" should give "0.0". An identity based cache will get this right, a value based cache will get it wrong (functools.lru_cache actually splits the difference and goes with a type+value based cache rather than a simple value based cache) 2. It effectively allows you to add additional state to both mutable and immutable objects (by storing the extra state in an identity keyed dictionary). However, one important problem with this kind of data structure is that it is *very* easy to get into lifecycle problems if you don't store at least a weak reference to a real key (since id's may be recycled after an object is destroyed, as shown here:
The disadvantage of multiple minor variations on dict is confusion among users as to specific properties and use cases.
Indeed. As noted elsewhere, we already have a nasty composition problem between __missing__, order preservation and weak referencing. Adding a key function override into that mix suggests that a hashmap factory API might be a better option than continuing the proliferation of slightly different mapping types. (Guido's fears of an explosion in subtly different container types in the standard library once the collections module was added have proved to be well founded). So, -1 from me on making the composition problem worse, but tentative +0 on an API that addresses the composition problem and also includes "key=func" style support for using a decorated value in the lookup step. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Thu, 3 Jan 2013 13:37:40 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
I'm quite sure Serhiy meant ordinary dict.
As noted elsewhere, we already have a nasty composition problem between __missing__, order preservation and weak referencing.
Aren't you dramatizing a bit? I haven't seen anyone ask for an ordered weak dict, or a weak dict with default values.
Well, IdentityDict addresses an actual use case. I don't think a defaultidentitydict addresses any use case. Regards Antoine.

On 03.01.13 05:37, Nick Coghlan wrote:
I assumed Serhiy meant OrderedDict.
Sorry to have confused you. I meant "ordinary dict".
This is not a use case. Two "0" are same key in CPython, but two "1000" or two "0.0" are not. There is yet one "wart" (as in any other language which has identity maps).
Of course, identity dict and set should got an ownership on its keys and values, as all other non-weak collections. Except lookup function they don't differ from their ordinary counterparts.
Indeed. As noted elsewhere, we already have a nasty composition problem between __missing__, order preservation and weak referencing.
I doubt if all combinations have a sense.

On 3 January 2013 12:09, Serhiy Storchaka <storchaka@gmail.com> wrote:
I think what Nick means is that if you implement this naively then you don't hold references to the keys: class IdentityDict(dict): def __setitem__(self, key, val): dict.__setitem__(self, id(key), val) # No reference to key held when this function ends ... A way to fix this is to store both objects in the value (with corresponding changes to __getitem__ etc.): class IdentityDict(dict): def __setitem__(self, key, val): dict.__setitem__(self, id(key), (key, val)) Oscar

Am 03.01.2013 04:37, schrieb Nick Coghlan:
Do you mean +0.0 or -0.0? IEEE 754 zeros are always signed although +0.0 is equal to -0.0. And NaNs are always unequal to all NaNs, even to itself. For floats we would need a type specified dict that handles special values correctly ... Can of worms?

On 03.01.13 00:48, Terry Reedy wrote:
What do you mean by ordinal dict, as opposed to plain dict.
Sorry to have confused you. I mean "ordinary dict", same as "plain dict".
If one uses a list, a dict, or user-defined class with defined __eq__, equality is not identity. Yes, you can use an identity dict with mutable types!

On 1/3/2013 6:51 AM, Serhiy Storchaka wrote:
On 03.01.13 00:48, Terry Reedy wrote:
In my original, the following quote is preceded by the following snipped line. "By default, equality check is identity check."
I don't know anything about pickling or __sizeof__, by if one uses user-defined classes for nodes and edges, equality is identity,
In that context, I pretty obviously meant user-defined class with the default equality as identity. The contrapositive is "If equality is not identity, one is not using a user-defined class with default identity."
If one uses a list, a dict, or user-defined class with defined __eq__, equality is not identity.
Yes, these are are examples of 'not a user-defined class with default identity' in which equality is not identity. I thought it was clear that I know about such things.
Yes, you can use an identity dict with mutable types!
Yes, and my point was that we effectively already have such things. class Node(): pass Instances of Node wrap a dict as .__dict__, but are compared by wrapper identity rather than dict value. A set of such things is effectively an identity set. In 3.3+, if the instances all have the same attributes (if the .__dicts__ all have the same keys), there is only one (not too sparse) hashed list of keys for all instances and one corresponding (not too sparse) list of values for each instance. Also, which I did not say before: if one instead represents nodes by a unique integer or string or by a list that starts with such a unique identifier, then equality is again effectively identity and a set (or sequence) of such things is effectively an identity set. This corresponds to a standard database table where records have keys, so that the identity of records is not lost when reordered or removed from the table.
so I don't see what would be gained.
You are proposing (yet-another) dict variation for use in *python* code. That requires more justification than a few percent speedup in specialized usages. It should make python programming substantially easier in multiple use cases. I do not yet see this in regard to graph algorithm. -- Terry Jan Reedy

On 03.01.13 20:32, Terry Reedy wrote:
Not always you can choose node type. Sometimes nodes already exists and you should just work with them.
so I don't see what would be gained.
You are proposing (yet-another) dict variation for use in *python* code.
In fact I think first of all about C code. Now using identity dict/set idiom is rather cumbersome in C code. With standard IdentityDict it should be so simple as using an ordinary dict.
Identity dict/set idiom used at least in followed stdlib modules: _threading_local, xmlrpc.client, json, lib2to3 (xrange fixer), copy, unittest.mock, idlelib (rpc, remote debugger and browser), ctypes, doctest, pickle, cProfile. May be it is implicitly used in some other places or can be used.

On Wed, Jan 2, 2013 at 6:01 AM, Serhiy Storchaka <storchaka@gmail.com>wrote:
I agree that the data structures may be useful, but is there no way to some allow the customization of existing data structures instead, without losing performance? It's a shame to have another kind of dict just for this purpose. Eli

On 1/2/2013 9:01 AM, Serhiy Storchaka wrote:
I propose to add new standard collection types: IdentityDict and IdentitySet. They are almost same as ordinal dict and set, but uses
What do you mean by ordinal dict, as opposed to plain dict.
identity check instead of equality check (and id() or hash(id()) as a
By default, equality check is identity check.
hash). They will be useful for pickling, for implementing __sizeof__() for compound types, and for other graph algorithms.
I don't know anything about pickling or __sizeof__, by if one uses user-defined classes for nodes and edges, equality is identity, so I don't see what would be gained. The disadvantage of multiple minor variations on dict is confusion among users as to specific properties and use cases. -- Terry Jan Reedy

On Thu, Jan 3, 2013 at 8:48 AM, Terry Reedy <tjreedy@udel.edu> wrote:
I assumed Serhiy meant OrderedDict.
identity check instead of equality check (and id() or hash(id()) as a
By default, equality check is identity check.
The point of an IdentityDict/Set is for it to be keyed by id rather than value for *all* objects, rather than just those with the default equality comparison. This can be important in some use cases: 1. It's more correct for caching. For example, "0 + 0" should give "0", while "0.0 + 0.0" should give "0.0". An identity based cache will get this right, a value based cache will get it wrong (functools.lru_cache actually splits the difference and goes with a type+value based cache rather than a simple value based cache) 2. It effectively allows you to add additional state to both mutable and immutable objects (by storing the extra state in an identity keyed dictionary). However, one important problem with this kind of data structure is that it is *very* easy to get into lifecycle problems if you don't store at least a weak reference to a real key (since id's may be recycled after an object is destroyed, as shown here:
The disadvantage of multiple minor variations on dict is confusion among users as to specific properties and use cases.
Indeed. As noted elsewhere, we already have a nasty composition problem between __missing__, order preservation and weak referencing. Adding a key function override into that mix suggests that a hashmap factory API might be a better option than continuing the proliferation of slightly different mapping types. (Guido's fears of an explosion in subtly different container types in the standard library once the collections module was added have proved to be well founded). So, -1 from me on making the composition problem worse, but tentative +0 on an API that addresses the composition problem and also includes "key=func" style support for using a decorated value in the lookup step. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Thu, 3 Jan 2013 13:37:40 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
I'm quite sure Serhiy meant ordinary dict.
As noted elsewhere, we already have a nasty composition problem between __missing__, order preservation and weak referencing.
Aren't you dramatizing a bit? I haven't seen anyone ask for an ordered weak dict, or a weak dict with default values.
Well, IdentityDict addresses an actual use case. I don't think a defaultidentitydict addresses any use case. Regards Antoine.

On 03.01.13 05:37, Nick Coghlan wrote:
I assumed Serhiy meant OrderedDict.
Sorry to have confused you. I meant "ordinary dict".
This is not a use case. Two "0" are same key in CPython, but two "1000" or two "0.0" are not. There is yet one "wart" (as in any other language which has identity maps).
Of course, identity dict and set should got an ownership on its keys and values, as all other non-weak collections. Except lookup function they don't differ from their ordinary counterparts.
Indeed. As noted elsewhere, we already have a nasty composition problem between __missing__, order preservation and weak referencing.
I doubt if all combinations have a sense.

On 3 January 2013 12:09, Serhiy Storchaka <storchaka@gmail.com> wrote:
I think what Nick means is that if you implement this naively then you don't hold references to the keys: class IdentityDict(dict): def __setitem__(self, key, val): dict.__setitem__(self, id(key), val) # No reference to key held when this function ends ... A way to fix this is to store both objects in the value (with corresponding changes to __getitem__ etc.): class IdentityDict(dict): def __setitem__(self, key, val): dict.__setitem__(self, id(key), (key, val)) Oscar

Am 03.01.2013 04:37, schrieb Nick Coghlan:
Do you mean +0.0 or -0.0? IEEE 754 zeros are always signed although +0.0 is equal to -0.0. And NaNs are always unequal to all NaNs, even to itself. For floats we would need a type specified dict that handles special values correctly ... Can of worms?

On 03.01.13 00:48, Terry Reedy wrote:
What do you mean by ordinal dict, as opposed to plain dict.
Sorry to have confused you. I mean "ordinary dict", same as "plain dict".
If one uses a list, a dict, or user-defined class with defined __eq__, equality is not identity. Yes, you can use an identity dict with mutable types!

On 1/3/2013 6:51 AM, Serhiy Storchaka wrote:
On 03.01.13 00:48, Terry Reedy wrote:
In my original, the following quote is preceded by the following snipped line. "By default, equality check is identity check."
I don't know anything about pickling or __sizeof__, by if one uses user-defined classes for nodes and edges, equality is identity,
In that context, I pretty obviously meant user-defined class with the default equality as identity. The contrapositive is "If equality is not identity, one is not using a user-defined class with default identity."
If one uses a list, a dict, or user-defined class with defined __eq__, equality is not identity.
Yes, these are are examples of 'not a user-defined class with default identity' in which equality is not identity. I thought it was clear that I know about such things.
Yes, you can use an identity dict with mutable types!
Yes, and my point was that we effectively already have such things. class Node(): pass Instances of Node wrap a dict as .__dict__, but are compared by wrapper identity rather than dict value. A set of such things is effectively an identity set. In 3.3+, if the instances all have the same attributes (if the .__dicts__ all have the same keys), there is only one (not too sparse) hashed list of keys for all instances and one corresponding (not too sparse) list of values for each instance. Also, which I did not say before: if one instead represents nodes by a unique integer or string or by a list that starts with such a unique identifier, then equality is again effectively identity and a set (or sequence) of such things is effectively an identity set. This corresponds to a standard database table where records have keys, so that the identity of records is not lost when reordered or removed from the table.
so I don't see what would be gained.
You are proposing (yet-another) dict variation for use in *python* code. That requires more justification than a few percent speedup in specialized usages. It should make python programming substantially easier in multiple use cases. I do not yet see this in regard to graph algorithm. -- Terry Jan Reedy

On 03.01.13 20:32, Terry Reedy wrote:
Not always you can choose node type. Sometimes nodes already exists and you should just work with them.
so I don't see what would be gained.
You are proposing (yet-another) dict variation for use in *python* code.
In fact I think first of all about C code. Now using identity dict/set idiom is rather cumbersome in C code. With standard IdentityDict it should be so simple as using an ordinary dict.
Identity dict/set idiom used at least in followed stdlib modules: _threading_local, xmlrpc.client, json, lib2to3 (xrange fixer), copy, unittest.mock, idlelib (rpc, remote debugger and browser), ctypes, doctest, pickle, cProfile. May be it is implicitly used in some other places or can be used.
participants (7)
-
Antoine Pitrou
-
Christian Heimes
-
Eli Bendersky
-
Nick Coghlan
-
Oscar Benjamin
-
Serhiy Storchaka
-
Terry Reedy