Mailman 3 gc ideas -- sparse memory - Python-Dev

gc ideas -- sparse memory

older
Re: [Python-Dev] [Python-checkins]...

Dima Tisnek

3 Dec 2010 3 Dec '10

4:55 p.m.

How hard or reasonable would it be to free memory pages on OS level? [pcmiiw] Gabage collection within a generation involves moving live objects to compact the generation storage. This separates the memory region into 2 parts "live" and "cleared", the pointer to the beginning of the "cleared" part is where next allocation is going to happen. When this is done, does Python gc move the objects preserving order or does it try to populate garbaged slot with some live object disregarding order? Earlier case is more applicable, but latter case is a target for below too. If we were to look at memory regions from the OS point of view, they are allocated as pages (or possibly as hugetlb pages). So if we are to compact something like this [LL__][_L__][____][L_LL][LFFF] where [] is a page, L is live object and _ is garbage and F is free memory, would it not make more sense to tell OS that [____] is not needed anymore, and not move some of the consequtive [L_LL][LFFF] at all, or at least not move those objects as far down the memory region? This would of course have a certain overhead of tracking which pages are given back to OS and mapping them back when needed, but at the same time, be beneficial because fewer objets are moved and also possibly improve cpu cache performance because objects won't be moved so far out. p.s. if someone has an athoritative link to modern python gc design, please let me know.

Show replies by thread

Terry Reedy

3 Dec 3 Dec

5:04 p.m.

On 12/3/2010 5:55 PM, Dima Tisnek wrote:

...

How hard or reasonable would it be to free memory pages on OS level?

[pcmiiw] Gabage collection within a generation involves moving live objects to compact the generation storage. This separates the memory region into 2 parts "live" and "cleared", the pointer to the beginning of the "cleared" part is where next allocation is going to happen.

When this is done, does Python gc move the objects preserving order or does it try to populate garbaged slot with some live object disregarding order? Earlier case is more applicable, but latter case is a target for below too.

If we were to look at memory regions from the OS point of view, they are allocated as pages (or possibly as hugetlb pages). So if we are to compact something like this [LL__][_L__][____][L_LL][LFFF] where [] is a page, L is live object and _ is garbage and F is free memory, would it not make more sense to tell OS that [____] is not needed anymore, and not move some of the consequtive [L_LL][LFFF] at all, or at least not move those objects as far down the memory region?

This would of course have a certain overhead of tracking which pages are given back to OS and mapping them back when needed, but at the same time, be beneficial because fewer objets are moved and also possibly improve cpu cache performance because objects won't be moved so far out.

p.s. if someone has an athoritative link to modern python gc design, please let me know.

gc is implementation specific. CPython uses ref counting + cycle gc. A constraint on all implementations is that objects have a fixed, unique id during their lifetime. CPython uses the address as the id, so it cannot move objects. Other implementations do differently. Compacting gc requires an id to current address table or something. -- Terry Jan Reedy

James Y Knight

5:15 p.m.

On Dec 3, 2010, at 6:04 PM, Terry Reedy wrote:

...

gc is implementation specific. CPython uses ref counting + cycle gc. A constraint on all implementations is that objects have a fixed, unique id during their lifetime. CPython uses the address as the id, so it cannot move objects. Other implementations do differently. Compacting gc requires an id to current address table or something.

It's somewhat unfortuante that python has this constraint, instead of the looser: "objects have a fixed id during their lifetime", which is much easier to implement, and practically as useful. James

"Martin v. Löwis"

5:15 p.m.

Am 03.12.2010 23:55, schrieb Dima Tisnek:

...

How hard or reasonable would it be to free memory pages on OS level?

Very easy. Python already does that.

...

[pcmiiw] Gabage collection within a generation involves moving live objects to compact the generation storage. This separates the memory region into 2 parts "live" and "cleared", the pointer to the beginning of the "cleared" part is where next allocation is going to happen.

I think you are talking about copying collectors here. This is not how Python's garbage collection works.

...

When this is done, does Python gc move the objects preserving order or does it try to populate garbaged slot with some live object disregarding order? Earlier case is more applicable, but latter case is a target for below too.

(C)Python's garbage collector is not moving objects at all.

...

If we were to look at memory regions from the OS point of view, they are allocated as pages (or possibly as hugetlb pages). So if we are to compact something like this [LL__][_L__][____][L_LL][LFFF] where [] is a page, L is live object and _ is garbage and F is free memory, would it not make more sense to tell OS that [____] is not needed anymore, and not move some of the consequtive [L_LL][LFFF] at all, or at least not move those objects as far down the memory region?

See above. Python does no moving of objects whatsoever. Regards, Martin

Dima Tisnek

5:19 p.m.

Oh my bad, I must've confused python with some research paper. Unique id is not so hard to make without an address. While on this topic, what is the real need for unique ids? Also I reckon not all objects need a unique id like this, e.g. interned strings, simple data types and hashable and comparable objects could perhaps survive without unique id? On 3 December 2010 16:04, Terry Reedy wrote:

...

On 12/3/2010 5:55 PM, Dima Tisnek wrote:

...
How hard or reasonable would it be to free memory pages on OS level?

[pcmiiw] Gabage collection within a generation involves moving live objects to compact the generation storage. This separates the memory region into 2 parts "live" and "cleared", the pointer to the beginning of the "cleared" part is where next allocation is going to happen.

When this is done, does Python gc move the objects preserving order or does it try to populate garbaged slot with some live object disregarding order? Earlier case is more applicable, but latter case is a target for below too.

If we were to look at memory regions from the OS point of view, they are allocated as pages (or possibly as hugetlb pages). So if we are to compact something like this [LL__][_L__][____][L_LL][LFFF] where [] is a page, L is live object and _ is garbage and F is free memory, would it not make more sense to tell OS that [____] is not needed anymore, and not move some of the consequtive [L_LL][LFFF] at all, or at least not move those objects as far down the memory region?

This would of course have a certain overhead of tracking which pages are given back to OS and mapping them back when needed, but at the same time, be beneficial because fewer objets are moved and also possibly improve cpu cache performance because objects won't be moved so far out.

p.s. if someone has an athoritative link to modern python gc design, please let me know.

gc is implementation specific. CPython uses ref counting + cycle gc. A constraint on all implementations is that objects have a fixed, unique id during their lifetime. CPython uses the address as the id, so it cannot move objects. Other implementations do differently. Compacting gc requires an id to current address table or something.

-- Terry Jan Reedy

_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/dimaqq%40gmail.com

"Martin v. Löwis"

5:45 p.m.

...

Oh my bad, I must've confused python with some research paper. Unique id is not so hard to make without an address.

While on this topic, what is the real need for unique ids?

They are absolutely needed for mutable objects. For immutable ones, it would be ok to claim that they are identical if they are equal (assuming they support equality - which is tricky for things like NaN). Of course, the C API has lots of assumptions that identity and address are really the same thing. Regards, Martin

Terry Reedy

6:05 p.m.

On 12/3/2010 6:15 PM, James Y Knight wrote:

...

On Dec 3, 2010, at 6:04 PM, Terry Reedy wrote:

...
gc is implementation specific. CPython uses ref counting + cycle gc. A constraint on all implementations is that objects have a fixed, unique id during their lifetime. CPython uses the address as the id, so it cannot move objects. Other implementations do differently. Compacting gc requires an id to current address table or something.

I left out that the id must be an int.

...

It's somewhat unfortuante that python has this constraint, instead of the looser: "objects have a fixed id during their lifetime", which is much easier to implement, and practically as useful.

Given that the only different between 'fixed and unique' and 'fixed' is the uniqueness part, I do not understand 'practically as useful'. Duplicate ids (in the extreme, that same for all objects) hardly seem useful at all. -- Terry Jan Reedy

James Y Knight

6:46 p.m.

On Dec 3, 2010, at 7:05 PM, Terry Reedy wrote:

...

I left out that the id must be an int.

...
It's somewhat unfortuante that python has this constraint, instead of the looser: "objects have a fixed id during their lifetime", which is much easier to implement, and practically as useful.

Given that the only different between 'fixed and unique' and 'fixed' is the uniqueness part, I do not understand 'practically as useful'. Duplicate ids (in the extreme, that same for all objects) hardly seem useful at all.

Sure they are. This is what Java provides you, for example. If you have fixed, but potentially non-unique ids (in Java you get this using "identityHashCode()"), you can still make an identity hashtable. You simply need to *also* check using "is" that the two objects really are the same one after finding the hash bin using id. It'd be a quality of implementation issue whether an implementation gives you the same value for every object. It should not, of course, if it wants programs to have reasonable performance. Same sort of thing as how __hash__ should not return 0 for everything. Besides identity-hashtables, the main other thing id gets used for is to have some identifier string in a printer (e.g. ""). There, it's generally "good enough" to use an id which is not guaranteed to be, but often is, unique. It works for Java... James

Terry Reedy

9:50 p.m.

On 12/3/2010 7:46 PM, James Y Knight wrote:

...

Sure they are. This is what Java provides you, for example. If you have fixed, but potentially non-unique ids (in Java you get this using "identityHashCode()"), you can still make an identity

I do not see the point of calling a (non-unique) hash value the identity

...

hashtable. You simply need to *also* check using "is" that the two

In Python, that unique isness is the identify. (a is b) == (id(a) == id(b)) by definition.

...

objects really are the same one after finding the hash bin using id.

by using the hash value, which is how Python dict operate. -- Terry Jan Reedy

Cameron Simpson

10:06 p.m.

On 03Dec2010 18:15, James Y Knight wrote: | On Dec 3, 2010, at 6:04 PM, Terry Reedy wrote: | > gc is implementation specific. CPython uses ref counting + cycle | > gc. A constraint on all implementations is that objects have a fixed, | > unique id during their lifetime. CPython uses the address as the id, so | > it cannot move objects. Other implementations do differently. Compacting | > gc requires an id to current address table or something. | | It's somewhat unfortuante that python has this constraint, instead of | the looser: "objects have a fixed id during their lifetime", which is | much easier to implement, and practically as useful. Python doesn't have the constraint you imagine; it _does_ have "objects have a fixed id during their lifetime". _CPython_ has this constraint because it uses the address as the id, which is free and needs no papping or extra storage. Of course, it removes certain freedoms from the GC etc as a side effect. -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ The number of cylinders for this disk is set to 364737. There is nothing wrong with that, but this is larger than 1024, [...] - fdisk on our new RAID 07oct2007 :-)

James Y Knight

10:34 p.m.

On Dec 3, 2010, at 10:50 PM, Terry Reedy wrote:

...

On 12/3/2010 7:46 PM, James Y Knight wrote:

...
Sure they are. This is what Java provides you, for example. If you have fixed, but potentially non-unique ids (in Java you get this using "identityHashCode()"), you can still make an identity

I do not see the point of calling a (non-unique) hash value the identity

My point was simply that a) it's an unfortunate constraint on potential GC implementations that objects need to have a fixed and unique id in Python, and b) that it's not actually necessary to have such a constraint (in the abstract sense of required; obviously it's a requirement upon Python *today*, due to existing code which depends upon that promise). Would you be happier if I had said "it's unfortunate that Python has an "id" function instead of an "identityHashValue" function? I suppose that's what I really meant. Python the language would not have been harmed had it had from the start an identityHashValue() function instead of an id() function. In the CPython implementation, it may even have had the exact same behavior, but would've allowed other implementations more flexibility. James

Terry Reedy

11:39 p.m.

On 12/3/2010 11:06 PM, Cameron Simpson wrote:

...

On 03Dec2010 18:15, James Y Knight wrote: | On Dec 3, 2010, at 6:04 PM, Terry Reedy wrote: |> gc is implementation specific. CPython uses ref counting + cycle |> gc. A constraint on all implementations is that objects have a fixed, |> unique id during their lifetime. CPython uses the address as the id, so |> it cannot move objects. Other implementations do differently. Compacting |> gc requires an id to current address table or something. | | It's somewhat unfortuante that python has this constraint, instead of | the looser: "objects have a fixed id during their lifetime", which is | much easier to implement, and practically as useful.

Python doesn't have the constraint you imagine; it _does_ have "objects have a fixed id during their lifetime".

"id(object) Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime." Of course, other implementations are free to change builtins, but code that depends on CPython's definitions will not run. -- Terry Jan Reedy

Dima Tisnek

11:47 p.m.

On 3 December 2010 16:45, "Martin v. Löwis" wrote:

...

...
Oh my bad, I must've confused python with some research paper. Unique id is not so hard to make without an address.

While on this topic, what is the real need for unique ids?

They are absolutely needed for mutable objects. For immutable ones, it would be ok to claim that they are identical if they are equal (assuming they support equality - which is tricky for things like NaN).

Indeed, but do ids really need to be unique and fixed at the same time? a is b # (if atomic) needs unique ids, but doesn't really need fixed ids a[b] # needs fixed hash, but not strictly a globally unique id I can imagine an implementaion of pickle for example that uses unique and fixed as a given to detect cycles, etc; but that would be implementation detail. It seems to me unique and fixed id implies that it is stored somewhere (with incref beforehands and decref afterwards), however a proper reference to an object could be used just as well. Am I still missing something?

...

Of course, the C API has lots of assumptions that identity and address are really the same thing.

Regards, Martin

Maciej Fijalkowski

4 Dec 4 Dec

12:21 a.m.

On Sat, Dec 4, 2010 at 6:34 AM, James Y Knight wrote:

...

On Dec 3, 2010, at 10:50 PM, Terry Reedy wrote:

...
On 12/3/2010 7:46 PM, James Y Knight wrote:

...
Sure they are. This is what Java provides you, for example. If you have fixed, but potentially non-unique ids (in Java you get this using "identityHashCode()"), you can still make an identity

I do not see the point of calling a (non-unique) hash value the identity

My point was simply that a) it's an unfortunate constraint on potential GC implementations that objects need to have a fixed and unique id in Python, and b) that it's not actually necessary to have such a constraint (in the abstract sense of required; obviously it's a requirement upon Python *today*, due to existing code which depends upon that promise).

Would you be happier if I had said "it's unfortunate that Python has an "id" function instead of an "identityHashValue" function? I suppose that's what I really meant. Python the language would not have been harmed had it had from the start an identityHashValue() function instead of an id() function. In the CPython implementation, it may even have had the exact same behavior, but would've allowed other implementations more flexibility.

James

I don't see how this related to moving vs non-moving GC. PyPy (and I believe IronPython and Java) all have fixed unique ids that are not necesarilly their addresses. The only problem is that id() computed that way is more costly performance-wise, but works.

Steven D'Aprano

12:48 a.m.

James Y Knight wrote:

...

On Dec 3, 2010, at 10:50 PM, Terry Reedy wrote:

...
On 12/3/2010 7:46 PM, James Y Knight wrote:

...
Sure they are. This is what Java provides you, for example. If you have fixed, but potentially non-unique ids (in Java you get this using "identityHashCode()"), you can still make an identity I do not see the point of calling a (non-unique) hash value the identity

My point was simply that a) it's an unfortunate constraint on potential GC implementations that objects need to have a fixed and unique id in Python, and b) that it's not actually necessary to have such a constraint (in the abstract sense of required; obviously it's a requirement upon Python *today*, due to existing code which depends upon that promise).

I'm afraid I don't follow you. Unless you're suggesting some sort of esoteric object system whereby objects *don't* have identity (e.g. where objects are emergent properties of some sort of distributed, non-localised "information"), any object naturally has an identity -- itself. It seems to me that an identify function must obey one constraint: * Two objects which exist simultaneously have the same identity if and only if they are the same object i.e. id(x) == id(y) if and only if x is y. Other than that, an implementation is free to make id() behave any way they like. CPython uses the memory location of the object, but Jython and IronPython use an incrementing counter which is never re-used for the lifetime of the Python process. CPython's implementation implies that objects may not be moved in memory, but that's not a language constraint, that's an implementation issue. It seems counter-productive to me to bother with an identity function which doesn't meet that constraint. If id(x) == id(y) implies nothing about x and y (they may, or may not, be the same object) then what's the point? Why would you bother using that function when you could just use x == y instead?

...

Would you be happier if I had said "it's unfortunate that Python has an "id" function instead of an "identityHashValue" function? I suppose that's what I really meant. Python the language would not have been harmed had it had from the start an identityHashValue() function instead of an id() function. In the CPython implementation, it may even have had the exact same behavior, but would've allowed other implementations more flexibility.

No, because I don't see what the point of identityHashValue or why you would ever bother to use it. -- Steven

"Martin v. Löwis"

2:43 a.m.

...

Am I still missing something?

Apparently. The hole C API would break if objects would move in memory. Since they have to stay at fixed addresses, it's easy enough to use the address as ID. There is no problem to be solved here. Regards, Martin

"Martin v. Löwis"

2:59 a.m.

...

I'm afraid I don't follow you. Unless you're suggesting some sort of esoteric object system whereby objects *don't* have identity (e.g. where objects are emergent properties of some sort of distributed, non-localised "information"), any object naturally has an identity -- itself.

Not in Java or C#. It is in these languages possible to determine whether to references refer to the same object. However, objects don't naturally have a distinct identification (be it an integer or something else). If you really want to associate unique numbers with objects in these languages, the common approach is to put them into an identity dictionary as keys.

...

It seems counter-productive to me to bother with an identity function which doesn't meet that constraint. If id(x) == id(y) implies nothing about x and y (they may, or may not, be the same object) then what's the point?

See James' explanation: it would be possible to use this as the foundation of an identity hash table.

...

Why would you bother using that function when you could just use x == y instead?

Because in a hash table, you also need a hash value. Regards, Martn

Steven D'Aprano

4:54 a.m.

Martin v. Löwis wrote:

...

...
I'm afraid I don't follow you. Unless you're suggesting some sort of esoteric object system whereby objects *don't* have identity (e.g. where objects are emergent properties of some sort of distributed, non-localised "information"), any object naturally has an identity -- itself.

Not in Java or C#. It is in these languages possible to determine whether to references refer to the same object. However, objects don't naturally have a distinct identification (be it an integer or something else).

Surely even in Java or C#, objects have an *identity* even if the language doesn't provide a way to query their distinct *identification*. An object stored at one memory location is distinct from any other object stored at different memory locations at that same moment in time, regardless of whether or not the language gives you a convenient label for that identity. Even if that memory location can change during the lifetime of the object, at any one moment, object X is a different object from every other object. The fact that we can even talk about "this object" versus "that object" implies that objects have identity. To put it in Python terms, if the id() function were removed, it would no longer be possible to get the unique identification associated with an object, but you could still compare the identity of two objects using `is`. Of course, I'm only talking about objects. In Java there are values which aren't objects, such as ints and floats. That's irrelevant for our discussion, because Python has no such values.

...

If you really want to associate unique numbers with objects in these languages, the common approach is to put them into an identity dictionary as keys.

...
It seems counter-productive to me to bother with an identity function which doesn't meet that constraint. If id(x) == id(y) implies nothing about x and y (they may, or may not, be the same object) then what's the point?

See James' explanation: it would be possible to use this as the foundation of an identity hash table.

I'm afraid James' explanation didn't shed any light on the question to me. It seems to me that Java's IdentityHashValue[sic -- I think the correct function name is actually IdentityHashCode] is equivalent to Python's hash(), not to Python's id(), and claiming it is related to identity is misleading and confusing. I don't think I'm alone here -- it seems to me that even among Java programmers, the same criticisms have been raised: http://bugs.sun.com/bugdatabase/view_bug.do?bug%5Fid=6321873 http://deepakjha.wordpress.com/2008/07/31/interesting-fact-about-identityhas... Like hash(), IdentityHashCode doesn't make any promises about identity at all. Two distinct objects could have the same hash code, and a further test is needed to distinguish them.

...

...
Why would you bother using that function when you could just use x == y instead?

Because in a hash table, you also need a hash value.

Well, sure, in a hash table you need a hash value. But I was talking about an id() function. So is that it? Is IdentityHashValue (or *Code, as the case may be) just a longer name for hash()? -- Steven

Antoine Pitrou

5:37 a.m.

On Sat, 4 Dec 2010 15:06:45 +1100 Cameron Simpson wrote:

...

On 03Dec2010 18:15, James Y Knight wrote: | On Dec 3, 2010, at 6:04 PM, Terry Reedy wrote: | > gc is implementation specific. CPython uses ref counting + cycle | > gc. A constraint on all implementations is that objects have a fixed, | > unique id during their lifetime. CPython uses the address as the id, so | > it cannot move objects. Other implementations do differently. Compacting | > gc requires an id to current address table or something. | | It's somewhat unfortuante that python has this constraint, instead of | the looser: "objects have a fixed id during their lifetime", which is | much easier to implement, and practically as useful.

Python doesn't have the constraint you imagine; it _does_ have "objects have a fixed id during their lifetime".

_CPython_ has this constraint because it uses the address as the id, which is free and needs no papping or extra storage.

Well, the main reason CPython has this constraint is that the C API mandates constant object pointers. That we can then reuse the pointer for id() is only a consequence of this. Regards Antoine.

"Martin v. Löwis"

6:22 a.m.

...

Surely even in Java or C#, objects have an *identity* even if the language doesn't provide a way to query their distinct *identification*.

When people said "the id of an object should this or that", they always meant identification, not identity (perhaps except for you). Nobody would propose that the identity (in your wording) of two distinct objects might be the same, or might change over time.

...

I'm afraid James' explanation didn't shed any light on the question to me. It seems to me that Java's IdentityHashValue[sic -- I think the correct function name is actually IdentityHashCode] is equivalent to Python's hash()

No, it's not. java.lang.Object.hashCode() is equivalent to Python's hash(). In particular, for both of these, the following requirement holds: object that *compare* equal must hash equal. This is not the case for the identity hash code, or Python's id function: objects that compare equal may still have different ids, or different java identity hash codes.

...

not to Python's id(), and claiming it is related to identity is misleading and confusing.

No, it's right on spot. This is *not* about the regular value hash, but about the identity hash.

...

Like hash(), IdentityHashCode doesn't make any promises about identity at all.

It sure does: calling identityHashCode on the very same object twice will always return the same value - even after a garbage collection. So if two references refer to the same object, calling identityHashCode on both of them gives the same value. This is similar to the relationship between equals and hashCode.

...

Two distinct objects could have the same hash code, and a further test is needed to distinguish them.

Correct. However, references with different identity hash codes will never refer to the same object.

...

...
...
Why would you bother using that function when you could just use x == y instead?

Because in a hash table, you also need a hash value.

Well, sure, in a hash table you need a hash value. But I was talking about an id() function.

So is that it? Is IdentityHashValue (or *Code, as the case may be) just a longer name for hash()?

Of course not. Regards, Martin

Stephen J. Turnbull

6:33 a.m.

Steven D'Aprano writes:

...

Martin v. Löwis wrote:

...

...
...
It seems counter-productive to me to bother with an identity function which doesn't meet that constraint. If id(x) == id(y) implies nothing about x and y (they may, or may not, be the same object) then what's the point?

See James' explanation: it would be possible to use this as the foundation of an identity hash table.

I'm afraid James' explanation didn't shed any light on the question to me. It seems to me that Java's IdentityHashValue[sic -- I think the correct function name is actually IdentityHashCode] is equivalent to Python's hash(), not to Python's id(), and claiming it is related to identity is misleading and confusing.

Not quite equivalent. Python's hash() obeys a certain additional constraint (that numeric values that compare equal have the same hash), and IdentityHashValue presumably won't work at all on numbers in Java (since they aren't objects). Bikeshedding aside, I have to agree with you: there's no point in calling such a function id() or IdentityAnything(). Everything you need to know (including both the fact that it could be an efficient way to search for a unique object, and the possible non-uniqueness of the object assigned that code) is contained in the word "hash".

Maciej Fijalkowski

7:01 a.m.

On Sat, Dec 4, 2010 at 1:37 PM, Antoine Pitrou wrote:

...

On Sat, 4 Dec 2010 15:06:45 +1100 Cameron Simpson wrote:

...
On 03Dec2010 18:15, James Y Knight wrote: | On Dec 3, 2010, at 6:04 PM, Terry Reedy wrote: | > gc is implementation specific. CPython uses ref counting + cycle | > gc. A constraint on all implementations is that objects have a fixed, | > unique id during their lifetime. CPython uses the address as the id, so | > it cannot move objects. Other implementations do differently. Compacting | > gc requires an id to current address table or something. | | It's somewhat unfortuante that python has this constraint, instead of | the looser: "objects have a fixed id during their lifetime", which is | much easier to implement, and practically as useful.

Python doesn't have the constraint you imagine; it _does_ have "objects have a fixed id during their lifetime".

_CPython_ has this constraint because it uses the address as the id, which is free and needs no papping or extra storage.

Well, the main reason CPython has this constraint is that the C API mandates constant object pointers. That we can then reuse the pointer for id() is only a consequence of this.

Regards

Antoine.

Yes. I think further discussion of Java-vs-Python makes no sense (and it's proven that C-API-less python can exist and be compatible)

Stephen J. Turnbull

8:22 a.m.

Quotes are out of order. "Martin v. Löwis" writes:

...

No, it's not. java.lang.Object.hashCode() is equivalent to Python's hash(). In particular, for both of these, the following requirement holds: object that *compare* equal must hash equal.

Aha. I see, now. That is indeed a different concept. Mutable objects retain their identities across mutations.

...

...
Two distinct objects could have the same hash code, and a further test is needed to distinguish them.

Correct. However, references with different identity hash codes will never refer to the same object.

Why is useful to expose an identity hash? AFAICS it is *only* useful in building an identity hash table. If so, why not just provide id() or the is operator or both and be done with it? This is different from exposing the value hash AFAICS, where it may be possible to optimize value hashing of composite objects by computing their hashes from the hashes of subobjects, perhaps omitting some subobjects that don't affect equality (eg, Emacs strings may carry display properties that are ignored when they are compared as strings). With identity, I don't see much point in leaving that up to the programmer. In implementations with stable object addresses, the address is a fine implementation of identity_hash() -- perfect, in fact. Otherwise, you either have to omit mutable members from the computation or store the hashcode in the object. In the former situation, you will likely get very coarse hashing, and the set of immutable members (if any ;-) can be computed by the compiler. In the latter, you may as well store an id directly and avoid the whole hash concept for that type (except as a space optimization, but that again implies coarse hashing).

...

[F]or the identity hash code, or Python's id function: objects that compare equal may still have different ids, or different java identity hash codes.

ISTM that's no problem, you have to do an extra comparison for identity when the codes are equal either way. Of course the identity hash code is a more precise optimization, but that doesn't make hash() unusable for this purpose (cf Sun's built-in implementation of IdentityHashCode that always returns 2). The problem presumably lies in the fact that there are "id()-ifiable" objects that, because they are mutable, either are unhashable (Python hash) or would have an unstable value hash (Lisp sxhash). Ie, AFAICS def identity_hash (thing): "Wrapper for hash() suitable for building an identity hash table." try: return hash(thing) except TypeError: return 0 would be a fine implementation for Python.<wink>

...

...
Like hash(), IdentityHashCode doesn't make any promises about identity at all.

It sure does: calling identityHashCode on the very same object twice will always return the same value - even after a garbage collection.

Well, if you want to put it that way, so does hash(). I read Steven as placing more emphasis on "like hash()" than on "at all".

"Martin v. Löwis"

1:37 p.m.

...

Why is useful to expose an identity hash? AFAICS it is *only* useful in building an identity hash table. If so, why not just provide id() or the is operator or both and be done with it?

That's precisely James' point: Java provides the identity hash *instead* of the id() function (i.e. it does not have an equivalent of id()). Doing so gives greater liberties in implementing Java. For example, an implementation that would only use the type and not the instance for identity hash would still be conforming (as would one that always returns 0).

...

With identity, I don't see much point in leaving that up to the programmer. In implementations with stable object addresses, the address is a fine implementation of identity_hash() -- perfect, in fact.

Of course. James' complaint is that Python-the-language mandates support for an id() function, though - a requirement that even implementations that don't have stable object addresses now must support. If Python mandated only identity hash properties of the id() function, alternative implementations could be simplified.

...

ISTM that's no problem, you have to do an extra comparison for identity when the codes are equal either way. Of course the identity hash code is a more precise optimization, but that doesn't make hash() unusable for this purpose (cf Sun's built-in implementation of IdentityHashCode that always returns 2).

That's only the case if the hash() result is guaranteed not to change. In some applications, it may be desirable to have that as an absolute guarantee (rather than just being a convention). No, you can't substitute identity hash with hash: the value hash of an object may change over time, whereas the identity hash must not. Regards, Martin

Stephen J. Turnbull

5 Dec 5 Dec

2:17 a.m.

"Martin v. Löwis" writes:

...

...
Why is useful to expose an identity hash? AFAICS it is *only* useful in building an identity hash table. If so, why not just provide id() or the is operator or both and be done with it?

That's precisely James' point: Java provides the identity hash *instead* of the id() function (i.e. it does not have an equivalent of id()). Doing so gives greater liberties in implementing Java.

Yes, we understand that it makes the implementer's job easier. *Why bother having an identity hash at all?* Having taken away id() and provided maximum leisure to the implementer via def identity_hash(object): return 42 is there *any* benefit left for the user/developer? All I see is costs: costs in implementation, costs in debugging. And AFAICS this is a problem that can be solved once and reused by everybody who needs id(); why does every developer need to write his own id() function?

Greg Ewing

4:29 p.m.

Martin v. Löwis wrote:

...

The hole C API would break if objects would move in memory. Since they have to stay at fixed addresses, it's easy enough to use the address as ID.

Yes. Some of the discussion here seems to be assuming that the reason Python doesn't move objects is so that it can use the address as a unique ID. But it's the other way around. -- Greg

4888

Age (days ago)

4890

Last active (days ago)

List overview

Download

25 comments

10 participants

participants (10)

"Martin v. Löwis"
Antoine Pitrou
Cameron Simpson
Dima Tisnek
Greg Ewing
James Y Knight
Maciej Fijalkowski
Stephen J. Turnbull
Steven D'Aprano
Terry Reedy

gc ideas -- sparse memory

Cameron Simpson

tags

participants (10)