Re: [pypy-dev] Object identity and dict strategies
What do we want to happen when somebody -- say in a C extension -- takes the id of an object that is scheduled to be removed when the gc next runs? Laura
On 07/10/2011 09:13 PM Laura Creighton wrote:
What do we want to happen when somebody -- say in a C extension -- takes the id of an object that is scheduled to be removed when the gc next runs?
IMO taking the id should increment the object ref counter and prevent the garbage collection, until the id value itself is garbage collected. The obvious way would be to make id an object that keeps a reference to the object whose id it represents. See below[1] for an example (just for discussion illustration).
Of course in low level access, conventions of ownership can sometimes safely optimize away actual ref incr/decr, but it sounds like your example proposes "taking an id" after ref count has gone to zero. That's like doing a reinterpret_cast to integer of a malloc pointer after the area's been freed, and expecting it to mean something. It should be an enforced no-no, if you ask me.
The example id attempts to make all equivalent immutables of the same type have the same id, by taking advantage of dict's key comparison properties and the .setdefault method. To get something like the current behaviour, with just the object reference to make an id hold its object, use IdHolder in place of Id. But then you can get all kinds of ids for the same immutable value, as with the old id.
_______________________________________________________________________________
# idstuff.py -- a concept-exploring toy re "id"
# 2011-07-10 22:58:46 bokr
#
class Id(object):
    refval = id # old id function
    objcache = {} # to store typed immutables for ref to first of equal-valued encountered
                  # and using old id of the first as id for all of same type and equal value.
    def __init__(self, obj):
        self.obj = obj
    def id(self):
        so=self.obj
        if type(so) in (int,float,bool,tuple,str,unicode): # etc?
            t = (type(so), so)
            return self.refval(self.objcache.setdefault(t, t)[1])
        else:
            return self.refval(so) # as now? XXX what is abstract meaning of id(some_mutable)?
    def __eq__(self, other):
        if type(self) != type(other):
            raise TypeError('Id instances can only be compared with each other,'
                            ' not to "%s" instances.'% type(other).__name__)
        tobj=type(self.obj)
        tother=type(other.obj)
        if tobj != tother: return False
        return self.id() == other.id()
    def __repr__(self):
        return '<Id(%r)==%i vs old %i>'%(self.obj, self.id(), self.refval(self.obj))
    def __str__(self):
        return '<Id(%s)>'%(self.obj,)

class IdHolder(object):
    refval = id # old id function
    def __init__(self, obj):
        self.obj = obj
        self.id = self.refval(obj)
    def __eq__(self, other):
        if type(self) != type(other):
            raise TypeError('IdHolder instances can only be compared with each other,'
                            ' not to "%s" instances.'% type(other).__name__)
        return self.id == other.id
    def __repr__(self):
        return '<IdHolder(%r)==%i>'%(self.obj, self.id)
    def __str__(self):
        return '<IdHolder(%s)>'%(self.obj,)
_______________________________________________________________________________
Python 2.7.2 (default, Jul 8 2011, 23:38:53)
[GCC 4.1.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> oldid=id
>>> from ut.idstuff import IdHolder as id
>>> from ut.idstuff import Id as idk # with caching for equal (type,value) immutables
>>> oldid(2000),oldid(2000),oldid(20*100)
(136189164, 136189164, 136189176)
>>> id(2000),id(2000),id(20*100) # no k cacheing
(<IdHolder(2000)==136189164>, <IdHolder(2000)==136189164>, <IdHolder(2000)==136189212>)
>>> idk(2000),idk(2000),idk(20*100) # with k cacheing
(<Id(2000)==136189236 vs old 136189236>, <Id(2000)==136189236 vs old 136189236>, <Id(2000)==136189236 vs old 136189140>)
>>> oldid([]),oldid([]) # dangling pointer value returned
(3083895948L, 3083895948L)
>>> id([]),id([]) # pointer kept live
(<IdHolder([])==3083895948>, <IdHolder([])==3083896108>)
>>> idk([]),idk([]) # pointer kept live, constant caching n/a
(<Id([])==3083784300 vs old 3083784300>, <Id([])==3083896908 vs old 3083896908>)
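For readers on Python 3, the value-caching idea in this toy can be sketched roughly as follows. This is a loose port and not the original idstuff.py: the names ValueId, _canonical and _CACHED_TYPES are made up for illustration, unicode no longer exists, and tuples are left out on the assumption that unhashable contents should not be able to raise.

```python
class ValueId:
    """Toy handle: equal immutables of one type share a single identity."""

    # Hypothetical cache: (type, value) -> first object seen with that value.
    _canonical = {}
    # Assumption: only these hashable, immutable types are value-cached.
    _CACHED_TYPES = (int, float, bool, str, bytes)

    def __init__(self, obj):
        self.obj = obj  # strong reference, so the id cannot go stale

    def value(self):
        obj = self.obj
        if type(obj) in self._CACHED_TYPES:
            # setdefault returns the first object stored for this key,
            # so every later equal value maps to that object's id.
            first = self._canonical.setdefault((type(obj), obj), obj)
            return id(first)
        return id(obj)  # other objects keep their ordinary id

    def __eq__(self, other):
        if not isinstance(other, ValueId):
            return NotImplemented
        return (type(self.obj) is type(other.obj)
                and self.value() == other.value())

    def __repr__(self):
        return "<ValueId(%r)==%d>" % (self.obj, self.value())
```

Note that the cache in this sketch never shrinks, so every distinct value seen stays alive for the life of the process; that cost is part of what the rest of the thread argues about.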
Have fun ;-) Regards Bengt Richter
Laura
=
On 11 July 2011 20:29, Bengt Richter <bokr@oz.net> wrote:
On 07/10/2011 09:13 PM Laura Creighton wrote:
What do we want to happen when somebody -- say in a C extension -- takes the id of an object that is scheduled to be removed when the gc next runs?
IMO taking the id should increment the object ref counter and prevent the garbage collection, until the id value itself is garbage collected.
This significantly changes the meaning of id() in a way that will break existing code. If you want an object reference, just use one. If you want them to be persistent, build a dictionary from id to object. You can already do this yourself in pure python, and it doesn't have the side-effect of bloating id(). Otherwise, such a suggestion should go through the usual process for such a significant change to a language primitive. -- William Leslie
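The dictionary approach mentioned here might look like the sketch below. The class name IdRegistry and its method names are invented for this illustration; the point is only that the dictionary, not id(), holds the reference.

```python
class IdRegistry:
    """Keeps objects alive for as long as their id is registered."""

    def __init__(self):
        self._by_id = {}

    def pin(self, obj):
        """Store obj and return its id; the id stays valid until release()."""
        key = id(obj)
        self._by_id[key] = obj
        return key

    def lookup(self, key):
        """Return the object registered under this id."""
        return self._by_id[key]

    def release(self, key):
        """Drop the registry's reference; the id may be reused afterwards."""
        del self._by_id[key]


registry = IdRegistry()
first = registry.pin([0])
second = registry.pin([1])
# Both lists are still referenced by the registry, so their ids must differ.
print(first != second)
```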
On 07/11/2011 01:36 PM William ML Leslie wrote:
On 11 July 2011 20:29, Bengt Richter<bokr@oz.net> wrote:
On 07/10/2011 09:13 PM Laura Creighton wrote:
What do we want to happen when somebody -- say in a C extension -- takes the id of an object that is scheduled to be removed when the gc next runs?
IMO taking the id should increment the object ref counter and prevent the garbage collection, until the id value itself is garbage collected.
This significantly changes the meaning of id() in a way that will break existing code.
Do you have an example of existing code that depends on the integer-cast value of a dangling pointer?? Or do you mean that id's must be allowed to be compared == to integers, which my example prohibits? (I didn't define __cmp__, BTW, just lazy ;-)
If you want an object reference, just use one. If you want them to be persistent, build a dictionary from id to object.
Yes, dictionary is one way to bind an object and thus make sure its id is valid.
But it would be overkill to use a dictionary to guarantee object id persistence just for the duration of an expression such as
id(x.a) == id(y.a)
It might be unusual, but the .a could be a property returning a dynamic value and you might be testing to see if the two return the same object, as they might if the property getter caches such values. Perhaps it is a test to verify that you have the caching version of the app. Artificial example perhaps, but current id could give false results, with id just pointing to the same dead temp space. That's an example of code that would *fail* with the current id, and be ok with id as either of Id or IdHolder.
As it stands, the integer returned by id persists during evaluation of an expression at least, but its validity does not necessarily last with the value even that long, as we see from the perverse (but easily explained) example:
>>> id([0]) == id([1])
True
>>> id([0]), id([1])
(3084230700L, 3084230700L)
So at a minimum, I would think the documentation should say that an id call may return a value implicitly referencing garbage, besides hinting that there may be peculiarities about id-ing some objects.
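The effect can be reproduced in a small script. In this sketch only the comparison of bound objects is guaranteed by the language; whether the two temporaries really end up at the same address depends on the interpreter and its allocator, so that result is reported rather than asserted. The function names are illustrative only.

```python
def ids_of_bound_objects():
    """Both lists stay referenced while their ids are taken."""
    a = [0]
    b = [1]
    return id(a), id(b)  # lifetimes overlap, so the ids must differ


def ids_of_temporaries():
    """Each list may be freed as soon as id() returns."""
    return id([0]), id([1])  # may or may not be equal; do not rely on it


left, right = ids_of_bound_objects()
print("bound objects share an id:", left == right)

left, right = ids_of_temporaries()
print("temporaries share an id:", left == right)
```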
You can already do this yourself in pure python, and it doesn't have the side-effect of bloating id().
My examples *are* in pure python ;-)
Otherwise, such a suggestion should go through the usual process for such a significant change to a language primitive.
Sure, but I only really want to understand the real (well, *intended* ;-) meaning of the id function, so I am putting forth illustrative examples to identify aspects of its current and possible behavior. Also, a new id could live alongside the old ;-) Regards, Bengt Richter
-- William Leslie
On 11 July 2011 23:21, Bengt Richter <bokr@oz.net> wrote:
On 07/11/2011 01:36 PM William ML Leslie wrote:
On 11 July 2011 20:29, Bengt Richter<bokr@oz.net> wrote:
On 07/10/2011 09:13 PM Laura Creighton wrote:
What do we want to happen when somebody -- say in a C extension -- takes the id of an object that is scheduled to be removed when the gc next runs?
IMO taking the id should increment the object ref counter and prevent the garbage collection, until the id value itself is garbage collected.
This significantly changes the meaning of id() in a way that will break existing code.
Do you have an example of existing code that depends on the integer-cast value of a dangling pointer??
I mean that id creating a reference will break existing code. id() has always returned an integer, and the existence of some integer in some python code has never prevented some otherwise unrelated object from being collected. Existing code will not make sure that it cleans up the return value of id(), as nowhere has id() ever kept a reference to the object passed in. I know that you are suggesting that id returns something that is /not/ an integer, but that is also a language change. People have always been able to assume that they can % format ids as decimals or hexadecimals.
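Both properties relied on here can be checked directly: the result is a plain integer that formats with %, and holding it does not keep the object alive. A minimal sketch, using a throwaway class Thing (lists cannot be weakly referenced) and an explicit gc.collect() for interpreters that do not use reference counting:

```python
import gc
import weakref


class Thing(object):
    """Throwaway class; its instances support weak references."""


def take_id_then_drop():
    obj = Thing()
    ident = id(obj)            # plain integer, holds no reference to obj
    probe = weakref.ref(obj)   # lets us observe when obj is collected
    del obj                    # drop the only strong reference
    gc.collect()               # needed on non-refcounting interpreters
    return ident, probe() is None


ident, collected = take_id_then_drop()
print("id as decimal: %d, as hex: %x" % (ident, ident))
print("object collected while we still hold its id:", collected)
```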
Or do you mean that id's must be allowed to be compared == to integers, which my example prohibits? (I didn't define __cmp__, BTW, just lazy ;-)
Good, __cmp__ has been deprecated for over 10 years now.
If you want an object reference, just use one. If you want them to be persistent, build a dictionary from id to object.
Yes, dictionary is one way to bind an object and thus make sure its id is valid.
But it would be overkill to use a dictionary to guarantee object id persistence just for the duration of an expression such as id(x.a) == id(y.a)
But id is not about persistence. The lack of persistence is one of its key features. That said, I do think id()'s current behaviour is overkill. I just don't think we can change it in a way that will fit existing usage. And cleaning it up properly is far too much work.
You can already do this yourself in pure python, and it doesn't have the side-effect of bloating id().
My examples *are* in pure python ;-)
As is copy.py. We've seen several examples on this thread where you can build additional features on top of what id() gives you without changing id(). So we've no need to break id() in any of the ways that have been suggested here.
Otherwise, such a suggestion should go through the usual process for such a significant change to a language primitive.
Sure, but I only really want to understand the real (well, *intended* ;-) meaning of the id function, so I am putting forth illustrative examples to identify aspects of its current and possible behavior.
The notion of identity is important in any stateful language. Referential equivalence, which is a slightly more complicated (yet much better defined) idea, says that x and y are equivalent when no operation can tell the difference between the two objects. 'is' is an approximation that is at least accurate for mutability of python objects. In order for x to "be" y, assignments like x.__class__ = Foo must have exactly the same effect as y.__class__ = Foo. You could presumably write a type in the implementation language that was in no way discernible from the real x, but if x is y, you *know* there is no difference.
What id() does is attempt to distil 'the thing compared' when 'is' is used. On cpython, it just returns the integer value of the pointer to the object, because on cpython that is cheap and does the job (and hey, it *is* the thing compared when you do 'is' on cpython).
On pypy, things are slightly more complicated. Pypy is written in python, which has no concept of pointers. It translates to the JVM and the (safe) CLI, neither of which has a direct analogue of the pointer. And even when C or LLVM is used as the backend, objects may move around in memory. Having id() return different values after a collection cycle would be very confusing. So, pypy implements its own, quite clever mechanism for creating ids. It is described in a blog post, if you'd like to read it.
The definition of id(), according to docs.python.org, is:
Return the “identity” of an object. This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
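The documented guarantee can be exercised with live objects. The sketch below uses only builtins; the helper name same_object is made up for the illustration.

```python
def same_object(x, y):
    """For two live objects, equal ids mean exactly what 'is' means."""
    return id(x) == id(y)


a = [1, 2]
b = a          # second name for the same list
c = list(a)    # equal contents, but a distinct object

print(same_object(a, b), a is b)   # same object under two names
print(same_object(a, c), a == c)   # different object, equal value

before = id(a)
a.append(3)    # mutation does not change identity
print(id(a) == before)
```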
Also, a new id could live alongside the old ;-)
It's just that the problems you are attempting to fix are already solved, and they are only vaguely related to what a python programmer understands id() to mean. If, according to cpython, "1003 is not 1000 + 3", then programmers can't rely on any excellent new behaviour for id() *anyway*. OTOH, the "identity may not even be preserved for primitive types" issue is an observable difference to cpython and is fixable, even if it is a silly thing to rely on. -- William Leslie
On 12 July 2011 00:21, William ML Leslie <william.leslie.ttg@gmail.com> wrote:
Referential equivalence, which is a slightly more complicated (yet much better defined) idea says that x and y are equivalent when no operation can tell the difference between the two objects.
Ack, sorry. I meant Referential Transparency; which is much more googleable! -- William Leslie
On 07/11/2011 04:21 PM William ML Leslie wrote:
On 11 July 2011 23:21, Bengt Richter<bokr@oz.net> wrote:
On 07/11/2011 01:36 PM William ML Leslie wrote:
On 11 July 2011 20:29, Bengt Richter<bokr@oz.net> wrote:
On 07/10/2011 09:13 PM Laura Creighton wrote:
What do we want to happen when somebody -- say in a C extension -- takes the id of an object that is scheduled to be removed when the gc next runs?
IMO taking the id should increment the object ref counter and prevent the garbage collection, until the id value itself is garbage collected.
This significantly changes the meaning of id() in a way that will break existing code.
Do you have an example of existing code that depends on the integer-cast value of a dangling pointer??
I mean that id creating a reference will break existing code. id() has always returned an integer, and the existence of some integer in some python code has never prevented some otherwise unrelated object from being collected. Existing code will not make sure that it cleans up the return value of id(), as nowhere has id() ever kept a reference to the object passed in.
Ok, d'oh ;-/
I was focused on making sure the id value "referred" to an existing live object *when returned from id*. (It is of course live when passed to id while bound in id's argument -- but if that is the *only* binding, then the object is *guaranteed* to be garbage when id returns the integer, and thus that integer is IMO meaningless except as a debugging peek at implementation, and it would be an *error* for a program to depend on its value.)
[10:12 ~]$ python -c 'import this'|grep -A1 Errors
Errors should never pass silently.
Unless explicitly silenced.
You are right that existing code could and some probably would break if id guarantees validity of the integer by holding the object, so I will go with the first alternative I mentioned in my reply to Armin, and focus on preventing return of the id of garbage rather than the "or else..." option which is impractical and is likely to break code, as you say.
<excerpt pasted as quote>
Letting the expression result die and returning a kind of pointer to where the result object *was* seems like a dangling pointer problem, except I guess you can't dereference an id value (without hackery).
Maybe id should raise an exception if the argument referenced only has a ref count of 1 (i.e., just the reference from the argument list)?
Or else let id be a class and return a minimal instance only binding the passed object, and customize the compare ops to take into account type diffs etc.? Then there would be no id values without corresponding objects, and id values used in expressions would die a natural death, along with their references to their objects -- whether "variables" or expressions.
Sorry to belabor the obvious ;-) </excerpt>
Rather than exception, perhaps returning a None would suffice, analogous to a null pointer where no valid pointer can be returned. That should be cheap. It could also be used in answer to Laura's question, to which I only proposed the impractical id object.
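A related way to get a "null" answer, without id() itself holding the object, is to pair the integer with a weak reference. This is a different technique from the ref-count check proposed here, sketched under the assumption that the objects involved support weak references (many builtins such as list and int do not). WeakId and Thing are hypothetical names.

```python
import gc
import weakref


class WeakId(object):
    """Remembers an object's id but reports None once the object is gone."""

    def __init__(self, obj):
        self._ident = id(obj)
        self._probe = weakref.ref(obj)  # does not keep obj alive

    def get(self):
        # Return the id only while the object it belonged to still exists.
        return self._ident if self._probe() is not None else None


class Thing(object):
    """Throwaway class whose instances can be weakly referenced."""


thing = Thing()
handle = WeakId(thing)
print(handle.get() == id(thing))

del thing
gc.collect()  # for interpreters without reference counting
print(handle.get())
```

The handle is not an integer, so it shares the compatibility problem raised against the id-as-object idea; it only shows that "id of garbage" can be detected without pinning the object.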
I know that you are suggesting that id returns something that is /not/ an integer, but that is also a language change. People have always been able to assume that they can % format ids as decimals or hexadecimals.
I thought of subclassing int, but was reaching for an id abstraction more than a practical thing, sorry. But never mind the id-as-object idea for current python ;-)
Or do you mean that id's must be allowed to be compared == to integers, which my example prohibits? (I didn't define __cmp__, BTW, just lazy ;-)
Good, __cmp__ has been deprecated for over 10 years now.
The only sensible sort on id's I can think of off hand would be if id's carried a time stamp.
If you want an object reference, just use one. If you want them to be persistent, build a dictionary from id to object.
Yes, dictionary is one way to bind an object and thus make sure its id is valid.
But it would be overkill to use a dictionary to guarantee object id persistence just for the duration of an expression such as id(x.a) == id(y.a)
But id is not about persistence. The lack of persistence is one of its key features.
That said, I do think id()'s current behaviour is overkill. I just don't think we can change it in a way that will fit existing usage. And cleaning it up properly is far too much work.
How about just returning None when id sees an object which no other code will be able to see when id returns (hence making the integer the id of garbage)? <snip>
The definition of id(), according to docs.python.org, is:
Return the “identity” of an object. This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
Hm, I couldn't find that, googling <a few strings from the above> site:python.org Nor at site:docs.python.org. Maybe from a non-current version of docs? But never mind.
Also, a new id could live alongside the old ;-)
It's just that the problems you are attempting to fix are already solved, and they are only vaguely related to what a python programmer understands id() to mean. If, according to cpython, "1003 is not 1000 + 3", then programmers can't rely on any excellent new behaviour for id() *anyway*.
My question to Armin was whether doing what cpython 2.7 does meant following the vagaries of possible optimizations. E.g., if space for constants were slightly modified, cpython would return False for "1003 is not 1000 + 3". 1000+3 is apparently already folded to a constant 1003, but apparently local constants are currently allowed to be duplicated, as you see in the disassembly of your example:
>>> from ut.miscutil import disev
>>> 1003 is not 1000 + 3
True
>>> disev("1003 is not 1000 + 3")
  1           0 LOAD_CONST               0 (1003)
              3 LOAD_CONST               3 (1003)
              6 COMPARE_OP               9 (is not)
              9 RETURN_VALUE
It would seem you could generate quite a few equivalent constants:
>>> disev('[1000+3,1000+3,1000+3,1000+3,1000+3]')
  1           0 LOAD_CONST               2 (1003)
              3 LOAD_CONST               3 (1003)
              6 LOAD_CONST               4 (1003)
              9 LOAD_CONST               5 (1003)
             12 LOAD_CONST               6 (1003)
             15 BUILD_LIST               5
             18 RETURN_VALUE
which sooner or later someone will probably find a reason to optimize for space, and what does that mean for the *"language"* definition of id?
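disev comes from a private ut.miscutil module that is not shown in the thread. A plausible stand-in, assuming it simply compiles an expression in 'eval' mode and disassembles the result, can be built from the standard dis module. The name dis_expr is invented here, and the bytecode printed by a current interpreter will differ from the 2.7 listings in this message.

```python
import dis
import io


def dis_expr(source):
    """Compile an expression and return (code object, disassembly text)."""
    code = compile(source, "<expr>", "eval")
    buffer = io.StringIO()
    dis.dis(code, file=buffer)  # write the listing into the buffer
    return code, buffer.getvalue()


code, listing = dis_expr("[1000 + 3, 1000 + 3]")
print(listing)
# The constants table shows which constants this interpreter stored.
print(code.co_consts)
```

Inspecting co_consts alongside the listing is a quick way to see whether a given interpreter version keeps duplicate folded constants or shares one entry.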
OTOH, the "identity may not even be preserved for primitive types" issue is an observable difference to cpython and is fixable, even if it is a silly thing to rely on.
Apparently the folding of expressions yielding e.g. small integers involves generating a reference to the single instance.
Hm. I downloaded pypy and it does optimize constant storage for 1003 is 1000+3
[11:03 ~]$ pypy
pypy: /usr/lib/libcrypto.so.0.9.8: no version information available (required by pypy)
pypy: /usr/lib/libssl.so.0.9.8: no version information available (required by pypy)
Python 2.7.1 (b590cf6de419, Apr 30 2011, 02:00:38)
[PyPy 1.5.0-alpha0 with GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``psyco eats one brain per inch of progress''
>>>> 1003 is 1000+3
True
>>>> from ut.miscutil import disev
>>>> disev('1003 is 1000+3')
  1           0 LOAD_CONST               0 (1003)
              3 LOAD_CONST               0 (1003)
              6 COMPARE_OP               8 (is)
              9 RETURN_VALUE
Let's see what the id values are:
>>>> id(1003), id(1000+3)
(-1216202084, -1216202084)
>>>> disev('id(1003), id(1000+3)')
  1           0 LOAD_NAME                0 (id)
              3 LOAD_CONST               0 (1003)
              6 CALL_FUNCTION            1
              9 LOAD_NAME                0 (id)
             12 LOAD_CONST               0 (1003)
             15 CALL_FUNCTION            1
             18 BUILD_TUPLE              2
             21 RETURN_VALUE
Vs cpython 2.7.2:
>>> id(1003), id(1000+3) # different garbage ;-)
(136814932, 136814848)
>>> disev('id(1003), id(1000+3) # different garbage ;-)')
  1           0 LOAD_NAME                0 (id)
              3 LOAD_CONST               0 (1003)
              6 CALL_FUNCTION            1
              9 LOAD_NAME                0 (id)
             12 LOAD_CONST               3 (1003)
             15 CALL_FUNCTION            1
             18 BUILD_TUPLE              2
             21 RETURN_VALUE
Of course, the id's are all still id's of garbage locations once returned from id ;-) So how about returning None instead of id's of garbage, or raising an exception? Would that not be pythonic? Regards, Bengt Richter
On 07/13/2011 11:37 AM Bengt Richter wrote: <snip>...</snip>
Hm. I downloaded pypy and it does optimize constant storage for 1003 is 1000+3
<snip>...</snip>
Of course, the id's are all still id's of garbage locations once returned from id ;-)
So how about returning None instead of id's of garbage, or raising an exception? Would that not be pythonic?
Hm, other than practicality beating purity ;-/ Sorry to be commenting on myself, but a further thought: In a way, a constant could be considered a specially-named immutable variable, e.g., "1003" "names" 1003, so one could consider the id of an arbitrary constant (even if perhaps only "named" and its value referenced in bytecode due to constant folding from source expressions) to be the id of a live object. But I still think there will be examples of id arguments that will turn to garbage as soon as id returns -- and renders the id erroneous for any use besides a debugging peek at memory usage. I.e., there will be temp objects with no persistence beyond the scope of the argument use within id, other than while ahead in a race with garbage collection. Is there an easy way to check on whether the argument *only* has the one reference from id's arg list? When is a ref count not available? Do simple atomic constants have them? E.g., small integers vs bigger? And True, False, None? ()? and []? Regards, Bengt Richter
participants (3)
- Bengt Richter
- Laura Creighton
- William ML Leslie