[pypy-dev] Object identity and dict strategies

Bengt Richter bokr at oz.net
Wed Jul 13 11:37:01 CEST 2011


On 07/11/2011 04:21 PM William ML Leslie wrote:
> On 11 July 2011 23:21, Bengt Richter<bokr at oz.net>  wrote:
>> On 07/11/2011 01:36 PM William ML Leslie wrote:
>>>
>>> On 11 July 2011 20:29, Bengt Richter<bokr at oz.net>    wrote:
>>>>
>>>> On 07/10/2011 09:13 PM Laura Creighton wrote:
>>>>>
>>>>> What do we want to happen when somebody -- say in a C extension -- takes
>>>>> the id of an object
>>>>> that is scheduled to be removed when the gc next runs?
>>>>
>>>> IMO taking the id should increment the object ref counter
>>>> and prevent the garbage collection, until the id value itself is garbage
>>>> collected.
>>>
>>> This significantly changes the meaning of id() in a way that will
>>> break existing code.
>>>
>> Do you have an example of existing code that depends on the integer-cast
>> value of a dangling pointer??
>
> I mean that id creating a reference will break existing code.  id()
> has always returned an integer, and the existence of some integer in
> some python code has never prevented some otherwise unrelated object
> from being collected.  Existing code will not make sure that it cleans
> up the return value of id(), as nowhere has id() ever kept a reference
> to the object passed in.
Ok, d'oh ;-/

I was focused on making sure the id value "referred" to an existing live object
*when returned from id*. The object is of course live when passed to id, while bound
in id's argument. But if that is the *only* binding, then the object is *guaranteed*
to be garbage when id returns the integer, and thus that integer is IMO meaningless
except as a debugging peek at implementation, and it would be an *error* for a program
to depend on its value.
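
As a rough illustration of that point (a sketch only: `Thing` and `id_and_ref`
are made-up names, and the weak reference is just a way to observe the object's
death without keeping it alive):

```python
import gc
import weakref


class Thing(object):
    """Placeholder class; plain object() instances cannot be weakly referenced."""


def id_and_ref(obj):
    # obj is live here, bound to the parameter, so id(obj) is valid *now*.
    return id(obj), weakref.ref(obj)


# The only binding of the Thing instance is the argument itself.
ident, ref = id_and_ref(Thing())
gc.collect()  # not needed with refcounting; evens things out on a tracing gc

print(ref() is None)  # whether the object is already gone
print(ident)          # still an ordinary integer, naming nothing in particular
```

Nothing in the integer itself says whether the object it once named still exists.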

[10:12 ~]$ python -c 'import this'|grep -A1 Errors
Errors should never pass silently.
Unless explicitly silenced.

You are right that existing code could and some probably would break if id guarantees
validity of the integer by holding the object, so I will go with the first alternative
I mentioned in my reply to Armin, and focus on preventing return of the id of garbage
rather than the "or else..." option which is impractical and is likely to break code, as you say.

<excerpt pasted as quote>
> Letting the expression result die and returning a kind of pointer
> to where the result object *was* seems like a dangling pointer problem,
> except I guess you can't dereference an id value (without hackery).
>
> Maybe id should raise an exception if the argument referenced only has
> a ref count of 1 (i.e., just the reference from the argument list)?
>
> Or else let id be a class and return a minimal instance only binding
> the passed object, and customize the compare ops to take into account
> type diffs etc.? Then there would be no id values without corresponding
> objects, and id values used in expressions would die a natural death,
> along with their references to their objects -- whether "variables"
> or expressions.
>
> Sorry to belabor the obvious ;-)
</excerpt>

Rather than raising an exception, perhaps returning None would suffice, analogous
to a null pointer where no valid pointer can be returned. That should be cheap.

It could also be used in answer to Laura's question, to which I only proposed
the impractical id object.
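
For the record, the id-object idea from the excerpt might look roughly like this.
It is only a sketch of the abstraction: the class name `Id` and its details are
my assumptions, not anything that exists in cpython or pypy.

```python
class Id(object):
    """Hypothetical identity token: holds its object, compares by identity."""

    __slots__ = ("_obj",)

    def __init__(self, obj):
        self._obj = obj  # strong reference keeps the object alive

    def __eq__(self, other):
        if not isinstance(other, Id):
            return NotImplemented  # deliberately not comparable to plain ints
        return self._obj is other._obj

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(id(self._obj))  # stable while the token holds the object


a = []
print(Id(a) == Id(a))    # same object on both sides
print(Id([]) == Id([]))  # two distinct lists, both alive during the comparison
```

The strong reference in `_obj` is exactly what makes this impractical as a
replacement for id: any code that stores such a token keeps the object alive.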

>
> I know that you are suggesting that id returns something that is /not/
> an integer, but that is also a language change.  People have always
> been able to assume that they can % format ids as decimals or
> hexadecimals.
I thought of subclassing int, but was reaching for an id abstraction more
than a practical thing, sorry. But never mind the id-as-object idea for current python ;-)
>
>> Or do you mean that id's must be allowed to be compared == to integers,
>> which my example prohibits? (I didn't define __cmp__, BTW, just lazy ;-)
>
> Good, __cmp__ has been deprecated for over 10 years now.
>
The only sensible sort on id's I can think of off hand would be if id's carried
a time stamp.

>>> If you want an object reference, just use one.  If you want them to be
>>> persistent, build a dictionary from id to object.
>>
>> Yes, dictionary is one way to bind an object and thus make sure its id is
>> valid.
>>
>> But it would be overkill to use a dictionary to guarantee object id
>> persistence
>> just for the duration of an expression such as id(x.a) == id(y.a)
>
> But id is not about persistence. The lack of persistence is one of its
> key features.
>
> That said, I do think id()'s current behaviour is overkill.  I just
> don't think we can change it in a way that will fit existing usage.
> And cleaning it up properly is far too much work.
>
How about just returning None when id sees an object which no other
code will be able to see when id returns (hence making the integer
the id of garbage)?
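
For the expression case quoted above, id(x.a) == id(y.a), here is a small sketch
of the difference between comparing ids of temporaries and comparing live objects.
`Box` is a made-up class whose attribute returns a fresh object on each access;
the last result is deliberately left unstated because it depends on the
implementation.

```python
class Box(object):
    """Hypothetical class whose attribute builds a fresh object on every access."""

    @property
    def a(self):
        return [1, 2, 3]


x, y = Box(), Box()

# Both operands are held on the evaluation stack until `is` has run,
# so this compares two live objects.
print(x.a is y.a)

# Keeping explicit references gives ids that are valid for as long as needed.
xa, ya = x.a, y.a
print(id(xa) == id(ya))

# Here each temporary may be freed before the next one is created, so the
# result is an implementation detail and not something to rely on.
maybe_equal = id(x.a) == id(y.a)
print(maybe_equal)
```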

<snip>
>
> The definition of id(), according to docs.python.org, is:
>
> Return the “identity” of an object. This is an integer (or long
> integer) which is guaranteed to be unique and constant for this object
> during its lifetime. Two objects with non-overlapping lifetimes may
> have the same id() value.
>
Hm, I couldn't find that, googling <a few strings from the above> site:python.org
Nor at site:docs.python.org. Maybe from a non-current version of docs? But never mind.

>> Also, a new id could live alongside the old ;-)
>
> It's just that the problems you are attempting to fix are already
> solved, and they are only vaguely related to what a python programmer
> understands id() to mean.  If, according to cpython, "1003 is not 1000
> + 3", then programmers can't rely on any excellent new behaviour for
> id() *anyway*.
My question to Armin was whether doing what cpython 2.7 does meant following
the vagaries of possible optimizations. E.g., if space for constants were
slightly modified, cpython would return False for "1003 is not 1000 +3".
1000+3 is apparently already folded to a constant 1003, but apparently
local constants are currently allowed to be duplicated, as you see in
the disassembly of your example:

 >>> from ut.miscutil import disev
 >>> 1003 is not 1000 + 3
True
 >>> disev("1003 is not 1000 + 3")
   1           0 LOAD_CONST               0 (1003)
               3 LOAD_CONST               3 (1003)
               6 COMPARE_OP               9 (is not)
               9 RETURN_VALUE
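
(disev comes from my own utility module, which is not included here. Assuming it
does nothing more than compile the string in eval mode and disassemble the
result, a minimal stand-in would be something like the following; the `file`
parameter is my addition for capturing the output.)

```python
import dis
import sys


def disev(expr, file=None):
    """Disassemble expr compiled as a single expression (assumed behavior)."""
    code = compile(expr, "<disev>", "eval")
    out = sys.stdout if file is None else file
    # Temporarily redirect stdout so this works with dis.dis on old and new Pythons.
    saved, sys.stdout = sys.stdout, out
    try:
        dis.dis(code)
    finally:
        sys.stdout = saved


disev("a is not b")
```

The opcode names and offsets printed will differ between interpreter versions.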

It would seem you could generate quite a few equivalent constants:
 >>> disev('[1000+3,1000+3,1000+3,1000+3,1000+3]')
   1           0 LOAD_CONST               2 (1003)
               3 LOAD_CONST               3 (1003)
               6 LOAD_CONST               4 (1003)
               9 LOAD_CONST               5 (1003)
              12 LOAD_CONST               6 (1003)
              15 BUILD_LIST               5
              18 RETURN_VALUE
which sooner or later someone will probably find a reason to optimize for space,
and what does that mean for the *"language"* definition of id?
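
One way to watch for such a change without reading disassembly is to count the
matching entries in the code object's co_consts. This is only a sketch;
`const_slots` is a made-up helper, and the counts it reports vary between
interpreters and versions, which is rather the point.

```python
def const_slots(expr, value):
    """Return the entries of co_consts that equal value and have its exact type."""
    code = compile(expr, "<expr>", "eval")
    return [c for c in code.co_consts if type(c) is type(value) and c == value]


for expr in ("1003", "1000 + 3", "[1000+3, 1000+3, 1000+3]"):
    print("%-28s %d slot(s) holding 1003" % (expr, len(const_slots(expr, 1003))))
```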

>
> OTOH, the "identity may not even be preserved for primitive types"
> issue is an observable difference to cpython and is fixable, even if
> it is a silly thing to rely on.
>
Apparently the folding of expressions yielding e.g. small integers involves
generating a reference to the single instance.

Hm. I downloaded pypy and it does optimize constant storage for 1003 is 1000+3

[11:03 ~]$ pypy

pypy: /usr/lib/libcrypto.so.0.9.8: no version information available (required by pypy)
pypy: /usr/lib/libssl.so.0.9.8: no version information available (required by pypy)
Python 2.7.1 (b590cf6de419, Apr 30 2011, 02:00:38)
[PyPy 1.5.0-alpha0 with GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``psyco eats one brain per inch of
progress''
 >>>>
 >>>> 1003 is 1000+3
True
 >>>> from ut.miscutil import disev
 >>>> disev('1003 is 1000+3')
   1           0 LOAD_CONST               0 (1003)
               3 LOAD_CONST               0 (1003)
               6 COMPARE_OP               8 (is)
               9 RETURN_VALUE
 >>>>

Let's see what the id values are:

 >>>> id(1003), id(1000+3)
(-1216202084, -1216202084)
 >>>> disev('id(1003), id(1000+3)')
   1           0 LOAD_NAME                0 (id)
               3 LOAD_CONST               0 (1003)
               6 CALL_FUNCTION            1
               9 LOAD_NAME                0 (id)
              12 LOAD_CONST               0 (1003)
              15 CALL_FUNCTION            1
              18 BUILD_TUPLE              2
              21 RETURN_VALUE
 >>>>

Vs cpython 2.7.2:

 >>> id(1003), id(1000+3)  # different garbage ;-)
(136814932, 136814848)
 >>> disev('id(1003), id(1000+3)  # different garbage ;-)')
   1           0 LOAD_NAME                0 (id)
               3 LOAD_CONST               0 (1003)
               6 CALL_FUNCTION            1
               9 LOAD_NAME                0 (id)
              12 LOAD_CONST               3 (1003)
              15 CALL_FUNCTION            1
              18 BUILD_TUPLE              2
              21 RETURN_VALUE
 >>>

Of course, the id's are all still id's of garbage locations
once returned from id ;-)

So how about returning None instead of id's of garbage,
or raising an exception? Would that not be pythonic?

Regards,
Bengt Richter


