[pypy-dev] Object identity and dict strategies

Mon Jul 11 16:21:15 CEST 2011

On 11 July 2011 23:21, Bengt Richter <bokr at oz.net> wrote:
> On 07/11/2011 01:36 PM William ML Leslie wrote:
>>
>> On 11 July 2011 20:29, Bengt Richter <bokr at oz.net> wrote:
>>>
>>> On 07/10/2011 09:13 PM Laura Creighton wrote:
>>>>
>>>> What do we want to happen when somebody -- say in a C extension --
>>>> takes the id of an object that is scheduled to be removed when the
>>>> gc next runs?
>>>
>>> IMO taking the id should increment the object ref counter and
>>> prevent the garbage collection, until the id value itself is
>>> garbage collected.
>>
>> This significantly changes the meaning of id() in a way that will
>> break existing code.
>>
> Do you have an example of existing code that depends on the integer-cast
> value of a dangling pointer??

I mean that making id() create a reference will break existing code.  id()
has always returned an integer, and the existence of some integer in
some python code has never prevented some otherwise unrelated object
from being collected.  Existing code will not make sure that it cleans
up the return value of id(), as nowhere has id() ever kept a reference
to the object passed in.
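
A minimal sketch of that point, assuming a throwaway class (Node is a
hypothetical name) and using weakref only to observe whether the object
is still alive:

```python
import gc
import weakref


class Node:
    """Hypothetical placeholder; any weak-referenceable object works."""


obj = Node()
ident = id(obj)            # a plain integer, not a reference to obj
probe = weakref.ref(obj)   # observes obj without keeping it alive

del obj                    # drop the only strong reference
gc.collect()               # needed on pypy; cpython frees at the del

print(type(ident).__name__)  # the integer outlives the object
print(probe() is None)       # the object itself has been collected
```

Holding on to ident does nothing to keep the object around, which is
exactly what existing code takes for granted.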

I know that you are suggesting that id() return something that is /not/
an integer, but that is also a language change.  People have always
been able to assume that they can % format ids as decimals or
hexadecimals.
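
For illustration, this is the sort of formatting existing code does
with ids (a sketch; the variable names are mine):

```python
obj = object()
ident = id(obj)

as_decimal = "%d" % ident   # works because id() returns an integer
as_hex = "%x" % ident

print(as_decimal, as_hex)
```

Both lines would fail with a TypeError if id() returned some opaque
handle that did not behave as an integer.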

> Or do you mean that id's must be allowed to be compared == to integers,
> which my example prohibits? (I didn't define __cmp__, BTW, just lazy ;-)

Good, __cmp__ has been deprecated for over 10 years now.

>> If you want an object reference, just use one.  If you want them to be
>> persistent, build a dictionary from id to object.
>
> Yes, dictionary is one way to bind an object and thus make sure its
> id is valid.
>
> But it would be overkill to use a dictionary to guarantee object id
> persistence just for the duration of an expression such as
> id(x.a) == id(y.a)

But id is not about persistence. The lack of persistence is one of its
key features.
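
When persistence really is wanted, the id-to-object dictionary
mentioned earlier can be sketched as follows.  The names registry, pin
and lookup are hypothetical, not an existing API:

```python
class Thing:
    """Hypothetical example class."""


registry = {}  # maps id(obj) -> obj; the dict's reference keeps obj alive


def pin(obj):
    """Record obj under its id so that the id stays valid."""
    registry[id(obj)] = obj
    return id(obj)


def lookup(ident):
    """Return the object registered under ident, or None."""
    return registry.get(ident)


t = Thing()
key = pin(t)
del t                           # the registry still holds the object

print(lookup(key) is not None)
print(id(lookup(key)) == key)
```

The id stays valid only because the dictionary holds a reference;
id() itself is unchanged.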

That said, I do think id()'s current behaviour is overkill.  I just
don't think we can change it in a way that will fit existing usage.
And cleaning it up properly is far too much work.

>> You can already do
>> this yourself in pure python, and it doesn't have the side-effect of
>> bloating id().
>
> My examples *are* in pure python ;-)

As is copy.py.  We've seen several examples on this thread where you
can build additional features on top of what id() gives you without
changing id().  So we've no need to break id() in any of the ways that
have been suggested here.
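
As a sketch of the copy.py technique: deepcopy keeps a memo keyed by
id() so that shared references stay shared in the copy.  The toy
function below (copy_nested_lists is a made-up name, and it only
handles lists) imitates that idea; it is not the code from copy.py:

```python
import copy

shared = [1, 2]
outer = [shared, shared]          # the same list appears twice

clone = copy.deepcopy(outer)
print(clone[0] is clone[1])       # sharing is preserved in the copy
print(clone[0] is shared)         # but the copy is a new object


def copy_nested_lists(obj, memo=None):
    """Toy deep copy for nested lists, memoised by id()."""
    if memo is None:
        memo = {}
    if not isinstance(obj, list):
        return obj                # treat everything else as atomic
    if id(obj) in memo:
        return memo[id(obj)]      # already copied: reuse that copy
    result = []
    memo[id(obj)] = result        # register before recursing (cycles)
    for item in obj:
        result.append(copy_nested_lists(item, memo))
    return result
```

Keying the memo by id() is safe here because every original stays
alive for the whole call, so no id can be reused in the meantime.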

>> Otherwise, such a suggestion should go through the usual process for
>> such a significant change to a language primitive.
>>
> Sure, but I only really want to understand the real (well, *intended* ;-)
> meaning of the id function, so I am putting forth illustrative examples
> to identify aspects of its current and possible behavior.

The notion of identity is important in any stateful language.
Referential equivalence, which is a slightly more complicated (yet
much better defined) idea, says that x and y are equivalent when no
operation can tell the difference between the two objects.  'is' is an
approximation of this that is at least accurate where mutation of
python objects is concerned.  In order for x to "be" y, assignments
like x.__class__ = Foo must have exactly the same effect as
y.__class__ = Foo.  You could presumably write a type in the
implementation language that was in no way discernible from the real
x, but if x is y, you *know* there is no difference.
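
A small sketch of the __class__ example, assuming two plain classes
with compatible layouts (Foo and Bar here are placeholders):

```python
class Foo:
    pass


class Bar:
    pass


x = Bar()
y = x                  # y is the very same object as x
z = Bar()              # looks the same, but is a distinct object

x.__class__ = Foo      # mutate through x

print(x is y, type(y) is Foo)   # the change is visible through y
print(x is z, type(z) is Bar)   # z is unaffected
```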

What id() does is attempt to distil 'the thing compared' when 'is'
is used.  On cpython, it just returns the integer value of the
pointer to the object, because on cpython that is cheap and does the
job (and hey, it *is* the thing compared when you do 'is' on cpython).
On pypy, things are slightly more complicated.  Pypy is written in
python, which has no concept of pointers.  It translates to the JVM
and the (safe) CLI, neither of which has a direct analogue of the
pointer.  And even when C or LLVM is used as the backend, objects may
move around in memory.  Having id() return different values after a
collection cycle would be very confusing.  So, pypy implements its
own, quite clever mechanism for creating ids.  It is described in a
blog post, if you'd like to read it.

The definition of id(), according to docs.python.org, is:

Return the “identity” of an object. This is an integer (or long
integer) which is guaranteed to be unique and constant for this object
during its lifetime. Two objects with non-overlapping lifetimes may
have the same id() value.
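
The guarantees in that definition can be checked with a short sketch.
The last line is deliberately left without an expected result, since
it depends on the implementation:

```python
a = object()
b = object()

# Both objects are alive here, so their ids must differ.
print(id(a) != id(b))

# The id is constant for as long as the object lives.
first = id(a)
junk = [object() for _ in range(1000)]   # unrelated allocations
print(id(a) == first)

# While both operands are alive, comparing ids agrees with 'is'.
print((a is b) == (id(a) == id(b)))

# Non-overlapping lifetimes: each temporary may be freed before the
# next one is created, so this may print True or False.
print(id(object()) == id(object()))
```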

> Also, a new id could live alongside the old ;-)

It's just that the problems you are attempting to fix are already
solved, and they are only vaguely related to what a python programmer
understands id() to mean.  If, according to cpython, "1003 is not 1000
+ 3", then programmers can't rely on any excellent new behaviour for
id() *anyway*.
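
To see this for yourself, a sketch that builds the two integers at run
time (so that no constant folding gets in the way):

```python
a = int("1003")
b = int("1000") + 3

print(a == b)            # equal values
print(a is b)            # implementation-dependent
print(id(a) == id(b))    # agrees with the 'is' result above
```

Only the first line has a result you can rely on across
implementations.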

OTOH, the "identity may not even be preserved for primitive types"
issue is an observable difference from cpython and is fixable, even if
it is a silly thing to rely on.

-- 
William Leslie