[pypy-dev] Object identity and dict strategies

Sat Jul 9 05:20:53 CEST 2011

On 8 July 2011 19:52, Armin Rigo <arigo at tunes.org> wrote:
> Hi William,

Hi Armin, everybody,

> On Fri, Jul 8, 2011 at 10:31 AM, William ML Leslie
> <william.leslie.ttg at gmail.com> wrote:
>> On another note: what Alex talks about as being two different cases
>> are just one with the small int optimisation - all references can be
>> compared by value in the C backend with small ints enabled, if the
>> object space doesn't provide alternative behaviour.
>
> No, because e.g. of longs.  You don't have the same issue if you use
> longs instead of ints in this particular example, but more generally
> the issue exists too.
>
> A first note is that it's impossible to satisfy all three of Alex's
> criteria in general: we would like id(x) to be a unique word-sized
> number determined only by the value of 'x', and different for every
> value of 'x' and for every object of a different type too; but if the
> possible 'x'es are all possible word-sized integers, then it's
> impossible to satisfy this, just because there are too many of them.
> The problem only gets "more impossible" if we include all long objects
> in 'x'.

That id(x) be a word-sized value uniquely determined by the value of x
is impossible, yes, as the following program should demonstrate:

while True:
    id([])

We create an infinite number of objects here, and if each one had to
have a unique word-sized id, we'd exhaust that space pretty quickly.

What id() does provide is that as long as there is a *reference* to
the object, it returns a constant and unique integer (which we are
assuming should be a word-sized integer).

As later emails on this list suggested, it would be better if the
semantics were relaxed to "the id should be preserved", meaning that
placement of an integer or string into a container should allow it to
have the same id on retrieval.  New objects resulting from operations
such as multiplication or string concatenation need not have the same
id as other objects with the same value - if we did that for strings,
it would have serious consequences for parallel code.

What I was suggesting is that, since every live object must be encoded
in an object reference somewhere, that object references should be at
least good enough to suggest an id.  My point about small integers
wasn't really about word-sized ints, I was talking about the smallint
optimisation, which as I understood it, boxes small app-level integers
in the C backend.  It does this by shifting integers right and
bitwise-oring with 1; encoding the integer into the reference.  "By
definition", as Carl put it, there are never more objects represented
in this way than we can fit in a reasonable, bounded id size.

The suggestion then is to use the value of the object reference cast
as a word-sized integer as the id of the object for integers so
encoded, and calculate the id of other objects in the usual fashion.
This happens to work for larger integers and floats, because the id
will be preserved as long as a reference to them exists by their
boxedness.

Floats could use a similar mechanism to integers, eg, their bit
representation shifted left two and bitwise-ored with 2.  That does
mean that id() is no longer word-sized, but it does not make it
unbounded.

> The problem is not new, it is just a bit more apparent.  For example,
> already in pypy 1.5 we have:
>
>>>>> a = A()
>>>>> d = a.__dict__
>>>>> s = 'foobar'
>>>>> d[s] = 5
>>>>> id(s)
> 163588812
>>>>> id(d.keys()[0])
> 163609508
>>>>> id(d.keys()[0])
> 163609520

What is keeping us from using the underlying rpython string as the
basis for id?  I guess there is nowhere obvious to store the id or the
need to store the id in the gc copy phase.  In that way, it'd make for
an ugly special case.

On 9 July 2011 00:07, Armin Rigo <arigo at tunes.org> wrote:
> After some discussion with Carl Friedrich, here is the best we could
> think of.  Say that "a is b" is redefined as being True for two equal
> immutable objects of the same type.  Then we want the following
> property to always hold: "a is b" if and only if "id(a) == id(b)".

I would prefer to weaken this just a little:

x is y iff id(x) == id(y) WHEN x and y are live for the duration of
the equality.  This counters cases such as:

id([]) == id([])

which can certainly happen under any reasonable definition of id.

>  We
> can do that by having id(x) return either a regular integer or a long.
>  Let's call for the rest of this discussion an "immutable object" an
> object of type int, long or float.  If 'x' is not an immutable object,
> then id(x) can return the same value as it does now.  If 'x' is an
> immutable object, then we compute from it a long value that does *not*
> fit in 32- or 64-bit, and that includes some tagging to make sure that
> immutable objects of different types have different id's.  For
> example, id(7) would be (2**32 + 7<<3 + 1).

This makes the range of id() unbounded, where its domain is bounded.
I don't feel comfortable with it, although I know that isn't much of an
objection.

> Such a solution should make two common use cases of id() work:
>
> 1. as keys in a dictionary, to implement an identity dict, like in
> copy.py: in this case we take the id() of random objects including
> immutable ones, but only expect the result to work as keys in the
> dictionary.  Getting arbitrarily-sized longs is not a problem here.

Speaking of, maybe it'd be easier to attempt to get the identity dict
into the language proper.

William Leslie