
Hi all, I've now spoken with several developers about the object identity that arises out of hte new dict strategies, and all seem to think that the current implementation breaks Python's semantics, that is: x = 3 z = {x: None} assert next(iter(z)) is x will fail. The only solution I see to this that maintains the correct semantics in all cases is to specialize is/id() for primitive types. That is to say, all integers of any given value have the same id() and x is y is true. Doing x is y is true is easy, you simply have a multimethod which dispatches on the types and compares intval for W_IntObjects (objects of differing types can never have the same identity) however the question is what to use for id() that is unique, never changes for an int, and doesn't intersect with any other live object. In particular the last constraint is the most difficult I think. Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire) "The people's good is the highest law." -- Cicero

On Fri, Jul 8, 2011 at 1:49 AM, Alex Gaynor <alex.gaynor@gmail.com> wrote:
Hi all, I've now spoken with several developers about the object identity that arises out of hte new dict strategies, and all seem to think that the current implementation breaks Python's semantics, that is: x = 3 z = {x: None} assert next(iter(z)) is x will fail. The only solution I see to this that maintains the correct semantics in all cases is to specialize is/id() for primitive types. That is to say, all integers of any given value have the same id() and x is y is true. Doing x is y is true is easy, you simply have a multimethod which dispatches on the types and compares intval for W_IntObjects (objects of differing types can never have the same identity) however the question is what to use for id() that is unique, never changes for an int, and doesn't intersect with any other live object. In particular the last constraint is the most difficult I think. Alex
x = 3 x is 3 True x = 1003 x is 1003 False
Stop relying on obscure details I think

On Fri, Jul 08, 2011 at 09:47 +0200, Maciej Fijalkowski wrote:
On Fri, Jul 8, 2011 at 1:49 AM, Alex Gaynor <alex.gaynor@gmail.com> wrote:
Hi all, I've now spoken with several developers about the object identity that arises out of hte new dict strategies, and all seem to think that the current implementation breaks Python's semantics, that is: x = 3 z = {x: None} assert next(iter(z)) is x will fail. The only solution I see to this that maintains the correct semantics in all cases is to specialize is/id() for primitive types. That is to say, all integers of any given value have the same id() and x is y is true. Doing x is y is true is easy, you simply have a multimethod which dispatches on the types and compares intval for W_IntObjects (objects of differing types can never have the same identity) however the question is what to use for id() that is unique, never changes for an int, and doesn't intersect with any other live object. In particular the last constraint is the most difficult I think. Alex
x = 3 x is 3 True x = 1003 x is 1003 False
I guess this is not what Alex means here. In CPython, PyPy (both 1.5 and trunk) you can do: >>> class A: pass >>> a=A() >>> next(iter({a: None})) is a True but not with ints on pypy-trunk >>> a=3 >>> next(iter({a: None})) is a False (it's true on pypy-1.5). IOW, i think the issue here is that iterating over keys of a dict usually gives the exact same ("is") objects in CPython whereas pypy trunk does not provide that at least for ints. best, holger
Stop relying on obscure details I think _______________________________________________ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev

On 8 July 2011 17:58, holger krekel <holger@merlinux.eu> wrote:
IOW, i think the issue here is that iterating over keys of a dict usually gives the exact same ("is") objects in CPython whereas pypy trunk does not provide that at least for ints.
I couldn't find anything precise in the official documentation on the meaning of 'is'. I think that the general understanding is that it makes no sense whatsoever on immutable objects (as in, it isn't guaranteed to do so). Consequently, a python implementation could also cache tuples. Re-using tuples might sound unusual, but there are special cases that start to sound reasonable, such as caching the empty tuple, or copy-propogating a tuple unpack & repack. The language spec is very light on what is allowed to be a 'different object', and what is *suggested* by cpython's int caching behaviour is that the behaviour of 'is' for language-provided immutable objects can't be relied upon in any way, shape or form. Pypy hasn't matched cpython's behaviour with ints here in a long time, so it obviously doesn't matter. On another note: what Alex talks about as being two different cases are just one with the small int optimisation - all references can be compared by value in the C backend with small ints enabled, if the object space doesn't provide alternative behaviour. -- William Leslie

Hi William, On Fri, Jul 8, 2011 at 10:31 AM, William ML Leslie <william.leslie.ttg@gmail.com> wrote:
On another note: what Alex talks about as being two different cases are just one with the small int optimisation - all references can be compared by value in the C backend with small ints enabled, if the object space doesn't provide alternative behaviour.
No, because e.g. of longs. You don't have the same issue if you use longs instead of ints in this particular example, but more generally the issue exists too. A first note is that it's impossible to satisfy all three of Alex's criteria in general: we would like id(x) to be a unique word-sized number determined only by the value of 'x', and different for every value of 'x' and for every object of a different type too; but if the possible 'x'es are all possible word-sized integers, then it's impossible to satisfy this, just because there are too many of them. The problem only gets "more impossible" if we include all long objects in 'x'. The problem is not new, it is just a bit more apparent. For example, already in pypy 1.5 we have:
a = A() d = a.__dict__ s = 'foobar' d[s] = 5 id(s) 163588812 id(d.keys()[0]) 163609508 id(d.keys()[0]) 163609520
I thought that there are also issues that would only show up with the JIT, because of the _immutable_ flag on W_IntObject, but it seems that I'm wrong. I can only say that Psyco has such issues, but nobody complained about them: lst = [] def f(x): for _ in range(3): pass # prevents inlining lst.append(x) def g(n): for i in range(n): f(i); f(i) With Psyco a call to g(5000) puts in 'lst' 10000 integer objects that are mostly all distinct objects, although there should in theory be at most 5000 distinct objects in there. (PyPy is safe so far because the call_assembler from g() to f() passes fully built W_IntObjects, instead of just cpu-level words; but that may change if in the future we add an optimization that knows how to generate a more efficient call_assembler.) I suppose that it's again a mixture of rules that are too vague and complains "but it works on CPython!" that are now being voiced just because the already-existing difference just became a bit more apparent. Sorry for the rant, I don't have an obvious solution :-) A bientôt, Armin.

2011/7/8 Armin Rigo <arigo@tunes.org>
Hi William,
On Fri, Jul 8, 2011 at 10:31 AM, William ML Leslie <william.leslie.ttg@gmail.com> wrote:
On another note: what Alex talks about as being two different cases are just one with the small int optimisation - all references can be compared by value in the C backend with small ints enabled, if the object space doesn't provide alternative behaviour.
No, because e.g. of longs. You don't have the same issue if you use longs instead of ints in this particular example, but more generally the issue exists too.
A first note is that it's impossible to satisfy all three of Alex's criteria in general: we would like id(x) to be a unique word-sized number determined only by the value of 'x', and different for every value of 'x' and for every object of a different type too; but if the possible 'x'es are all possible word-sized integers, then it's impossible to satisfy this, just because there are too many of them. The problem only gets "more impossible" if we include all long objects in 'x'.
The problem is not new, it is just a bit more apparent. For example, already in pypy 1.5 we have:
a = A() d = a.__dict__ s = 'foobar' d[s] = 5 id(s) 163588812 id(d.keys()[0]) 163609508 id(d.keys()[0]) 163609520
I thought that there are also issues that would only show up with the JIT, because of the _immutable_ flag on W_IntObject, but it seems that I'm wrong. I can only say that Psyco has such issues, but nobody complained about them:
lst = [] def f(x): for _ in range(3): pass # prevents inlining lst.append(x) def g(n): for i in range(n): f(i); f(i)
With Psyco a call to g(5000) puts in 'lst' 10000 integer objects that are mostly all distinct objects, although there should in theory be at most 5000 distinct objects in there. (PyPy is safe so far because the call_assembler from g() to f() passes fully built W_IntObjects, instead of just cpu-level words; but that may change if in the future we add an optimization that knows how to generate a more efficient call_assembler.)
I suppose that it's again a mixture of rules that are too vague and complains "but it works on CPython!" that are now being voiced just because the already-existing difference just became a bit more apparent. Sorry for the rant, I don't have an obvious solution :-)
A bientôt,
Armin.
Hi Armin I fully agree. It's not an issue, but an implementation-specific detail which programmers don't have to assume always true. CPython can be compiled without "smallints" (-5..256, if I remember correctly) caching. There's a #DEFINE that can be disabled, so EVERY int (or long) will be allocated, so using the is operator will return False most of the time (unless you are just copied exactly the same object). The same applies for 1 character strings, which are USUALLY cached by CPython. So, there must be care about using is. It's safe for some trivial objects (None, False, True, Ellipsis) and, I think, with user-defined classes' instances, but not for everything. Regards, Cesare

Hi all, On Fri, Jul 8, 2011 at 12:04 PM, Cesare Di Mauro <cesare.di.mauro@gmail.com> wrote:
So, there must be care about using is. It's safe for some trivial objects (None, False, True, Ellipsis) and, I think, with user-defined classes' instances, but not for everything.
The problem is more acute with id(). Some standard library modules like copy.py *need* a working id() for any object, including immutable ones, because CPython has no identity dict. After some discussion with Carl Friedrich, here is the best we could think of. Say that "a is b" is redefined as being True for two equal immutable objects of the same type. Then we want the following property to always hold: "a is b" if and only if "id(a) == id(b)". We can do that by having id(x) return either a regular integer or a long. Let's call for the rest of this discussion an "immutable object" an object of type int, long or float. If 'x' is not an immutable object, then id(x) can return the same value as it does now. If 'x' is an immutable object, then we compute from it a long value that does *not* fit in 32- or 64-bit, and that includes some tagging to make sure that immutable objects of different types have different id's. For example, id(7) would be (2**32 + 7<<3 + 1). Such a solution should make two common use cases of id() work: 1. as keys in a dictionary, to implement an identity dict, like in copy.py: in this case we take the id() of random objects including immutable ones, but only expect the result to work as keys in the dictionary. Getting arbitrarily-sized longs is not a problem here. 2. more contrived examples involve taking the id() of some instance, sending it around as a (word-sized) integer, and when getting it back retrieving the original instance from some dict. I don't expect people to do that with immutable objects, only their own custom instances. That's why it should be enough if id(x) returns a regular, 32- or 64-bit integer for *non-immutable* objects. A bientôt, Armin.

2011/7/8 Cesare Di Mauro <cesare.di.mauro@gmail.com>:
I fully agree. It's not an issue, but an implementation-specific detail which programmers don't have to assume always true.
CPython can be compiled without "smallints" (-5..256, if I remember correctly) caching. There's a #DEFINE that can be disabled, so EVERY int (or long) will be allocated, so using the is operator will return False most of the time (unless you are just copied exactly the same object).
The same applies for 1 character strings, which are USUALLY cached by CPython.
But the problem here is not object cache, but preservation of object identity, which is quite different. Python containers are supposed to keep the objects you put inside: myList.append(x) assert myList(-1) is x myDict[x] = 1 for key in myDict: if key is x: ... -- Amaury Forgeot d'Arc

On Fri, Jul 8, 2011 at 4:14 PM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
2011/7/8 Cesare Di Mauro <cesare.di.mauro@gmail.com>:
I fully agree. It's not an issue, but an implementation-specific detail which programmers don't have to assume always true.
CPython can be compiled without "smallints" (-5..256, if I remember correctly) caching. There's a #DEFINE that can be disabled, so EVERY int (or long) will be allocated, so using the is operator will return False most of the time (unless you are just copied exactly the same object).
The same applies for 1 character strings, which are USUALLY cached by CPython.
But the problem here is not object cache, but preservation of object identity, which is quite different. Python containers are supposed to keep the objects you put inside:
[citation needed] array.array does not for one
myList.append(x) assert myList(-1) is x
myDict[x] = 1 for key in myDict: if key is x: ...
also dict doesn't work if you overwrite the key: d = {1003: None} x = 1003 d[x] = None d.keys()[0] is x

On 02:17 pm, fijall@gmail.com wrote:
On Fri, Jul 8, 2011 at 4:14 PM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
2011/7/8 Cesare Di Mauro <cesare.di.mauro@gmail.com>:
I fully agree. It's not an issue, but an implementation-specific detail which programmers don't have to assume always true.
CPython can be compiled without "smallints" (-5..256, if I remember correctly) caching. There's a #DEFINE that can be disabled, so EVERY int (or long) will be allocated, so using the is operator will return False most of the time (unless you are just copied exactly the same object).
The same applies for 1 character strings, which are USUALLY cached by CPython.
But the problem here is not object cache, but preservation of object identity, which is quite different. Python containers are supposed to keep the objects you put inside:
[citation needed] array.array does not for one
Yes, and array.array is weird. :) It either exists as a memory optimization (ie, I don't want objects) or a way to directly lay out memory (to pass to a C API). Either way, you can't put arbitrary objects into it either - so it's already a little special, even if you disregard the fact that it doesn't preserve the identify the objects you can put into it. However, you're right. It exists, and it has this non-identity- preserving behavior. Is it a good thing, though? Or just an accident of how someone tried to let CPython be faster for some types of problems?
myList.append(x) assert myList(-1) is x
myDict[x] = 1 for key in myDict: � �if key is x: � � � �...
also dict doesn't work if you overwrite the key:
d = {1003: None} x = 1003 d[x] = None d.keys()[0] is x
This doesn't invalidate the original point, as far as I can tell. It just demonstrates again that you can have two instances of 1003. Whether dict guarantees to always use the new key or the old key when an update is made is a separate question. I think it would be better if object identity didn't depend on this mysterious quality of "immutability". The language is easier to understand (particularly for new programmers) if one can talk about objects and references without having to also explain that _some_ data types are represented using things that are sort of like objects but not quite (and worse if it depends on what types the JIT feels like playing with in any particular version of the interpreter). Jean-Paul

On 07/08/2011 04:14 PM, Amaury Forgeot d'Arc wrote:
2011/7/8 Cesare Di Mauro<cesare.di.mauro@gmail.com>:
I fully agree. It's not an issue, but an implementation-specific detail which programmers don't have to assume always true.
CPython can be compiled without "smallints" (-5..256, if I remember correctly) caching. There's a #DEFINE that can be disabled, so EVERY int (or long) will be allocated, so using the is operator will return False most of the time (unless you are just copied exactly the same object).
The same applies for 1 character strings, which are USUALLY cached by CPython.
But the problem here is not object cache, but preservation of object identity, which is quite different.
I think in the end id is the hard problem. Object identity can be fixed. We could say that "a is b" uses equality for primitive objects. However, then id needs to be fixed too, so that the following property holds: "a is b" if and only if "id(a) == id(b)" This is the harder part. Carl Friedrich

On 08/07/11 16:50, Carl Friedrich Bolz wrote:
I think in the end id is the hard problem. Object identity can be fixed. We could say that "a is b" uses equality for primitive objects.
what about starting to think about solutions only when we are sure that it's actually a problem? I don't expect any real world code to rely on this behavior (or, if it does, I'm up to call it broken :-))

Hi Anto, On Fri, Jul 8, 2011 at 5:10 PM, Antonio Cuni <anto.cuni@gmail.com> wrote:
what about starting to think about solutions only when we are sure that it's actually a problem?
I don't expect any real world code to rely on this behavior (or, if it does, I'm up to call it broken :-))
As I said in my previous e-mail, I think that e.g. copy.py relies on such behavior, and more generally any Python code that has to use id() to emulate an identity-dict --- as broken as that approach is, just because CPython thought that identity-dicts were unnecessary. It may not actually be a problem right now in copy.py, but still, I fear that it's a problem waiting to hurt us. A bientôt, Armin.

On 08/07/11 17:19, Armin Rigo wrote:
As I said in my previous e-mail, I think that e.g. copy.py relies on such behavior, and more generally any Python code that has to use id() to emulate an identity-dict --- as broken as that approach is, just because CPython thought that identity-dicts were unnecessary.
It may not actually be a problem right now in copy.py, but still, I fear that it's a problem waiting to hurt us.
I don't think that copy.py relies on this behavior (even after looking at the code, but I might wrong). My point is that we are not breaking id(); identity dicts emulated using id() continue to work as normal. The only assumption that we are breaking is this one, which has nothing to do with identity dicts or id(). d = {} d[x] = None assert d.keys()[0] is x As fijal points out, the semantics of this particular behavior is already half-broken, even in CPython. E.g., in the following example we never remove anything from the dictionary, nevertheless a key "disappears": x = 1003 d = {x: None} assert d.keys()[0] is x d[1000+3] = None assert d.keys()[0] is x # BOOM ciao, Anto

Hi Anto, On Fri, Jul 8, 2011 at 5:40 PM, Antonio Cuni <anto.cuni@gmail.com> wrote:
My point is that we are not breaking id(); identity dicts emulated using id() continue to work as normal.
Well, the argument goes like this: *if* we think that the problems like the one you describe are to be fixed, then we really need to hack at "is", and *then* we have a problem with id() as well, and we have to fix it as I described. You can also try to argue that nothing is broken so far, neither in "is" nor in id() nor your examples. I fear that we are going to end up seeing more and more cases where users rely on the current CPython behavior, particularly because we're going to expose such issues more and more over time as we add new optimizations. But I may be wrong and it may be enough to document it in cpython-differences.rst.
x = 1003 d = {x: None} assert d.keys()[0] is x d[1000+3] = None assert d.keys()[0] is x # BOOM
This is the wrong example: in this case, CPython guarantees that it will not crash. The semantics are not really half-broken, just not written out clearly: when you add an object to a dict, if there is another key already in there that compares equal, then the existing key is kept; it is not replaced by the new key. (Both CPython and PyPy agree to this rule.) A bientôt, Armin.

On 08/07/11 17:59, Armin Rigo wrote:
I fear that we are going to end up seeing more and more cases where users rely on the current CPython behavior, particularly because we're going to expose such issues more and more over time as we add new optimizations. But I may be wrong and it may be enough to document it in cpython-differences.rst.
yes, the whole point of my position is that I don't think there is many code around relying on this behavior. Thus, I propose to wait a bit and see how many people complain, before fixing what it might be a non-issue
x = 1003 d = {x: None} assert d.keys()[0] is x d[1000+3] = None assert d.keys()[0] is x # BOOM
This is the wrong example: in this case, CPython guarantees that it will not crash. The semantics are not really half-broken, just not written out clearly: when you add an object to a dict, if there is another key already in there that compares equal, then the existing key is kept; it is not replaced by the new key. (Both CPython and PyPy agree to this rule.)
ouch, I should write test for my emails :-)

On 8 July 2011 19:52, Armin Rigo <arigo@tunes.org> wrote:
Hi William,
Hi Armin, everybody,
On Fri, Jul 8, 2011 at 10:31 AM, William ML Leslie <william.leslie.ttg@gmail.com> wrote:
On another note: what Alex talks about as being two different cases are just one with the small int optimisation - all references can be compared by value in the C backend with small ints enabled, if the object space doesn't provide alternative behaviour.
No, because e.g. of longs. You don't have the same issue if you use longs instead of ints in this particular example, but more generally the issue exists too.
A first note is that it's impossible to satisfy all three of Alex's criteria in general: we would like id(x) to be a unique word-sized number determined only by the value of 'x', and different for every value of 'x' and for every object of a different type too; but if the possible 'x'es are all possible word-sized integers, then it's impossible to satisfy this, just because there are too many of them. The problem only gets "more impossible" if we include all long objects in 'x'.
That id(x) be a word-sized value uniquely determined by the value of x is impossible, yes, as the following program should demonstrate: while True: id([]) We create an infinite number of objects here, and if each one had to have a unique word-sized id, we'd exhaust that space pretty quickly. What id() does provide is that as long as there is a *reference* to the object, it returns a constant and unique integer (which we are assuming should be a word-sized integer). As later emails on this list suggested, it would be better if the semantics were relaxed to "the id should be preserved", meaning that placement of an integer or string into a container should allow it to have the same id on retrieval. New objects resulting from operations such as multiplication or string concatenation need not have the same id as other objects with the same value - if we did that for strings, it would have serious consequences for parallel code. What I was suggesting is that, since every live object must be encoded in an object reference somewhere, that object references should be at least good enough to suggest an id. My point about small integers wasn't really about word-sized ints, I was talking about the smallint optimisation, which as I understood it, boxes small app-level integers in the C backend. It does this by shifting integers right and bitwise-oring with 1; encoding the integer into the reference. "By definition", as Carl put it, there are never more objects represented in this way than we can fit in a reasonable, bounded id size. The suggestion then is to use the value of the object reference cast as a word-sized integer as the id of the object for integers so encoded, and calculate the id of other objects in the usual fashion. This happens to work for larger integers and floats, because the id will be preserved as long as a reference to them exists by their boxedness. Floats could use a similar mechanism to integers, eg, their bit representation shifted left two and bitwise-ored with 2. That does mean that id() is no longer word-sized, but it does not make it unbounded.
The problem is not new, it is just a bit more apparent. For example, already in pypy 1.5 we have:
a = A() d = a.__dict__ s = 'foobar' d[s] = 5 id(s) 163588812 id(d.keys()[0]) 163609508 id(d.keys()[0]) 163609520
What is keeping us from using the underlying rpython string as the basis for id? I guess there is nowhere obvious to store the id or the need to store the id in the gc copy phase. In that way, it'd make for an ugly special case. On 9 July 2011 00:07, Armin Rigo <arigo@tunes.org> wrote:
After some discussion with Carl Friedrich, here is the best we could think of. Say that "a is b" is redefined as being True for two equal immutable objects of the same type. Then we want the following property to always hold: "a is b" if and only if "id(a) == id(b)".
I would prefer to weaken this just a little: x is y iff id(x) == id(y) WHEN x and y are live for the duration of the equality. This counters cases such as: id([]) == id([]) which can certainly happen under any reasonable definition of id.
We can do that by having id(x) return either a regular integer or a long. Let's call for the rest of this discussion an "immutable object" an object of type int, long or float. If 'x' is not an immutable object, then id(x) can return the same value as it does now. If 'x' is an immutable object, then we compute from it a long value that does *not* fit in 32- or 64-bit, and that includes some tagging to make sure that immutable objects of different types have different id's. For example, id(7) would be (2**32 + 7<<3 + 1).
This makes the range of id() unbounded, where its domain is bounded. I don't feel comfortable with it, although I know that isn't much of an objection.
Such a solution should make two common use cases of id() work:
1. as keys in a dictionary, to implement an identity dict, like in copy.py: in this case we take the id() of random objects including immutable ones, but only expect the result to work as keys in the dictionary. Getting arbitrarily-sized longs is not a problem here.
Speaking of, maybe it'd be easier to attempt to get the identity dict into the language proper. William Leslie

Hi, On Sat, Jul 9, 2011 at 5:20 AM, William ML Leslie <william.leslie.ttg@gmail.com> wrote:
My point about small integers (...)
I think that your point about small integers is broken (even assuming that smallints are enabled by default, which is not the case). It means that we'd get an id() function which "behaves" as long as taking the id() of a 31-bit signed integer, and then doesn't behave as expected neither for full 32-bit integers, nor for longs, nor for floats.
This happens to work for larger integers and floats, because the id will be preserved as long as a reference to them exists by their boxedness. Floats could use a similar mechanism to integers, eg, their bit representation shifted left two and bitwise-ored with 2.
I don't understand these two sentences, because they seem to say the exact opposite of each other...
That does mean that id() is no longer word-sized, but it does not make it unbounded.
The "unbounded" part in my e-mail was about longs. Obviously if you are computing id(x) where x is in some finite set (like ints or floats), then the results are in some finite set too.
What is keeping us from using the underlying rpython string as the basis for id?
This is probably a good enough workaround for strings and unicodes.
Speaking of, maybe it'd be easier to attempt to get the identity dict into the language proper.
We tried at some point, but python-dev refused the idea. Maybe the idea has more chances for approval now that we can really show with performance numbers that it's a deep issue, as opposed to just wave our hands. Feel free to try again. In the meantime I've decided, at least myself, to stick with the approach that Python is whatever is in 2.7, and that we have to work around such issues instead of fixing them properly in the language. A bientôt, Armin.

On 07/09/2011 10:17 AM Armin Rigo wrote:
Hi,
On Sat, Jul 9, 2011 at 5:20 AM, William ML Leslie <william.leslie.ttg@gmail.com> wrote:
My point about small integers (...)
I think that your point about small integers is broken (even assuming that smallints are enabled by default, which is not the case). It means that we'd get an id() function which "behaves" as long as taking the id() of a 31-bit signed integer, and then doesn't behave as expected neither for full 32-bit integers, nor for longs, nor for floats.
This happens to work for larger integers and floats, because the id will be preserved as long as a reference to them exists by their boxedness. Floats could use a similar mechanism to integers, eg, their bit representation shifted left two and bitwise-ored with 2.
I don't understand these two sentences, because they seem to say the exact opposite of each other...
That does mean that id() is no longer word-sized, but it does not make it unbounded.
The "unbounded" part in my e-mail was about longs. Obviously if you are computing id(x) where x is in some finite set (like ints or floats), then the results are in some finite set too.
What is keeping us from using the underlying rpython string as the basis for id?
This is probably a good enough workaround for strings and unicodes.
Speaking of, maybe it'd be easier to attempt to get the identity dict into the language proper.
We tried at some point, but python-dev refused the idea. Maybe the idea has more chances for approval now that we can really show with performance numbers that it's a deep issue, as opposed to just wave our hands. Feel free to try again. In the meantime I've decided, at least myself, to stick with the approach that Python is whatever is in 2.7, and that we have to work around such issues instead of fixing them properly in the language.
Does that mean you have to follow the vagaries of the 2.7 compiler optimizations (or lack of them, as the case may be) that make for results like these?: (note fresh 2.7.2 build below ;-) [& BTW kudos to those who made it easy with the tarball and config, make] [10:30 ~]$ python Python 2.7.2 (default, Jul 8 2011, 23:38:53) [GCC 4.1.2] on linux2 Type "help", "copyright", "credits" or "license" for more information.
from ut.miscutil import disev,disex id(2000) == id(2000) True id(2000) == id(20*100) False disev('id(2000) == id(2000)') 1 0 LOAD_NAME 0 (id) 3 LOAD_CONST 0 (2000) 6 CALL_FUNCTION 1 9 LOAD_NAME 0 (id) 12 LOAD_CONST 0 (2000) 15 CALL_FUNCTION 1 18 COMPARE_OP 2 (==) 21 RETURN_VALUE disev('id(2000) == id(20*100)') 1 0 LOAD_NAME 0 (id) 3 LOAD_CONST 0 (2000) 6 CALL_FUNCTION 1 9 LOAD_NAME 0 (id) 12 LOAD_CONST 3 (2000) 15 CALL_FUNCTION 1 18 COMPARE_OP 2 (==) 21 RETURN_VALUE
Notice that 20*100 was folded to 2000, but a different constant was generated, hence, presumably, the different id. Will the next update of 2.7 do the same? BTW, disev above is just from a grabbag module of mine, amounting to def disev(s): import dis return dis.dis(compile(s,'disev','eval')) It doesn't have to be about integers:
id([]) == id([]) True id(list()) == id(list()) False disev('id([]) == id([])') 1 0 LOAD_NAME 0 (id) 3 BUILD_LIST 0 6 CALL_FUNCTION 1 9 LOAD_NAME 0 (id) 12 BUILD_LIST 0 15 CALL_FUNCTION 1 18 COMPARE_OP 2 (==) 21 RETURN_VALUE disev('id(list()) == id(list())') 1 0 LOAD_NAME 0 (id) 3 LOAD_NAME 1 (list) 6 CALL_FUNCTION 0 9 CALL_FUNCTION 1 12 LOAD_NAME 0 (id) 15 LOAD_NAME 1 (list) 18 CALL_FUNCTION 0 21 CALL_FUNCTION 1 24 COMPARE_OP 2 (==) 27 RETURN_VALUE
Of course, the is operator call will keep both references alive on the stack, so whereas you get
id([]) == id([]) True id([]), id([]) (3082823052L, 3082823052L)
Keeping the arguments alive simultaneously gives
(lambda x,y:(id(x),id(y)))([],[]) (3082823052L, 3082881516L) (lambda x,y:(id(x)==id(y)))([],[]) False
... assuming the lambda above approximates what `is' does. Using id comparisons instead of `is' can look superficially a bit weird:
id([1]) == id([2]) True
Bottom line: id does not seem to have a complete abstract definition[1][2], so how may one define a 'correct' implementation? ISTM id now (CPython 2.7.2) is more like a version-dependent debugger function that can be useful in app hacks, but with caveats ;-) Regards, Bengt Richter [1] http://docs.python.org/reference/datamodel.html#index-824 [2] http://docs.python.org/library/functions.html#id PS. How does the following cachingid.py compare to your idea of what id should ideally do? (not tested except example) (Obviously it is just a concept implementation, not resource efficient) ________________________________________ refval = id # for using old id def id(x, objcache = {}): if type(x) in (int,float,bool,tuple,str,unicode): # etc t = (type(x), x) return refval(objcache.setdefault(t, t)[1]) else: return refval(x) # as now? XXX what is abstract meaning of id(some_mutable)? ________________________________________ It tries to return the ordinary id of the first instance of equivalent immutable objects passed to it, and otherwise uses the old id, which I'm not finished thinking about ;-) At least it fixes the difference between id(2000) and id(20*100):
oldid=id from ut.cachingid import id # ut is just my utility collection oldid(2000)==oldid(2000) True oldid(2000)==oldid(20*100) False id(2000)==id(2000) True id(2000)==id(20*100) True id.func_defaults ({(<type 'int'>, 2000): (<type 'int'>, 2000)},)
A bientôt,
Armin.

Hi Bengt, On Sun, Jul 10, 2011 at 3:48 PM, Bengt Richter <bokr@oz.net> wrote:
>>> id([1]) == id([2]) True
As pointed out by Carl Friedrich, the real definition of "id" is: * if x and y are two variables, then "x is y" <=> "id(x) == id(y)". That's why in any Python implementation,
x=[1]; y=[2]; id(x) == id(y)
must return False, but not necessarily id([1]) == id([2]). A bientôt, Armin.

On 07/10/2011 04:09 PM Armin Rigo wrote:
Hi Bengt,
On Sun, Jul 10, 2011 at 3:48 PM, Bengt Richter<bokr@oz.net> wrote:
id([1]) == id([2]) True
True, I did write that. The key word in my line before that was `superficially' ;-)
As pointed out by Carl Friedrich, the real definition of "id" is:
* if x and y are two variables, then "x is y"<=> "id(x) == id(y)".
So conceptually, id is about "variables" -- not objects? Isn't a "variable" just a special case of an expression (yielding a reference to a result object, like other expressions)? ISTM that they key thing above is that "if x and y are two variables" the two (or one if same) referenced objects are guaranteed to be live at the same time. So they can't be in the same "location" unless they are the same object. Someone else mentioned this liveness issue too.
That's why in any Python implementation,
x=[1]; y=[2]; id(x) == id(y)
must return False, but not necessarily id([1]) == id([2]).
<nit>The definition with "two variables" seems unnecessarily restrictive, unless I am wrong that x=[1]; id(x) == id([2]) must also return false, since the x reference guarantees that the [2] cannot be created in the same "location" as the [1] held by the x.</nit> Something about returning location from id reminds me of the need for automatic closure creation in another context. Letting the expression result die and returning a kind of pointer to where the result object *was* seems like a dangling pointer problem, except I guess you can't dereference an id value (without hackery). Maybe id should raise an exception if the argument referenced only has a ref count of 1 (i.e., just the reference from the argument list)? Or else let id be a class and return a minimal instance only binding the passed object, and customize the compare ops to take into account type diffs etc.? Then there would be no id values without corresponding objects, and id values used in expressions would die a natural death, along with their references to their objects -- whether "variables" or expressions. Sorry to belabor the obvious ;-) Regards, Bengt
A bientôt,
Armin.

On 9 July 2011 18:17, Armin Rigo <arigo@tunes.org> wrote:
Hi,
On Sat, Jul 9, 2011 at 5:20 AM, William ML Leslie <william.leslie.ttg@gmail.com> wrote:
My point about small integers (...)
I think that your point about small integers is broken (even assuming that smallints are enabled by default, which is not the case). It means that we'd get an id() function which "behaves" as long as taking the id() of a 31-bit signed integer, and then doesn't behave as expected neither for full 32-bit integers, nor for longs, nor for floats.
...
I don't understand these two sentences, because they seem to say the exact opposite of each other...
For a non-smallint integer or float bound to x, "x is x" is tautological by virtue of x being represented by the same instance. There may be other objects with the same value, but I don't see why that must imply that they be the same object - why x == y must imply x is y for x and y of the same immutable type. It might make the identity dict in copy.py use slightly less memory, but it would make *much* more sense to optimise the specific use case according to the defined semantics of id(). copy.py is already effectively "broken" by the behaviour of non-cached ints on cpython; so copy.py is no excuse to break id() in pypy, which is a much more fundamental concept.
That does mean that id() is no longer word-sized, but it does not make it unbounded.
The "unbounded" part in my e-mail was about longs. Obviously if you are computing id(x) where x is in some finite set (like ints or floats), then the results are in some finite set too.
I'm not disagreeing with you there, but id has a *universally* finite range already, and imao, to change this is to make an annoying change to the language.
Speaking of, maybe it'd be easier to attempt to get the identity dict into the language proper.
We tried at some point, but python-dev refused the idea. Maybe the idea has more chances for approval now that we can really show with performance numbers that it's a deep issue, as opposed to just wave our hands. Feel free to try again. In the meantime I've decided, at least myself, to stick with the approach that Python is whatever is in 2.7, and that we have to work around such issues instead of fixing them properly in the language.
Ok. It would be little help to existing code. -- William Leslie

On 07/08/2011 10:31 AM William ML Leslie wrote:
On 8 July 2011 17:58, holger krekel<holger@merlinux.eu> wrote:
IOW, i think the issue here is that iterating over keys of a dict usually gives the exact same ("is") objects in CPython whereas pypy trunk does not provide that at least for ints.
I couldn't find anything precise in the official documentation on the meaning of 'is'. I think that the general understanding is that it makes no sense whatsoever on immutable objects (as in, it isn't guaranteed to do so).
Consequently, a python implementation could also cache tuples. Re-using tuples might sound unusual, but there are special cases that start to sound reasonable, such as caching the empty tuple, or copy-propogating a tuple unpack& repack. The language spec is very light on what is allowed to be a 'different object', and what is *suggested* by cpython's int caching behaviour is that the behaviour of 'is' for language-provided immutable objects can't be relied upon in any way, shape or form.
Pypy hasn't matched cpython's behaviour with ints here in a long time, so it obviously doesn't matter.
On another note: what Alex talks about as being two different cases are just one with the small int optimisation - all references can be compared by value in the C backend with small ints enabled, if the object space doesn't provide alternative behaviour.
python -c 'import this'|egrep 'rules|purity|hard' #;-)
participants (11)
-
Alex Gaynor
-
Amaury Forgeot d'Arc
-
Antonio Cuni
-
Armin Rigo
-
Bengt Richter
-
Carl Friedrich Bolz
-
Cesare Di Mauro
-
exarkun@twistedmatrix.com
-
holger krekel
-
Maciej Fijalkowski
-
William ML Leslie