Fighting the theoretical randomness of "is" on immutables
Hi all, In the context PyPy, we've recently seen again the issue of "x is y" not being well-defined on immutable constants. I've tried to summarize the issues and possible solutions in a mail to pypy-dev [1] and got some answers already. Having been convinced that the core is a language design issue, I'm asking for help from people on this list. (Feel free to cross-post.) [1] http://mail.python.org/pipermail/pypy-dev/2013-May/011299.html To summarize: the issue is a combination of various optimizations that work great otherwise. For example we can store integers directly in lists of integers, so when we read them back, we need to put them into fresh W_IntObjects (equivalent of PyIntObject). We solved temporarily the issue of "I'm getting an object which isn't ``is``-identical to the one I put in!" by making all equal integers ``is``-identical. This required hacking at ``id(x)`` as well to keep the requirement ``x is y <=> id(x)==id(y)``. This is getting annoying for strings, though -- how do you compute the id() of a long string? Give a unique long integer? And if we do the same for tuples, what about their id()? The long-term solution that seems the most stable to me would be to relax the requirement ``x is y <=> id(x)==id(y)``. If we can get away with only ``x is y <= id(x)==id(y)`` then it would allow us to implement ``is`` in a consistent way (e.g. two strings with equal content would always be ``is``-identical) while keeping id() reasonable (both in terms of complexity and of size of the resulting long number). Obviously ``x is y <=> id(x)==id(y)`` would still be true if any of ``x`` or ``y`` is not an immutable "by-value" built-in type. This is clearly a language design issue though. I can't really think of a use case that would break if we relax the requirement, but I might be wrong. It seems to me that at most some modules like pickle which use id()-keyed dictionaries will fail to find some otherwise-identical objects, but would still work (even if tuples are "relaxed" in this way, you can't have cycles with only tuples). A bientôt, Armin.
On 5/6/2013 4:46 AM, Armin Rigo wrote: 'is' *is* well-defined. In production code, the main use of 'is' is for builtin singletons, the bool doubleton, and object instances used as sentinals. The most common use, in particular, is 'if a is None:'. For such code, the result must be independent of implementation. For other immutable classes, for which 'is' is mostly irrelevant and useless, the result of some code is intentionally implementation dependent to allow optional optimizations. 'Implementation dependent' is differnt from 'random'. For such classes (int, tuple, set, string), the main use of 'is' is to test if the intended optimization is being done. In other words, for these classes, the implementation dependence is a feature. The general advice given to newbies by python-list regulars is to limit the use of 'is' with immutables to the first group of classes and never use it for the second.
In the context PyPy, we've recently seen again the issue of "x is y" not being well-defined on immutable constants.
Since immutable objects have a constant value by definition of immutable, I am not sure if you are trying to say anything more by adding the extra word.
I've tried to summarize the issues and possible solutions in a mail to pypy-dev [1] and got some answers already. Having been convinced that the core is a language design issue, I'm asking for help from people on this list. (Feel free to cross-post.)
[1] http://mail.python.org/pipermail/pypy-dev/2013-May/011299.html
To summarize: the issue is a combination of various optimizations that work great otherwise. For example we can store integers directly in lists of integers, so when we read them back, we need to put them into fresh W_IntObjects (equivalent of PyIntObject).
Interesting. I presume you only do this when the ints all fit in a machine int so that all require the same number of bytes so you can efficiently index and slice. This is sort of what strings do with characters, except for there being no char class. The similarity is that if you concatenate a string to another string and then slice it back out, you generally get a different object, but may get the same object if some optimization has that effect. For instance, in current CPython, s is ''+s is s+''. The details depend on the CPython version.
We solved temporarily the issue of "I'm getting an object which isn't ``is``-identical to the one I put in!"
Does the definition of list operations guarantee preservation of object identify? After 'somelist.append(a)', must 'somelist.pop() is a' be true? I am not sure. For immutables, it could be an issue if someone stores the id. But I don't know why someone would do that for an int. As I already said, we routinely tell people on python-list (c.l.p) that they shouldn't care about ids of ints.. The identity of an int cannot (and should not) affect the result of numerical calculation.
by making all equal integers ``is``-identical.
Which changes the definition of 'is', or rather, makes the definition implementation dependent.
This required hacking at ``id(x)`` as well to keep the requirement ``x is y <=> id(x)==id(y)``. This is getting annoying for strings, though -- how do you compute the id() of a long string? Give a unique long integer? And if we do the same for tuples, what about their id()?
The solution to the annoyance is to not do this ;-). More seriously, are you planning to unbox strings or tuples?
The long-term solution that seems the most stable to me would be to relax the requirement ``x is y <=> id(x)==id(y)``.
I see this as a definition, not a requirement. Changing the definition would break any use that depends on the definition being what it is. -- Terry Jan Reedy
On Mon, May 6, 2013 at 6:46 PM, Armin Rigo <arigo@tunes.org> wrote:
This is clearly a language design issue though. I can't really think of a use case that would break if we relax the requirement, but I might be wrong. It seems to me that at most some modules like pickle which use id()-keyed dictionaries will fail to find some otherwise-identical objects, but would still work (even if tuples are "relaxed" in this way, you can't have cycles with only tuples).
IIRC, Jython just delays calculating the object id() until it is called, and lives with it potentially being incredibly expensive to calculate. Is there some way PyPy can run with a model where "is" is defined in terms of values for immutable objects, with a lazily populated mapping from values to numeric ids if you're forced to define them through an explicit call to id()? We're not going to change the language design because people don't understand the difference between "is" and "==" and then wrongly blame PyPy for breaking their code. If you're tired of explaining to people that it's their code which is buggy rather than PyPy, then your Solution 2 (mimic'ing CPython's caching) is likely your best bet. Alternatively, we've offered to add CompatibilityWarning to CPython in the past (there may even be a preliminary patch for it on the tracker). That offer is still open, and would be applicable to this case. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Le Mon, 6 May 2013 23:18:54 +1000, Nick Coghlan <ncoghlan@gmail.com> a écrit :
IIRC, Jython just delays calculating the object id() until it is called, and lives with it potentially being incredibly expensive to calculate. Is there some way PyPy can run with a model where "is" is defined in terms of values for immutable objects, with a lazily populated mapping from values to numeric ids if you're forced to define them through an explicit call to id()?
This sounds reasonable. Actually, for small ints, id() could simply be a tagged pointer (e.g. "1 + 2 * myint.value").
We're not going to change the language design because people don't understand the difference between "is" and "==" and then wrongly blame PyPy for breaking their code.
Well, if I'm doing: mylist = [x] and ``mylist[0] is x`` returns False, then I pretty much consider the Python implementation to be broken, not my code :-) Regards Antoine.
On Mon, May 6, 2013 at 11:26 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le Mon, 6 May 2013 23:18:54 +1000, Nick Coghlan <ncoghlan@gmail.com> a écrit :
We're not going to change the language design because people don't understand the difference between "is" and "==" and then wrongly blame PyPy for breaking their code.
Well, if I'm doing:
mylist = [x]
and ``mylist[0] is x`` returns False, then I pretty much consider the Python implementation to be broken, not my code :-)
Yeah, that's a rather good point - I briefly forgot that the trigger here was PyPy's specialised single type containers. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 5/6/2013 10:20 AM, Nick Coghlan wrote:
On Mon, May 6, 2013 at 11:26 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le Mon, 6 May 2013 23:18:54 +1000, Nick Coghlan <ncoghlan@gmail.com> a écrit :
We're not going to change the language design because people don't understand the difference between "is" and "=="
For sure. The definition "The operators is and is not test for object identity: x is y is true if and only if x and y are the same object. x is not y yields the inverse truth value. [4]" is clear enough as far as it goes. But perhaps it should be said that whether or not x and y *are* the same object, in a particular situation, may depend on the implementation. The footnote [4] "Due to automatic garbage-collection, free lists, and the dynamic nature of descriptors, you may notice seemingly unusual behaviour in certain uses of the is operator, like those involving comparisons between instance methods, or constants." tells only part of the story, and the less common part at that.
and then wrongly blame PyPy for breaking their code.
The language definition intentionally leaves 'isness' implementation defined for number and string operations in order to allow but not require optimizations. Preserving isness when mixing numbers and strings with mutable collections is a different issue.
Well, if I'm doing:
mylist = [x]
and ``mylist[0] is x`` returns False, then I pretty much consider the Python implementation to be broken, not my code :-)
If x were constrained to be an int, the comparison would not make much sense, but part of the essential nature of lists is that x could be literally any object. So unless False were a documented possibility, I might be inclined to agree with you, based on CPython precedent. The situation *is* different with type-limited arrays.
from array import array x = 1001 myray = array('i', [x]) myray[0] is x False
I think the possibility of False is implicit in "an object type which can compactly represent an array of basic values". The later phrase "the type of objects stored in them is constrained" is incorrectly worded because arrays store constrained *values*, not *objects* or even object references as lists do.
Yeah, that's a rather good point - I briefly forgot that the trigger here was PyPy's specialised single type containers.
Does implicitly replacing or implementing a list with something that is internally more like Cpython arrays than Cpython lists (as I understand what pypy is doing) violates the language spec? I re-read the doc and I am not sure. Sequences are sequences of 'items'. For example: "s[i] ith item of s, origin 0" 'Items' are not defined, but pragmatically, they can be defined either by value or identity Containment is defined in terms of equality, which itself can be defined in terms of either value or identity. For strings and ranges, the 'items' are values, not objects. They also are for bytes even though identity is recovered when objects for all possible byte values are pre-cached, as in CPython. 'Item' is necessarily left vague for mutable sequences as bytearrays also store values. The fact that Antoine's example 'works' for bytearrays is an artifact of the caching, not a language-mandated necessity.
b = bytearray() b.append(98) b[0] is 98 True
The definition for lists does not narrow 'item' either. "Lists are mutable sequences, typically used to store collections of homogeneous items (where the precise degree of similarity will vary by application)." Antoine's opinion would be more supportable if 'item' were replaced by 'object'. Guido's notion of 'homogenous' could be interpreted as supporting specialized 'lists'. On the other hand, I think explicit import, as with the array module and numarray package, is a better idea. This is especially true if an implementation intends to be a drop-in replacement for CPython. It seems to me that Armin's pain comes from trying to be both different and compatible at the same time. -- Terry Jan Reedy
On Mon, 06 May 2013 18:23:02 -0400 Terry Jan Reedy <tjreedy@udel.edu> wrote:
'Item' is necessarily left vague for mutable sequences as bytearrays also store values. The fact that Antoine's example 'works' for bytearrays is an artifact of the caching, not a language-mandated necessity.
No, it isn't. You are mixing up values and references. A bytearray or a array.array may indeed store values, but a list stores references to objects. I'm pretty sure that not respecting identity of objects stored in general-purpose containers would break a *lot* of code out there. Regards Antoine.
On 5/6/2013 6:34 PM, Antoine Pitrou wrote:
On Mon, 06 May 2013 18:23:02 -0400 Terry Jan Reedy <tjreedy@udel.edu> wrote:
'Item' is necessarily left vague for mutable sequences as bytearrays also store values. The fact that Antoine's example 'works' for bytearrays is an artifact of the caching, not a language-mandated necessity.
No, it isn't.
Yes it is. Look again at the array example.
from array import array x = 1001 myray = array('i', [x]) myray[0] is x False
Change 1001 to a cached int value such as 98 and the result is True instead of False. For the equivalent bytearray example
b = bytearray() b.append(98) b[0] is 98 True
the result is always True *because*, and only because, all byte value are (now) cached. I believe the test for that is marked as CPython-specific.
You are mixing up values and references.
No I am not. My whole post was about being careful to not to confuse the two. I noted, however, that the Python *docs* use 'item' to mean either or both. If you do not like the *doc* being unclear, clarify it.
A bytearray or a array.array may indeed store values, but a list stores references to objects.
I said exactly that in reference to CPython. As far as I know, the same is true of lists in every other implementation up until Pypy decided to optimize that away. What I also said is that I cannot read the *current* doc as guaranteeing that characterization. The reason is that the members of sequences, mutable sequences, and lists are all described as 'items'. In the first two cases, 'item' means 'value or object reference'. I see nothing in the doc to force a reader to change or particularized the meaning of 'item' in the third case. If I missed something *in the specification*, please point it out to me.
I'm pretty sure that not respecting identity of objects stored in general-purpose containers would break a *lot* of code out there.
Me too. Hence I suggested that if lists, etc, are intended to respect identity, with 'is' as currently defined, in any implementation, then the docs should say so and end the discussion. I would be happy to commit an approved patch, but I am not in a position to decide the substantive content. Hence, I tried to provide a neutral analysis that avoided confusing the CPython implementation with the Python specification. In my final paragraph, however, I did suggest that Pypy respect precedent, to avoid breaking existing code and expectations, and call their mutable sequences something other than 'list'. -- Terry Jan Reedy
On Mon, May 6, 2013 at 7:50 PM, Terry Jan Reedy <tjreedy@udel.edu> wrote:
On 5/6/2013 6:34 PM, Antoine Pitrou wrote:
On Mon, 06 May 2013 18:23:02 -0400 Terry Jan Reedy <tjreedy@udel.edu> wrote:
'Item' is necessarily left vague for mutable sequences as bytearrays also store values. The fact that Antoine's example 'works' for bytearrays is an artifact of the caching, not a language-mandated necessity.
No, it isn't.
Yes it is. Look again at the array example.
from array import array x = 1001 myray = array('i', [x]) myray[0] is x False
Change 1001 to a cached int value such as 98 and the result is True instead of False. For the equivalent bytearray example
b = bytearray() b.append(98) b[0] is 98 True
the result is always True *because*, and only because, all byte value are (now) cached. I believe the test for that is marked as CPython-specific.
You are mixing up values and references.
No I am not. My whole post was about being careful to not to confuse the two. I noted, however, that the Python *docs* use 'item' to mean either or both. If you do not like the *doc* being unclear, clarify it.
A bytearray or a array.array may indeed store values, but a list stores
references to objects.
I said exactly that in reference to CPython. As far as I know, the same is true of lists in every other implementation up until Pypy decided to optimize that away. What I also said is that I cannot read the *current* doc as guaranteeing that characterization. The reason is that the members of sequences, mutable sequences, and lists are all described as 'items'. In the first two cases, 'item' means 'value or object reference'. I see nothing in the doc to force a reader to change or particularized the meaning of 'item' in the third case. If I missed something *in the specification*, please point it out to me.
I'm pretty sure that not respecting identity of objects stored in
general-purpose containers would break a *lot* of code out there.
Me too. Hence I suggested that if lists, etc, are intended to respect identity, with 'is' as currently defined, in any implementation, then the docs should say so and end the discussion. I would be happy to commit an approved patch, but I am not in a position to decide the substantive content. Hence, I tried to provide a neutral analysis that avoided confusing the CPython implementation with the Python specification.
In my final paragraph, however, I did suggest that Pypy respect precedent, to avoid breaking existing code and expectations, and call their mutable sequences something other than 'list'.
Wouldn't the entire point of such things existing in pypy be that the implementation is irrelevant to the user and used behind the scenes automatically in the common case when a container is determined to fit the special constraint? I personally do not think we should guarantee that "mylist[0] = x; assert x is mylist[0]" succeeds when x is an immutable type other than None. If something is immutable and not intended to be a singleton and does not define equality (like None or sentinel values commonly tested using is such as arbitrary object() instances) it needs to be up to the language VM to determine when to copy or not in most situations. You already gave the example of the interned small integers in CPython. String constants and names used in code are also interned in today's CPython implementation. This doesn't tend to trip any real code up. -gps
On Mon, 06 May 2013 22:50:55 -0400 Terry Jan Reedy <tjreedy@udel.edu> wrote:
A bytearray or a array.array may indeed store values, but a list stores references to objects.
I said exactly that in reference to CPython. As far as I know, the same is true of lists in every other implementation up until Pypy decided to optimize that away. What I also said is that I cannot read the *current* doc as guaranteeing that characterization.
In the absence of more precise specification, the reference is IMO the reference interpreter, a.k.a. CPython, and its behaviour is more than well-known and stable over time here.
I'm pretty sure that not respecting identity of objects stored in general-purpose containers would break a *lot* of code out there.
Me too. Hence I suggested that if lists, etc, are intended to respect identity, with 'is' as currently defined, in any implementation, then the docs should say so and end the discussion. I would be happy to commit an approved patch, but I am not in a position to decide the substantive content.
For me, a patch that mandated general-purpose containers (list, dict, etc.) respect object identity would be ok. Regards Antoine.
Hi Antoine, On Tue, May 7, 2013 at 8:25 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
For me, a patch that mandated general-purpose containers (list, dict, etc.) respect object identity would be ok.
Thanks, that's also my opinion. In PyPy's approach, in trying to emulate CPython vs. trying to convince users that "is" is sometimes a bad idea, we might eventually end up at the extreme side, which can be seen as where CPython would be if it cached *all* ints, longs, floats, complexes, strings, unicodes and tuples. The original question in this thread was about if it's ok for two objects x and y to satisfy "x is y" while at the same time "id(x) != id(y)". I think by now that it would only create more confusion (even if only in some very special cases). We'll continue to maintain the invariant then, and if it requires creating extremely large values for id(), too bad. (1) A bientôt, Armin. (1) the Jython approach of caching the id's is not applicable here: the objects whose id are hard to get are precisely those that don't have a long-living representation as object in memory. You can't cache an id with a key that is, say, a double-word "long" --- if this double-word is not an object, but merely a value, it can't be used as key in a weakdict. You don't have a way of knowing when you can remove it from the cache.
On Tue, May 7, 2013 at 6:27 PM, Armin Rigo <arigo@tunes.org> wrote:
Hi Antoine,
On Tue, May 7, 2013 at 8:25 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
For me, a patch that mandated general-purpose containers (list, dict, etc.) respect object identity would be ok.
Thanks, that's also my opinion.
In PyPy's approach, in trying to emulate CPython vs. trying to convince users that "is" is sometimes a bad idea, we might eventually end up at the extreme side, which can be seen as where CPython would be if it cached *all* ints, longs, floats, complexes, strings, unicodes and tuples.
The original question in this thread was about if it's ok for two objects x and y to satisfy "x is y" while at the same time "id(x) != id(y)". I think by now that it would only create more confusion (even if only in some very special cases). We'll continue to maintain the invariant then, and if it requires creating extremely large values for id(), too bad. (1)
Yeah, I've been trying to come up with a way to phrase the end result that doesn't make my brain hurt, but I've mostly failed. The details below are the closest I've come to something that makes sense to me. With equality, the concepts of hashing and value have a clear relationship: x == y implies hash(x) == hash(y), but there's no implication going in the other direction. Even if the hashes are the same, the values may be different (you can have hash(x) == hash(y) without having x == y). NaN's aside, you also have the relationship that x is y implies x == y (and the standard containers assume this). Again, there's no implication in the other direction. Two objects may be equal, while having different identities (such as 0 == 0.0 == 0j) The definition of object identity is that x is y implies id(x) == id(y) *and* vice-versa. The suggested change would actually involving defining a *new* language concept, a "reference id", where ref_id(x) == ref_id(y) implies x is y, but the reverse is not true. Thus, this is actually a suggestion for two changes rolled into one: 1. Add the "reference id" concept 2. Change the id() builtin to return the reference id rather than the object id I think the array.array solution is a more tolerable one: provide explicit value based containers that are known not to be identity preserving. If you want maximum speed, you have to be prepared to deal with the difference in semantics. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Mon, May 6, 2013 at 4:46 AM, Armin Rigo <arigo@tunes.org> wrote:
This is clearly a language design issue though. I can't really think of a use case that would break if we relax the requirement, but I might be wrong. It seems to me that at most some modules like pickle which use id()-keyed dictionaries will fail to find some otherwise-identical objects, but would still work (even if tuples are "relaxed" in this way, you can't have cycles with only tuples).
I don't know if I've precisely understood the change you're proposing, but I do know that in PEAK-Rules I use id() as an approximation for "is" in order to build indexes of various "parameter is some_object" conditions, for various "some_objects" and a given parameter. The rule engine takes id(parameter) at call time and then looks it up to obtain a subset of applicable rules. IIUC, this would require that either "x is y" equates to "id(x)==id(y)", or else that there be some way to determine in advance all the possible id(y)s that are now or would ever be "is x", so they can be placed in the index. Otherwise, any use of an "is" condition would require a linear search of the possibilities, as you could not rule out the possibility that a given x was "is" to *some* y already in the index. Of course, rules using "is" tend to be few and far between, outside of some special cases, and their use with simple integers and strings would be downright silly. And on top of that, I'm not even sure whether the "a <= b" notation you used was meant to signify "a implies b" or "b implies a". ;-) But since you mentioned id()-keyed dictionaries and this is another use of them that I know of, I figured I should at least throw it out there for information's sake, regardless of which side of the issue it lands on. ;-)
participants (6)
-
Antoine Pitrou
-
Armin Rigo
-
Gregory P. Smith
-
Nick Coghlan
-
PJ Eby
-
Terry Jan Reedy