x is y <=> id(x)==id(y) - pypy-dev

Armin Rigo

5 May 2013

3:59 a.m.

Hi all,

I'm just wondering again about some "bug" reports that are not bugs, about people misusing "is" to compare two immutable objects. The current situation in PyPy is that "is" works like "==" for ints, longs, floats or complexes. It does not for strs or unicodes or tuples. Now of course someone on python-dev was (indirectly) complaining that you can compare in CPython ``x is ' '``, which works because single-character strings are cached, but not in PyPy. I'm sure someone else has been bitten by writing in CPython ``x is ()``, which is also cached there.

(Fwiw I think that there is a design flaw somewhere in Python, to allow "1 is 1" to be executed without any error but also without any well-defined result...)

Can we fix it once and for all? It's annoying because of id: if we want ``x is y`` for equal huge strings x and y, but still want ``id(x)==id(y)``, then we have to compute ``id(some_string)`` in a rather slow way, producing a huge number. The same for tuples: if we always want ``(1, 2) is (1, 2)`` then we need to compute ``id(some_tuple)`` recursively, which can also lead to huge numbers. In fact such a definition can explode the memory: ``a = (); for i in range(100): a = (a, a); id(a)`` would likely need a 2**100-digits number.

Solution 2 would be to add these hacks specially for cases that CPython caches: I think by now we're only missing empty or single-char strings or unicodes, and empty tuple.

Solution 3 would be to drop half of the rule, keeping only ``id(x)==id(y) => x is y``. This would be the easiest, as we could remove the complicated computations already done for longs or floats or complexes. We'd clearly document it as a difference from CPython. The question is what kind of code might break if we drop the case ``x is y => id(x)==id(y)``.

A bientôt,

Armin.
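To make the memory-explosion argument concrete, here is a rough sketch. It does not compute any real id and is not PyPy code. `expanded_size` is a hypothetical helper that counts how many nodes the nested tuple would have if it were written out as a tree with no sharing, which is roughly the amount of information an id derived purely from the contents would have to encode. It assumes the structure contains only tuples.

```python
def expanded_size(obj, _memo=None):
    """Count the nodes of a nested tuple as if it were a tree without sharing.

    Assumes `obj` is built only from tuples. Results are memoized per object,
    so the count itself stays cheap even though the number it returns is huge.
    """
    if _memo is None:
        _memo = {}
    key = id(obj)
    if key not in _memo:
        _memo[key] = 1 + sum(expanded_size(item, _memo) for item in obj)
    return _memo[key]


# The construction from the message: 100 levels, each pairing the previous one.
a = ()
for i in range(100):
    a = (a, a)

size = expanded_size(a)  # 2**101 - 1 nodes, although only 101 tuples exist
```

Only about a hundred tuple objects exist in memory, but a value-based id that ignores the sharing would have to describe on the order of 2**100 nodes.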

Steven D'Aprano

5 May

5:20 a.m.

On 05/05/13 19:59, Armin Rigo wrote:

...

Hi all,

I'm just wondering again about some "bug" reports that are not bugs, about people misusing "is" to compare two immutable objects. The current situation in PyPy is that "is" works like "==" for ints, longs, floats or complexes. It does not for strs or unicodes or tuples.

I don't understand why immutability comes into this. The `is` operator is supposed to test whether the two operands are the same object, nothing more, nothing less. Immutable, or mutable, it makes no difference.

Now, it may be that *some* immutable objects may (implicitly, or explicitly) promise that you will never have two objects with the same value. For example, float might cache every object created, so that once you have created a float 23.45910234718, it will *always* be reused whenever a float with that value is needed. That would be allowed.

But if float does not cache the value, and so you have two different float objects, with different IDs, then it is absolutely wrong for PyPy to treat `is` as == instead of testing object identity.

Have I misunderstood what you are saying?

-- 
Steven

Maciej Fijalkowski

11:35 a.m.

Immutability is important because you can't cache mutable objects. It's true what you're saying, but we consistently see bug reports about people comparing ints or strings with is and complaining that they work fine on cpython, but not on pypy.

Also, you expect to have the same identity if you store stuff in the list and then read out of it - which is impossible if you don't actually have any objects in the list, just store unwrapped ones.

Cheers,
fijal

Steven D'Aprano

1:16 p.m.

On 06/05/13 03:35, Maciej Fijalkowski wrote:

Immutability is important because you can't cache mutable objects.

Yes, I know that :-) but that has nothing to do with the behaviour of `is`.

...

It's true what you're saying, but we consistently see bug reports about people comparing ints or strings with is and complaining that they work fine on cpython, but not on pypy.

Then their code is buggy, not PyPy. But you know that :-)

I don't believe that PyPy should take extraordinary effort to protect people from the consequences of writing buggy code.

But putting that aside, I would expect that:

    x is y <=> id(x) == id(y)

The docs say:

"The operators is and is not test for object identity: x is y is true if and only if x and y are the same object. x is not y yields the inverse truth value."

http://docs.python.org/2/reference/expressions.html#index-68

and

"id(object)
Return the “identity” of an object. This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value."

http://docs.python.org/2/library/functions.html#id

So each object has a single, unique, constant ID during its lifetime. So if id(x) == id(y) and x and y overlap in their lifetime, that implies that x and y are the same object. Likewise, if x and y are the same object, that implies that they have the same ID.

...

Also, you expect to have the same identity if you store stuff in the list and then read out of it - which is impossible if you don't actually have any objects in the list, just store unwrapped ones.

Ah, now that is an interesting question! My lack of experience with PyPy is going to show now. I take it that PyPy might optimize away the objects inside a list, storing only unboxed values?

This is a really hard question. If I do this:

    a = b = X  # regardless of what X is
    mylist = [a, None]
    assert mylist[0] is a
    assert mylist[0] is b

both assertions must pass, no matter what X is, whether mutable or immutable. But if the values in mylist get unwrapped, then you would have to reconstruct the object identities, and I imagine that this would be painful. But it would be a shame to give up the opportunity for optimizations that unboxing could give.

Have I understood the nature of your problem correctly?

-- 
Steven
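The two expectations in this message (the documented equivalence between ``is`` and ``id()``, and identity surviving a round trip through a list) can be written as small checks. This is only a sketch: the helper names `identity_matches_id` and `survives_list_round_trip` and the sample values are made up for illustration, and the checks say nothing about how any interpreter implements them internally.

```python
def identity_matches_id(x, y):
    """True when `x is y` and `id(x) == id(y)` agree for two live objects."""
    return (x is y) == (id(x) == id(y))


def survives_list_round_trip(value):
    """Store a value in a list and check it reads back as the same object."""
    a = b = value
    mylist = [a, None]
    return mylist[0] is a and mylist[0] is b


# A mix of mutable and immutable values; all stay alive for the whole check,
# so the "non-overlapping lifetimes" caveat from the docs does not apply.
samples = [object(), [], "text", 12345678901234567890, 23.45910234718, (1, 2)]

equivalence_holds = all(
    identity_matches_id(p, q) for p in samples for q in samples
)
round_trip_holds = all(survives_list_round_trip(v) for v in samples)
```

Whether both flags should be required to hold on every implementation is exactly the question the thread is debating.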

Armin Rigo

1:41 p.m.

Hi all,

On Sun, May 5, 2013 at 9:16 PM, Steven D'Aprano <steve@pearwood.info> wrote:

...

...
It's true what you're saying, but we consistently see bug reports about people comparing ints or strings with is and complaining that they work fine on cpython, but not on pypy.

Then their code is buggy, not PyPy. But you know that :-)

This is precisely what this thread is about: such "buggy" code that uses "is" to compare two immutable objects. At this point, the question is not "would it cause any trouble in existing programs to say that 'x is not y' when CPython in the same program says that 'x is y'?", because we know that the answer to that is "yes".

We already found out a perfectly reasonable fix for "small" objects: two equal ints are always "is"-identical and have the same id() in PyPy. This is a nice way to solve the above problem. If anything it creates the opposite problem: some code that works on PyPy might not work on CPython. If PyPy becomes used enough, CPython will then have to care about that too, and we'll end up with a well-defined definition of "is" on immutable objects :-)

But we're not (yet) using the same idea on *all* types of immutable objects. So what we're concerned about now is whether it could be implemented efficiently: the answer could be "yes, if we forget about strictly enforcing 'x is y <=> id(x) == id(y)'". So, the question: although it's documented to be wrong, would it actually cause any trouble to relax this requirement?

...

    a = b = X  # regardless of what X is
    mylist = [a, None]
    assert mylist[0] is a
    assert mylist[0] is b

both assertions must pass, no matter what X is, whether mutable or immutable.

I *think* that in this case the assertions cannot fail in PyPy either. If X is a string, then we get as "mylist[0]" an object that is a different W_StringObject but containing internally the same RPython-level string, and as such (because we tweaked "is") they compare "is"-identical. But that seems like a problem waiting to happen: if in the future we're using a list strategy for a list of single characters, then W_StringObjects containing single characters will be rebuilt out of an RPython list of characters, and not be "is"-identical under our current definition.

In addition, the problem right now is about code like ``if x[5] is '.': ...`` which happens to work as expected on CPython, but not on PyPy. In PyPy's case the two strings x[5] and '.' are using different RPython-level strings.

A bientôt,

Armin.

Jacob Hallén

4:10 p.m.

Personally, I think that being implementation-detail compatible with CPython is the way to go if we want to achieve maximum popularity in the short run. Making a sane implementation (Armin's third option) is the one that I think will serve the Python community best in the long run. Using "is" as a comparison when you mean "==" is a bad meme that has been very hard to get rid of.

The person whose opinion on this matter I would value the most is Guido. I suggest asking him.

Jacob

Amaury Forgeot d'Arc

5:38 a.m.

Hi,

2013/5/5 Armin Rigo <arigo@tunes.org>

...

Hi all,

I'm just wondering again about some "bug" reports that are not bugs, about people misusing "is" to compare two immutable objects. The current situation in PyPy is that "is" works like "==" for ints, longs, floats or complexes. It does not for strs or unicodes or tuples. Now of course someone on python-dev was (indirectly) complaining that you can compare in CPython ``x is ' '``, which works because single-character strings are cached, but not in PyPy. I'm sure someone else has been bitten by writing in CPython ``x is ()``, which is also cached there.

Strings are not always cached; with CPython 2.7:

    >>> x = u'é'.encode('ascii', 'ignore')
    >>> x == '', x is ''
    (True, False)

-- Amaury Forgeot d'Arc

Armin Rigo

1:18 p.m.

Hi Amaury,

On Sun, May 5, 2013 at 1:38 PM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:

...

Strings are not always cached; with CPython 2.7:

    >>> x = u'é'.encode('ascii', 'ignore')
    >>> x == '', x is ''
    (True, False)

That's true, there are such cases, but that's partially irrelevant for this issue: strings that *sometimes,* or *often,* end up with the same id() in CPython. Should they also end up with the same id() in PyPy?

A bientôt,

Armin.

Ilya Osadchiy

2:11 p.m.

On Sun, May 5, 2013 at 12:59 PM, Armin Rigo <arigo@tunes.org> wrote:

...

Can we fix it once and for all? It's annoying because of id: if we want ``x is y`` for equal huge strings x and y, but still want ``id(x)==id(y)``, then we have to compute ``id(some_string)`` in a rather slow way, producing a huge number. The same for tuples: if we always want ``(1, 2) is (1, 2)`` then we need to compute ``id(some_tuple)`` recursively, which can also lead to huge numbers. In fact such a definition can explode the memory: ``a = (); for i in range(100): a = (a, a); id(a)`` would likely need a 2**100-digits number.

If the "id(x)==id(y)" requirement is removed, does it mean that "x is y" for immutable types is simply "x==y"? So if we have ``a = (); for i in range(100): a = (a, a); b = (a, a)`` then "a is b" will be computationally expensive?

Armin Rigo

6 May

1:38 a.m.

Hi Ilya,

On Sun, May 5, 2013 at 10:11 PM, Ilya Osadchiy <osadchiy.ilya@gmail.com> wrote:

...

If the "id(x)==id(y)" requirement is removed, does it mean that "x is y" for immutable types is simply "x==y"? So if we have ``a = (); for i in range(100): a = (a, a); b = (a, a)`` then "a is b" will be computationally expensive?

It's not exactly ``x==y``: for tuples it means recursively checking that items are ``is``-identical. It's possible to avoid the computational explosion, like CPython did for equality a long time ago (up to maybe 2.3?) before it was removed. You basically want to check if the a and b objects are "in a bisimulation" or not, which can be done without visiting the same object more than once, for any connection graph.

The reason I think it's a good idea (or at least not a bad idea) to reintroduce the complexity of bisimulation where CPython removed it, is that the purpose is slightly different and not visible to the user at all. If I remember correctly it was removed because it had hard-to-explain effects on when and how many times the user's ``__eq__()`` methods were called; but there is no user-overridable code involved here, merely an "implementation detail".

It could equivalently be solved by aggressively caching all tuple creation.

A bientôt,

Armin.
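A simplified sketch of the idea of comparing tuples item by item without revisiting the same pair of objects. This is not PyPy's implementation and not a general bisimulation algorithm for cyclic graphs. It assumes structures built only from tuples and ints (which cannot be cyclic), `same_identity` and `nested` are hypothetical names, and ints are compared by value to mirror the rule described in the thread.

```python
def same_identity(x, y, memo):
    """Value-based identity check for tuples and ints, memoized per object pair."""
    if x is y:
        return True
    if type(x) is not type(y):
        return False
    if type(x) is tuple:
        key = (id(x), id(y))
        if key not in memo:
            # Each pair of tuples is compared at most once.
            memo[key] = len(x) == len(y) and all(
                same_identity(p, q, memo) for p, q in zip(x, y)
            )
        return memo[key]
    if type(x) is int:
        return x == y
    return False  # other types: fall back to plain object identity


def nested(depth):
    """Build the (a, a) structure from the thread, `depth` levels deep."""
    t = ()
    for _ in range(depth):
        t = (t, t)
    return t


a = nested(100)
b = nested(100)
memo = {}
result = same_identity(a, b, memo)  # visits about 100 pairs, not 2**100
```

The memo is keyed on ``id()`` pairs, which is safe here only because both structures stay alive for the duration of the comparison.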

Michael Hudson-Doyle

5 May

2:40 p.m.

On 5 May 2013 21:59, Armin Rigo <arigo@tunes.org> wrote:

...

Hi all,

I'm just wondering again about some "bug" reports that are not bugs, about people misusing "is" to compare two immutable objects. The current situation in PyPy is that "is" works like "==" for ints, longs, floats or complexes.

I want to say something about negative zeroes here....

Cheers,
mwh

Armin Rigo

6 May

12:52 a.m.

Hi Michael,

On Sun, May 5, 2013 at 10:40 PM, Michael Hudson-Doyle <micahel@gmail.com> wrote:

...

I want to say something about negative zeroes here....

Right: on floats it's not actually the usual equality, but equality of the bit pattern (using float2longlong).

A bientôt,

Armin.

Amaury Forgeot d'Arc

1:03 a.m.

2013/5/6 Armin Rigo <arigo@tunes.org>

Right: on floats it's not actually the usual equality, but equality of the bit pattern (using float2longlong).

Except for NaN...

-- 
Amaury Forgeot d'Arc

William ML Leslie

1:25 a.m.

On 6 May 2013 17:03, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:

Except for NaN...

It's perfectly acceptable for NaNs to be `is`-identical based on their bit pattern.

-- 
William Leslie

Notice: Likely much of this email is, by the nature of copyright, covered under copyright law. You absolutely may reproduce any part of it in accordance with the copyright law of the nation you are reading this in. Any attempt to deny you those rights would be illegal without prior contractual agreement.

Steven D'Aprano

7:12 a.m.

On Mon, May 06, 2013 at 05:25:24PM +1000, William ML Leslie wrote:

It's perfectly acceptable for NaN to `is` on their bit pattern.

Not unless the implementation caches floats. Otherwise you could have two distinct instances with the same bit pattern.

NANs are no different from other floats in that the language doesn't guarantee that there is only one of them. Unless an implementation ensures that there is *exactly one* float object with a given bit pattern, then you can have multiple instances of a specific NAN, and two NANs with the same bit pattern may be distinct objects.

Although... a thought comes to mind. Since floats are immutable, you could add an abstraction between the actual objects in memory as seen by the low-level implementation, and what are seen as distinct objects by high-level Python code. So two floats with the same bit-pattern in two different memory locations could nevertheless be seen by Python as one instance.

I have no idea whether this is plausible, or if PyPy already does this, or whether I'm talking sheer nonsense.

Of course the IDs would have to be the same, and that's tricky, but I guess that's what this thread is about.

-- 
Steven

Armin Rigo

8:03 a.m.

Hi Steven,

On Mon, May 6, 2013 at 3:12 PM, Steven D'Aprano <steve@pearwood.info> wrote:

...

I have no idea whether this is plausible, or if PyPy already does this, or whether I'm talking sheer nonsense.

PyPy already does this.

...

Of course the IDs would have to be the same, and that's tricky, but I guess that's what this thread is about.

This is not tricky: the id is the bit pattern seen as an 8-byte integer. This thread is about the harder cases of strings and tuples.

A bientôt,

Armin.
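The bit-pattern rule for floats can be illustrated in plain Python with the standard ``struct`` module. This is a sketch of the idea only: `float_bits` and `same_float_identity` are hypothetical helpers, not the float2longlong routine mentioned earlier, and the little-endian layout is just a choice for the example.

```python
import struct


def float_bits(x):
    """Reinterpret a float's IEEE 754 double encoding as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def same_float_identity(x, y):
    """Compare two floats by bit pattern rather than by numeric equality."""
    return float_bits(x) == float_bits(y)


nan = float("nan")

# Negative zero: numerically equal to 0.0, but a different bit pattern.
zeros_equal = (0.0 == -0.0)
zeros_identical = same_float_identity(0.0, -0.0)

# NaN: never equal to itself, but the same object has the same bit pattern.
nan_equal = (nan == nan)
nan_identical = same_float_identity(nan, nan)
```

Under this rule the two zeros are distinct while a NaN is identical to any float carrying the same bits, which is the behaviour discussed for negative zeroes and NaN above.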

Simon Cross

5 May

4:43 p.m.

Solution 3 sounds bad since it breaks things in PyPy for people who were using "is" more correctly in CPython.

Alex Gaynor

4:45 p.m.

I wonder if maybe we can't have some sort of flag to add extra compatibility warnings, and then have a warning when `is` is used on ints, strings, etc.?

Alex

-- "I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire) "The people's good is the highest law." -- Cicero GPG Key fingerprint: 125F 5C67 DFE9 4084

Simon Cross

4:48 p.m.

I was thinking along similar lines -- we could ask for things like "x is ''" or "x is 3" to be added to PEP8 (I think any use of "is" with a constant on one or more sides is likely suspect).
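A style check of this kind can be sketched with the standard ``ast`` module. This is an illustration only, not an existing pep8 rule: `suspicious_is_comparisons` is a hypothetical name, the sketch assumes Python 3.8 or later (where literals parse as ``ast.Constant``), and it deliberately leaves ``None``, ``True``, ``False`` and ``...`` alone because comparing against those singletons with ``is`` is normal. An empty tuple ``()`` is not a constant node and is not caught here.

```python
import ast


def suspicious_is_comparisons(source):
    """Return the line numbers where `is` / `is not` has a literal operand."""
    singletons = (None, True, False, Ellipsis)
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Compare):
            continue
        operands = [node.left] + node.comparators
        # Walk each (operator, left operand, right operand) triple of the chain.
        for op, left, right in zip(node.ops, operands, operands[1:]):
            if not isinstance(op, (ast.Is, ast.IsNot)):
                continue
            for side in (left, right):
                if isinstance(side, ast.Constant) and not any(
                    side.value is s for s in singletons
                ):
                    hits.append(node.lineno)
                    break
    return sorted(hits)


example = (
    "if x is '':\n"
    "    pass\n"
    "if y is None:\n"
    "    pass\n"
    "if z is not 3:\n"
    "    pass\n"
)
flagged_lines = suspicious_is_comparisons(example)
```

Here lines 1 and 5 of the example are reported, while the ``is None`` test on line 3 is left alone.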

Armin Rigo

6 May

12:54 a.m.

Hi Simon,

On Mon, May 6, 2013 at 12:48 AM, Simon Cross <hodgestar@gmail.com> wrote:

...

I was thinking along similar lines -- we could ask for things like "x is ''" or "x is 3" to be added to PEP8 (I think any use of "is" with a constant on one or more sides is likely suspect).

That may be a good idea. If the compiler emits SyntaxWarnings for these cases, then maybe it's all we need to cover most of the bad usages.

A bientôt,

Armin.
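For reference, later CPython releases (3.8 and newer, as far as I know) do emit a SyntaxWarning at compile time for ``is`` with certain literals. The sketch below only observes that behaviour through the standard ``warnings`` module. `literal_is_warnings` is a hypothetical helper, the exact warning text varies between versions, and other implementations may behave differently.

```python
import warnings


def literal_is_warnings(source):
    """Compile `source` (without running it) and collect SyntaxWarning messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [
        str(w.message) for w in caught if issubclass(w.category, SyntaxWarning)
    ]


flagged = literal_is_warnings("x = ''\nprint(x is '')\n")
clean = literal_is_warnings("x = ''\nprint(x == '')\n")
```

On such a CPython, `flagged` contains a message about ``is`` with a literal and `clean` is empty.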

Christian Tismer

10 May

3:26 a.m.

On 06.05.13 08:54, Armin Rigo wrote:

...

Hi Simon,

On Mon, May 6, 2013 at 12:48 AM, Simon Cross <hodgestar@gmail.com> wrote:

I was thinking along similar lines -- we could ask for things like "x is ''" or "x is 3" to be added to PEP8 (I think any use of "is" with a constant on one or more sides is likely suspect).

That may be a good idea. If the compiler emits SyntaxWarnings for these cases, then maybe it's all we need to cover most of the bad usages.

I highly appreciate this idea, too! Educating people to avoid misuse of "is" probably has more impact in the long term, because the pep8 module is pretty often used as a measure of code cleanliness.

cheers - chris

-- 
Christian Tismer
