PyObject_RichCompareBool identity shortcut
The other day I was surprised to learn this:
>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]  # also True in tuples, dicts, etc.
True
# also:
>>> l = [nan]
>>> nan in l
True
>>> l.index(nan)
0
>>> l[0] == nan
False
The identity test is not in the container comparators, but in PyObject_RichCompareBool itself:

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

The guarantee referred to in the comment is not only (AFAICT) undocumented, but contradicts the documentation, which states that the result should be the "equivalent of o1 op o2". Calling PyObject_RichCompareBool is therefore inconsistent with calling PyObject_RichCompare and converting its result to bool manually, something that wrappers (C++) and generators (Cython) might reasonably want to do themselves, for various reasons. If this is considered a bug, I can open an issue.

Hrvoje
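For illustration, the two code paths being contrasted can be modelled roughly in Python (the helper names here are invented for this sketch; the real logic lives in C):

```python
def rich_compare_bool_eq(v, w):
    # Models PyObject_RichCompareBool(v, w, Py_EQ): identity
    # short-circuits before __eq__ is ever consulted.
    if v is w:
        return True
    return bool(v == w)

def rich_compare_eq(v, w):
    # Models PyObject_RichCompare followed by a manual conversion
    # to bool: no identity shortcut, so IEEE 754 semantics apply.
    return bool(v == w)

nan = float('nan')
# The two paths disagree for an object that is not equal to itself:
assert rich_compare_bool_eq(nan, nan) is True
assert rich_compare_eq(nan, nan) is False
```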
Message written by Hrvoje Niksic on 2011-04-27, at 11:37:
The other day I was surprised to learn this:
>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]  # also True in tuples, dicts, etc.
True
# also:
>>> l = [nan]
>>> nan in l
True
>>> l.index(nan)
0
>>> l[0] == nan
False
This surprises me as well. I guess this is all related to the fact that:
>>> nan is nan
True
Have a look at this as well:
>>> inf = float('inf')
>>> inf == inf
True
>>> [inf] == [inf]
True
>>> l = [inf]
>>> inf in l
True
>>> l.index(inf)
0
>>> l[0] == inf
True
# Or even:
>>> inf+1 == inf-1
True
For the infinity part, I believe this is related to the funky IEEE 754 standard. I found some discussion about this here: http://compilers.iecc.com/comparch/article/98-07-134 -- Best regards, Łukasz Langa
2011/4/27 Łukasz Langa
# Or even:
>>> inf+1 == inf-1
True
For the infinity part, I believe this is related to the funky IEEE 754 standard. I found some discussion about this here: http://compilers.iecc.com/comparch/article/98-07-134
The inf behaviour is fine (inf != inf only when you start talking about aleph levels, and IEEE 754 doesn't handle those). It's specifically `nan` that is problematic, as it is one of the very few cases that breaks the reflexivity of equality. That said, the current behaviour was chosen deliberately so that containers could cope with `nan` at least somewhat gracefully: http://bugs.python.org/issue4296 Issue 10912 added an explicit note about this behaviour to the 3.x series documentation, but that has not as yet been backported to 2.7 (I reopened the issue to request such a backport). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Apr 27, 2011, at 2:37 AM, Hrvoje Niksic wrote:
The other day I was surprised to learn this:
>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]  # also True in tuples, dicts, etc.
True
Would you also be surprised if you put an object in a dictionary but couldn't get it out? Or added it to a list but its count was zero?

Identity-implies-equality is necessary so that classes can maintain their invariants and so that programmers can reason about their code. It is not just in PyObject_RichCompareBool, it is deeply embedded in the language (the logic inside dicts, for example). It is not a short-cut, it is a way of making sure that internally we can count on equality relations being reflexive, symmetric, and transitive.

A programmer needs to be able to make basic deductions, such as the relationship between the two forms of the in-operator:

    for elem in somelist:
        assert elem in somelist  # this should never fail

What surprises me is that anyone gets surprised by anything when experimenting with an object that isn't equal to itself. It is roughly in the same category as creating a __hash__ that has no relationship to __eq__, or making self-referencing sets, or setting False, True = 1, 0 in Python 2. See http://bertrandmeyer.com/2010/02/06/reflexivity-and-other-pillars-of-civiliz... for a nice blog post on the subject.

Raymond
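The in-operator invariant above can be checked directly; the identity test inside the containment check is what keeps it true even when the list holds a NaN:

```python
nan = float('nan')
somelist = [1.0, nan, 2.0]

for elem in somelist:
    # Holds even for the nan element, because list.__contains__
    # checks identity before falling back to ==.
    assert elem in somelist
```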
On Wed, Apr 27, 2011 at 7:39 AM, Raymond Hettinger
On Apr 27, 2011, at 2:37 AM, Hrvoje Niksic wrote:
The other day I was surprised to learn this:
>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]  # also True in tuples, dicts, etc.
True
Would you also be surprised if you put an object in a dictionary but couldn't get it out? Or added it to a list but its count was zero?

Identity-implies-equality is necessary so that classes can maintain their invariants and so that programmers can reason about their code. It is not just in PyObject_RichCompareBool, it is deeply embedded in the language (the logic inside dicts, for example). It is not a short-cut, it is a way of making sure that internally we can count on equality relations being reflexive, symmetric, and transitive.

A programmer needs to be able to make basic deductions, such as the relationship between the two forms of the in-operator:

    for elem in somelist:
        assert elem in somelist  # this should never fail

What surprises me is that anyone gets surprised by anything when experimenting with an object that isn't equal to itself. It is roughly in the same category as creating a __hash__ that has no relationship to __eq__, or making self-referencing sets, or setting False, True = 1, 0 in Python 2. See http://bertrandmeyer.com/2010/02/06/reflexivity-and-other-pillars-of-civiliz... for a nice blog post on the subject.
Maybe we should just call off the odd NaN comparison behavior? -- --Guido van Rossum (python.org/~guido)
On Thu, Apr 28, 2011 at 12:53 AM, Guido van Rossum
What surprises me is that anyone gets surprised by anything when experimenting with an object that isn't equal to itself. It is roughly in the same category as creating a __hash__ that has no relationship to __eq__ or making self-referencing sets or setting False,True=1,0 in python 2. See http://bertrandmeyer.com/2010/02/06/reflexivity-and-other-pillars-of-civiliz... for a nice blog post on the subject.
Maybe we should just call off the odd NaN comparison behavior?
Rereading Meyer's article (I read it last time this came up, but it's a nice piece, so I ended up going over it again this time), the quote that leapt out at me was this one:

"""A few of us who had to examine the issue recently think that — whatever the standard says at the machine level — a programming language should support the venerable properties that equality is reflexive and that assignment yields equality. Every programming language should decide this on its own; for Eiffel we think this should be the specification. Do you agree?"""

Currently, Python tries to split the difference: "==" and "!=" follow IEEE754 for NaN, but most other operations involving builtin types rely on the assumption that equality is always reflexive (and IEEE754 be damned).

What that means is that "correct" implementations of methods like __contains__, __eq__, __ne__, index() and count() on containers should be using "x is y or x == y" to enforce reflexivity, but most such code does not (e.g. our own collections.abc.Sequence implementation gets the ones it implements wrong, and hence Sequence-based containers will handle NaN in a way that differs from the builtin containers).

And none of that is actually documented anywhere (other than a behavioural note in the 3.x documentation for PyObject_RichCompareBool), so it's currently just an implementation detail of CPython that most of the builtin containers behave that way in practice.

Given the status quo, what would seem to be the path of least resistance is to:
- articulate in the language specification which container special methods are expected to enforce reflexivity of equality (even for non-reflexive types)
- articulate in the library specification which ordinary container methods enforce reflexivity of equality
- fix any standard library containers that don't enforce reflexivity to do so where appropriate (e.g. collections.abc.Sequence)

Types with a non-reflexive notion of equality still wouldn't play nicely with containers that didn't enforce reflexivity where appropriate, but bad interactions between 3rd party types isn't really something we can prevent.

Backing away from having float and decimal.Decimal respect the IEEE754 notion of NaN inequality at this late stage of the game seems like one for the "too hard" basket. It also wouldn't achieve much, since we want the builtin containers to preserve their invariants even for 3rd party types with a non-reflexive notion of equality.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
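A minimal sketch of what a reflexivity-enforcing container method looks like, using the "x is y or x == y" pattern described above (the class here is illustrative, not stdlib code):

```python
class ReflexiveSeq:
    """Illustrative sequence whose membership test enforces
    reflexivity, mirroring what the builtin containers do."""

    def __init__(self, items):
        self._items = list(items)

    def __contains__(self, value):
        # Identity check first, then equality.
        return any(value is item or value == item for item in self._items)

nan = float('nan')
s = ReflexiveSeq([nan])
assert nan in s               # same object: found via the identity check
assert float('nan') not in s  # a distinct NaN object compares unequal
```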
On Wed, Apr 27, 2011 at 11:31 AM, Nick Coghlan
Backing away from having float and decimal.Decimal respect the IEEE754 notion of NaN inequality at this late stage of the game seems like one for the "too hard" basket.
Why? float('nan') has always been in use-at-your-own-risk territory despite recent efforts to support it across Python platforms. I cannot speak about decimal.Decimal (and decimal is a different story because it is tied to a particular standard), but the only use of non-reflexivity for float nans I've seen was the use of x != x instead of math.isnan(x).
It also wouldn't achieve much, since we want the builtin containers to preserve their invariants even for 3rd party types with a non-reflexive notion of equality.
These are orthogonal issues. A third party type that plays with __eq__ and other basic operations can easily break stdlib algorithms no matter what we do. Therefore it is important to document the properties of the types that each algorithm relies on. It is more important, however, that stdlib types do not break 3rd party algorithms. I don't think I've ever seen a third party type that deliberately defines a non-reflexive __eq__ except as a side effect of using float attributes or C float members in the underlying structure. (Yes, decimal is a counter-example, but this is a very special case.)
On 4/27/2011 8:31 AM, Nick Coghlan wrote:
What that means is that "correct" implementations of methods like __contains__, __eq__, __ne__, index() and count() on containers should be using "x is y or x == y" to enforce reflexivity, but most such code does not (e.g. our own collections.abc.Sequence implementation gets those of these that it implements wrong, and hence Sequence based containers will handle NaN in a way that differs from the builtin containers)
+1 to everything Nick said. One issue that I don't fully understand: I know there is only one instance of None in Python, but I'm not sure where to discover whether there is only a single, or whether there can be multiple, instances of NaN or Inf. The IEEE 754 spec is clear that there are multiple bit sequences that can be used to represent these, so I would hope that there can be, in fact, more than one value containing NaN (and Inf). This would properly imply that a collection should correctly handle the case of storing multiple, different items using different NaN (and Inf) instances. A dict, for example, should be able to hold hundreds of items with the index value of NaN. The distinction between "is" and "==" would permit proper operation, and I believe that Python's "rebinding" of names to values rather than the copying of values to variables makes such a distinction possible to use in a correct manner. Can someone confirm or explain this issue?
On 4/27/2011 2:41 PM, Glenn Linderman wrote:
One issue that I don't fully understand: I know there is only one instance of None in Python, but I'm not sure where to discover whether there is only a single, or whether there can be multiple, instances of NaN or Inf.
I am sure there are multiple instances with just one bit pattern, the same as other floats. Otherwise, float('nan') would have to either randomly or systematically choose from among the possibilities. Ugh. There are functions in the math module that pull apart (and put together) floats.
The IEEE 754 spec is clear that there are multiple bit sequences that can be used to represent these,
Anyone actually interested in those should use C or possibly the math module float assembly function.
so I would hope that there can be, in fact, more than one value containing NaN (and Inf).
If you do not know which pattern is which, what use could such possibly be? -- Terry Jan Reedy
Terry Reedy wrote:
On 4/27/2011 2:41 PM, Glenn Linderman wrote:
One issue that I don't fully understand: I know there is only one instance of None in Python, but I'm not sure where to discover whether there is only a single, or whether there can be multiple, instances of NaN or Inf.
I am sure there are multiple instances with just one bit pattern, the same as other floats. Otherwise, float('nan') would have to either randomly or systematically choose from among the possibilities. Ugh.
I think Glenn is asking whether NANs are singletons. They're not:
>>> x = float('nan')
>>> y = float('nan')
>>> x is y
False
>>> [x] == [y]
False
There are functions in the math module that pull apart (and put together) floats.
The IEEE 754 spec is clear that there are multiple bit sequences that can be used to represent these,
Anyone actually interested in those should use C or possibly the math module float assembly function.
I'd like to point out that way back in the 1980s, Apple's Hypercard allowed users to construct, and compare, distinct NANs without needing to use C or check bit patterns. I think it is painful and ironic that a development system aimed at non-programmers released by a company notorious for "dumbing down" interfaces over 20 years ago had better and simpler support for NANs than we have now. -- Steven
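For the curious, distinct NaN bit patterns can in fact be constructed from Python with struct (this sketch assumes Python floats are IEEE 754 binary64, which holds for CPython on all common platforms; the helper name is made up):

```python
import math
import struct

def nan_from_bits(bits):
    # Reinterpret a 64-bit integer as a double (little-endian).
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

a = nan_from_bits(0x7ff8000000000001)  # quiet NaN, payload 1
b = nan_from_bits(0x7ff8000000000002)  # quiet NaN, payload 2

assert math.isnan(a) and math.isnan(b)
# The bit patterns differ, but float comparison cannot distinguish
# them: each compares unequal to everything, including itself.
assert a != b and a != a
```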
On Wed, Apr 27, 2011 at 7:41 PM, Glenn Linderman
One issue that I don't fully understand: I know there is only one instance of None in Python, but I'm not sure where to discover whether there is only a single, or whether there can be multiple, instances of NaN or Inf. The IEEE 754 spec is clear that there are multiple bit sequences that can be used to represent these, so I would hope that there can be, in fact, more than one value containing NaN (and Inf).
This would properly imply that a collection should correctly handle the case of storing multiple, different items using different NaN (and Inf) instances. A dict, for example, should be able to hold hundreds of items with the index value of NaN.
The distinction between "is" and "==" would permit proper operation, and I believe that Python's "rebinding" of names to values rather than the copying of values to variables makes such a distinction possible to use in a correct manner.
For infinities, there's no issue: there are exactly two distinct infinities (+inf and -inf), and they don't have any special properties that affect membership tests. Your float-keyed dict can contain both +inf and -inf keys, or just one, or neither, in exactly the same way that it can contain both +5.0 and -5.0 as keys, or just one, or neither. For nans, you *can* put multiple nans into a dictionary as separate keys, but under the current rules the test for 'sameness' of two nan keys becomes a test of object identity, not of bitwise equality. Python takes no notice of the sign bits and 'payload' bits of a float nan, except in operations like struct.pack and struct.unpack. For example:
>>> x, y = float('nan'), float('nan')
>>> d = {x: 1, y: 2}
>>> x in d
True
>>> y in d
True
>>> d[x]
1
>>> d[y]
2
But using struct.pack, you can see that x and y are bitwise identical:
>>> struct.pack('<d', x) == struct.pack('<d', y)
True
Mark
On 4/27/2011 2:15 PM, Mark Dickinson wrote:

On Wed, Apr 27, 2011 at 7:41 PM, Glenn Linderman wrote:

One issue that I don't fully understand: I know there is only one instance of None in Python, but I'm not sure where to discover whether there is only a single, or whether there can be multiple, instances of NaN or Inf. The IEEE 754 spec is clear that there are multiple bit sequences that can be used to represent these, so I would hope that there can be, in fact, more than one value containing NaN (and Inf).

This would properly imply that a collection should correctly handle the case of storing multiple, different items using different NaN (and Inf) instances. A dict, for example, should be able to hold hundreds of items with the index value of NaN.

The distinction between "is" and "==" would permit proper operation, and I believe that Python's "rebinding" of names to values rather than the copying of values to variables makes such a distinction possible to use in a correct manner.

For infinities, there's no issue: there are exactly two distinct infinities (+inf and -inf), and they don't have any special properties that affect membership tests. Your float-keyed dict can contain both +inf and -inf keys, or just one, or neither, in exactly the same way that it can contain both +5.0 and -5.0 as keys, or just one, or neither. For nans, you *can* put multiple nans into a dictionary as separate keys, but under the current rules the test for 'sameness' of two nan keys becomes a test of object identity, not of bitwise equality. Python takes no notice of the sign bits and 'payload' bits of a float nan, except in operations like struct.pack and struct.unpack.

Thanks, Mark, for the succinct description and demonstration. Yes, only two Inf values, many possible NaNs. And this is what I would expect.
I would not, however, expect the original case that was described:

>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]  # also True in tuples, dicts, etc.
True
Glenn Linderman writes:
I would not, however expect the original case that was described:
>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]  # also True in tuples, dicts, etc.
True
Are you saying you would expect that
>>> nan = float('nan')
>>> a = [1, ..., 499, nan, 501, ..., 999]  # meta-ellipsis, not Ellipsis
>>> a == a
False
?? I wouldn't even expect
>>> a = [1, ..., 499, float('nan'), 501, ..., 999]
>>> b = [1, ..., 499, float('nan'), 501, ..., 999]
>>> a == b
False
but I guess I have to live with that.<wink> While I wouldn't apply it to other people, I have to admit Raymond's aphorism applies to me (the surprising thing is not the behavior of NaNs, but that I'm surprised by anything that happens in the presence of NaNs!)
On 4/27/2011 7:31 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:
I would not, however expect the original case that was described:
>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]  # also True in tuples, dicts, etc.
True
Are you saying you would expect that
>>> nan = float('nan')
>>> a = [1, ..., 499, nan, 501, ..., 999]  # meta-ellipsis, not Ellipsis
>>> a == a
False
??
Yes, absolutely. Once you understand the definition of NaN, it certainly cannot be True. a is a, but a is not equal to a.
I wouldn't even expect
>>> a = [1, ..., 499, float('nan'), 501, ..., 999]
>>> b = [1, ..., 499, float('nan'), 501, ..., 999]
>>> a == b
False
but I guess I have to live with that.<wink> While I wouldn't apply it to other people, I have to admit Raymond's aphorism applies to me (the surprising thing is not the behavior of NaNs, but that I'm surprised by anything that happens in the presence of NaNs!)
The only thing that should happen in the presence of NaNs is more NaNs :)
On 04/28/2011 04:31 AM, Stephen J. Turnbull wrote:
Are you saying you would expect that
>>> nan = float('nan')
>>> a = [1, ..., 499, nan, 501, ..., 999]  # meta-ellipsis, not Ellipsis
>>> a == a
False
??
I would expect l1 == l2, where l1 and l2 are both lists, to be semantically equivalent to len(l1) == len(l2) and all(imap(operator.eq, l1, l2)). Currently it isn't, and that was the motivation for this thread. If objects that break reflexivity of == are not allowed, this should be documented, and such objects banished from the standard library. Hrvoje
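The semantics Hrvoje describes can be written out as a hypothetical helper, which makes the divergence from the builtin behaviour easy to see (using map, the Python 3 spelling of imap):

```python
import operator

def elementwise_list_eq(l1, l2):
    # Hypothetical strict element-wise equality, with no identity
    # shortcut: equal lengths and all pairs compare equal.
    return len(l1) == len(l2) and all(map(operator.eq, l1, l2))

nan = float('nan')
assert [nan] == [nan]                         # builtin: identity shortcut
assert not elementwise_list_eq([nan], [nan])  # element-wise: IEEE 754
```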
On 4/27/2011 11:31 AM, Nick Coghlan wrote:
Currently, Python tries to split the difference: "==" and "!=" follow IEEE754 for NaN, but most other operations involving builtin types rely on the assumption that equality is always reflexive (and IEEE754 be damned).
What that means is that "correct" implementations of methods like __contains__, __eq__, __ne__, index() and count() on containers should be using "x is y or x == y" to enforce reflexivity, but most such code does not (e.g. our own collections.abc.Sequence implementation gets those of these that it implements wrong, and hence Sequence based containers will handle NaN in a way that differs from the builtin containers)
And none of that is actually documented anywhere (other than a behavioural note in the 3.x documentation for PyObject_RichCompareBool), so it's currently just an implementation detail of CPython that most of the builtin containers behave that way in practice.
Which is why I proposed a Glossary entry in another post.
Given the status quo, what would seem to be the path of least resistance is to:
- articulate in the language specification which container special methods are expected to enforce reflexivity of equality (even for non-reflexive types)
- articulate in the library specification which ordinary container methods enforce reflexivity of equality
- fix any standard library containers that don't enforce reflexivity to do so where appropriate (e.g. collections.abc.Sequence)
+1 to making my proposed text consistently true if not now ;-).
Backing away from having float and decimal.Decimal respect the IEEE754 notion of NaN inequality at this late stage of the game seems like one for the "too hard" basket.
Robert Kern confirmed my suspicion about this relative to numpy.
It also wouldn't achieve much, since we want the builtin containers to preserve their invariants even for 3rd party types with a non-reflexive notion of equality.
Good point. -- Terry Jan Reedy
On Wed, Apr 27, 2011 at 10:53 AM, Guido van Rossum
Maybe we should just call off the odd NaN comparison behavior?
+1

There was a long thread on this topic last year: http://mail.python.org/pipermail/python-dev/2010-March/098832.html

I was trying to find a rationale for non-reflexivity of equality in IEEE, and although it is often mentioned that this property simplifies some numerical algorithms, I have yet to find an important algorithm that would benefit from it. I also believe that the long history of suboptimal hardware implementations of NaN arithmetic has stifled the development of practical applications. High performance applications that rely on non-reflexivity will still have the option of using the ctypes.c_float type or NumPy.
On Thu, Apr 28, 2011 at 1:43 AM, Alexander Belopolsky
High performance applications that rely on non-reflexivity will still have an option of using ctypes.c_float type or NumPy.
However, that's exactly the reason I don't see any reason to reverse course on having float() and Decimal() follow IEEE754 semantics, regardless of how irritating we may find those semantics to be. Since we allow types to customise __eq__ and __ne__ with non-standard behaviour, if we want to permit *any* type to have a non-reflexive notion of equality, then we need to write our container types to enforce reflexivity when appropriate. Many of the builtin types already do this, by virtue of it being built in to RichCompareBool. It's now a matter of documenting that properly and updating the non-conformant types accordingly. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Wed, 27 Apr 2011, Alexander Belopolsky wrote:
High performance applications that rely on non-reflexivity will still have an option of using ctypes.c_float type or NumPy.
Python could also provide IEEE-754 equality as a function (perhaps in "math"), something like:

    def ieee_equal(a, b):
        return a == b and not isnan(a) and not isnan(b)

Of course, the definition of math.isnan cannot then be by checking its argument by comparison with itself - it would have to check the appropriate bits of the float representation.

Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist
On Wed, Apr 27, 2011 at 12:05 PM, Isaac Morland
Of course, the definition of math.isnan cannot then be by checking its argument by comparison with itself - it would have to check the appropriate bits of the float representation.
math.isnan() is implemented in C and does not rely on float.__eq__ in any way.
On Wed, 27 Apr 2011 12:05:12 -0400 (EDT)
Isaac Morland
On Wed, 27 Apr 2011, Alexander Belopolsky wrote:
High performance applications that rely on non-reflexivity will still have an option of using ctypes.c_float type or NumPy.
Python could also provide IEEE-754 equality as a function (perhaps in "math"), something like:
    def ieee_equal(a, b):
        return a == b and not isnan(a) and not isnan(b)
+1 (perhaps call it math.eq()). Regards Antoine.
On Wed, 27 Apr 2011, Antoine Pitrou wrote:
Isaac Morland
wrote: Python could also provide IEEE-754 equality as a function (perhaps in "math"), something like:
    def ieee_equal(a, b):
        return a == b and not isnan(a) and not isnan(b)
+1 (perhaps call it math.eq()).
Alexander Belopolsky pointed out to me (thanks!) that isnan is implemented in C, so my caveat about the implementation of isnan is not an issue. But then that made me realize that ieee_equal (or just "eq" if that's preferable) probably ought to be implemented in C using a floating-point comparison - i.e., use the processor implementation of the comparison operation.

Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist
On Apr 27, 2011, at 7:53 AM, Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior?
I'm reluctant to suggest changing such enshrined behavior. ISTM, the current state of affairs is reasonable. Exotic objects are allowed to generate exotic behaviors, but consumers of those objects are free to ignore some of those behaviors by making reasonable assumptions about how an object should behave. It's possible to make objects where the __hash__ doesn't correspond to __eq__; they just won't behave well with hash tables. Likewise, it's possible for a sequence to define a __len__ that is different from its true length; it just won't behave well with the various pieces of code that assume collections are equal if the lengths are unequal. All of this seems reasonable to me. Raymond
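The __hash__ example above can be made concrete: a hash that bears no relationship to __eq__ quietly breaks dict lookup for equal-but-distinct keys (the class below is invented for illustration):

```python
class BadHash:
    """Deliberately broken: __hash__ ignores __eq__."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value

    def __hash__(self):
        return id(self)  # unrelated to the value compared by __eq__

d = {BadHash(1): 'x'}
# An equal key hashes differently, so the dict cannot find it:
assert BadHash(1) == BadHash(1)
assert BadHash(1) not in d
```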
On Wed, Apr 27, 2011 at 12:28 PM, Raymond Hettinger
On Apr 27, 2011, at 7:53 AM, Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior?
I'm reluctant to suggest changing such enshrined behavior.
ISTM, the current state of affairs is reasonable. Exotic objects are allowed to generate exotic behaviors but consumers of those objects are free to ignore some of those behaviors by making reasonable assumptions about how an object should behave.
Unfortunately NaNs are not that exotic. They can be silently produced in calculations and lead to hard to find errors. For example:
>>> x = 1e300*1e300
>>> x - x
nan
This means that every program dealing with float data has to detect nans at every step and handle them correctly. This in turn makes it impossible to write efficient code that works equally well with floats and integers. Note that historically, Python was trying hard to prevent production of non-finite floats. AFAICT, none of the math functions would produce inf or nan. I am not sure why arithmetic operations are different. For example:
>>> 1e300*1e300
inf
but
>>> 1e300**2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: (34, 'Result too large')
and
>>> math.pow(1e300, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: math range error
On Wed, Apr 27, 2011 at 9:28 AM, Raymond Hettinger
On Apr 27, 2011, at 7:53 AM, Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior?
I'm reluctant to suggest changing such enshrined behavior.
No doubt there would be some problems; probably more for decimals than for floats.
ISTM, the current state of affairs is reasonable.
Hardly; when I picked the NaN behavior I knew the IEEE std prescribed it but had never seen any code that used this.
Exotic objects are allowed to generate exotic behaviors but consumers of those objects are free to ignore some of those behaviors by making reasonable assumptions about how an object should behave.
I'd say that the various issues and inconsistencies brought up (e.g. x in A even though no a in A equals x) make it clear that one ignores NaN's exoticnesss at one's peril.
It's possible to make objects where the __hash__ doesn't correspond to __eq__; they just won't behave well with hash tables.
That's not the same thing at all. Such an object would violate a rule of the language (although one that Python cannot strictly enforce) and it would always be considered a bug. Currently NaN is not violating any language rules -- it is just violating users' intuition, in a much worse way than Inf does. (All in all, Inf behaves pretty intuitively, at least for someone who was awake during at least a few high school math classes. NaN is not discussed there. :-)
Likewise, it's possible for a sequence to define a __len__ that is different from its true length; it just won't behave well with the various pieces of code that assume collections are equal if the lengths are unequal.
(you probably meant "are never equal") Again, typically a bug.
All of this seems reasonable to me.
Given the IEEE std and Python's history, it's defensible and hard to change, but still, I find "reasonable" too strong a word for the situation. I expect that if 15 years or so ago I had decided to ignore the IEEE std and declare that object identity always implies equality, it would have seemed quite reasonable as well...

The rule could be something like "the == operator first checks for identity and if left and right are the same object, the answer is True without calling the object's __eq__ method; similarly the != operator would always return False when an object is compared to itself". We wouldn't change the inequalities, nor the outcome if a NaN is compared to another NaN (not the same object). But we would extend the special case for object identity from containers to all == and != operators. (Currently it seems that all NaNs have a hash() of 0. That hasn't hurt anyone so far.)

Doing this in 3.3 would, alas, be a huge undertaking -- I expect that there are tons of unittests that depend either on the current NaN behavior or on x == x calling x.__eq__(x). Plus the decimal unittests would be affected. Perhaps somebody could try?

-- --Guido van Rossum (python.org/~guido)
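The proposed rule can be sketched as a pair of hypothetical comparison helpers (this models the suggested semantics, not what CPython's == operator currently does):

```python
def proposed_eq(a, b):
    # Identity short-circuits before __eq__ is called.
    if a is b:
        return True
    return a == b

def proposed_ne(a, b):
    if a is b:
        return False
    return a != b

nan = float('nan')
other_nan = float('nan')
assert proposed_eq(nan, nan)            # same object: True by identity
assert not proposed_eq(nan, other_nan)  # distinct NaN objects: unequal
assert not proposed_ne(nan, nan)
assert proposed_ne(nan, other_nan)
```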
On Wed, Apr 27, 2011 at 11:14 PM, Guido van Rossum
ISTM, the current state of affairs is reasonable.
Hardly; when I picked the NaN behavior I knew the IEEE std prescribed it but had never seen any code that used this.
Same here. The only code I've seen that depended on this NaN behavior was either buggy (the programmer did not consider the NaN case) or was using x == x as a way to detect nans. The latter idiom is universally frowned upon regardless of the language. In Python one should use math.isnan() for this purpose.

I would like to present a challenge to the proponents of the status quo. Look through your codebase and find code that will behave differently if nan == nan were True. Then come back and report how many bugs you have found. :-) Seriously, though, I bet that if you find anything, it will fall into one of the two cases I mentioned above. ..
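The two spellings being contrasted, side by side:

```python
import math

nan = float('nan')
# The frowned-upon idiom: a NaN is the only float unequal to itself.
assert nan != nan
# The explicit, preferred test:
assert math.isnan(nan)
assert not math.isnan(1.0)
```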
I expect that that if 15 years or so ago I had decided to ignore the IEEE std and declare that object identity always implies equality it would have seemed quite reasonable as well... The rule could be something like "the == operator first checks for identity and if left and right are the same object, the answer is True without calling the object's __eq__ method; similarly the != would always return False when an object is compared to itself".
Note that ctypes' floats already behave this way:
>>> x = c_double(float('nan'))
>>> x == x
True
..
Doing this in 3.3 would, alas, be a huge undertaking -- I expect that there are tons of unittests that depend either on the current NaN behavior or on x == x calling x.__eq__(x). Plus the decimal unittests would be affected. Perhaps somebody could try?
Before we go down this path, I would like to discuss another peculiarity of NaNs:
>>> float('nan') < 0
False
>>> float('nan') > 0
False
This property in my experience causes much more trouble than nan == nan being false. The problem is that common sorting or binary search algorithms may degenerate into infinite loops in the presence of nans. This may even happen when searching for a finite value in a large array that contains a single nan. Errors like this do happen in the wild, and after chasing a bug like this programmers tend to avoid nans at all costs. Oftentimes this leads to using "magic" placeholders such as 1e300 for missing data. Since py3k has already made None < 0 an error, it may be reasonable for float('nan') < 0 to raise an error as well (probably ValueError rather than TypeError). This will not make lists with nans sortable or searchable using binary search, but will make associated bugs easier to find.
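The binary-search failure mode is easy to reproduce with bisect; the list below stands in for what sorting data containing a NaN can produce, since a NaN answers False to every ordering comparison:

```python
import bisect

nan = float('nan')
s = [1.0, 2.0, nan, 3.0, 4.0]  # "sorted" data with a stray NaN in it

# bisect assumes a totally ordered list; the NaN derails the search
# before it can reach 4.0.
i = bisect.bisect_left(s, 4.0)
assert 4.0 in s                          # a linear scan finds it
assert not (i < len(s) and s[i] == 4.0)  # binary search does not
```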
Guido van Rossum wrote:
Currently NaN is not violating any language rules -- it is just violating users' intuition, in a much worse way than Inf does.
If it's to be an official language non-rule (by which I mean that types are officially allowed to compare non-reflexively) then any code assuming that identity implies equality for arbitrary objects is broken and should be fixed. -- Greg
On 4/27/2011 10:53 AM, Guido van Rossum wrote:
On Wed, Apr 27, 2011 at 7:39 AM, Raymond Hettinger
Identity-implies-equality is necessary so that classes can maintain their invariants and so that programmers can reason about their code. [snip] See http://bertrandmeyer.com/2010/02/06/reflexivity-and-other-pillars-of-civiliz... for a nice blog post on the subject.
I carefully reread this, with the comments, and again came to the conclusion that the committee left us no *good* answer, only a choice between various more-or-less unsatisfactory answers. The current Python compromise may be as good as anything. In any case, I think it should be explicitly documented with an indexed paragraph, perhaps as follows: "The IEEE-754 committee defined the float Not-a-Number (NaN) value as being incomparable with all other floats, including itself. This violates the math and logic rule that equality is reflexive, that 'a == a' is always True, and Python collection classes depend on that rule for their proper operation. So Python makes the following compromise. Direct equality comparisons involving NaN, such as "nan = float('nan'); nan == ob", follow the IEEE-754 rule and return False. Indirect comparisons conducted internally as part of a collection operation, such as 'nan in someset' or 'seq.count(nan)' or 'somedict[x]', follow the reflexive rule and act as if 'nan == nan' were True. Most Python programmers will never see a NaN in real programs." This might best be an entry in the Glossary under "NaN -- Not a Number". It should be the first reference for NaN in the General Index and linked to from the float() builtin and the float type's NaN mentions.
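The compromise described above can be demonstrated in a few lines (plain CPython, no assumptions beyond the behavior the thread is discussing):

```python
nan = float('nan')

# Direct comparison follows IEEE 754:
assert not (nan == nan)

# Container operations check identity first, so the *same* NaN object
# is found, while a distinct NaN object is not:
assert nan in [nan]
assert [nan] == [nan]
assert float('nan') not in [nan]
assert [1.0, nan].count(nan) == 1
```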
Maybe we should just call off the odd NaN comparison behavior?
Eiffel seems to have survived, though I do not know if it is used for numerical work. I wonder how much code would break and what the scipy folks would think. 3.0 would have been the time, though. -- Terry Jan Reedy
On 4/27/11 12:44 PM, Terry Reedy wrote:
On 4/27/2011 10:53 AM, Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior?
Eiffel seems to have survived, though I do not know if it is used for numerical work. I wonder how much code would break and what the scipy folks would think.
I suspect most of us would oppose changing it on general backwards-compatibility grounds rather than actually *liking* the current behavior. If the behavior changed with Python floats, we'd have to mull over whether we try to match that behavior with our scalar types (one of which subclasses from float) and our arrays. We would be either incompatible with Python or C, and we'd probably end up choosing Python to diverge from. It would make a mess, honestly. We already have to explain why equality is funky for arrays (arr1 == arr2 is a rich comparison that gives an array, not a bool, so we can't do containment tests for lists of arrays), so NaN is pretty easy to explain afterward. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Wed, Apr 27, 2011 at 11:48 AM, Robert Kern
On 4/27/11 12:44 PM, Terry Reedy wrote:
On 4/27/2011 10:53 AM, Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior?
Eiffel seems to have survived, though I do not know if it is used for numerical work. I wonder how much code would break and what the scipy folks would think.
I suspect most of us would oppose changing it on general backwards-compatibility grounds rather than actually *liking* the current behavior. If the behavior changed with Python floats, we'd have to mull over whether we try to match that behavior with our scalar types (one of which subclasses from float) and our arrays. We would be either incompatible with Python or C, and we'd probably end up choosing Python to diverge from. It would make a mess, honestly. We already have to explain why equality is funky for arrays (arr1 == arr2 is a rich comparison that gives an array, not a bool, so we can't do containment tests for lists of arrays), so NaN is pretty easy to explain afterward.
So does NumPy also follow Python's behavior about ignoring the NaN special-casing when doing array ops? -- --Guido van Rossum (python.org/~guido)
On 2011-04-27 22:16 , Guido van Rossum wrote:
On Wed, Apr 27, 2011 at 11:48 AM, Robert Kern
wrote: On 4/27/11 12:44 PM, Terry Reedy wrote:
On 4/27/2011 10:53 AM, Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior?
Eiffel seems to have survived, though I do not know if it is used for numerical work. I wonder how much code would break and what the scipy folks would think.
I suspect most of us would oppose changing it on general backwards-compatibility grounds rather than actually *liking* the current behavior. If the behavior changed with Python floats, we'd have to mull over whether we try to match that behavior with our scalar types (one of which subclasses from float) and our arrays. We would be either incompatible with Python or C, and we'd probably end up choosing Python to diverge from. It would make a mess, honestly. We already have to explain why equality is funky for arrays (arr1 == arr2 is a rich comparison that gives an array, not a bool, so we can't do containment tests for lists of arrays), so NaN is pretty easy to explain afterward.
So does NumPy also follow Python's behavior about ignoring the NaN special-casing when doing array ops?
By "ignoring the NaN special-casing", do you mean that identity is checked first? When we use dtype=object arrays (arrays that contain Python objects as their data), yes:

[~] |1> nan = float('nan')
[~] |2> import numpy as np
[~] |3> a = np.array([1, 2, nan], dtype=object)
[~] |4> nan in a
True
[~] |5> float('nan') in a
False

Just like lists:

[~] |6> nan in [1, 2, nan]
True
[~] |7> float('nan') in [1, 2, nan]
False

Actually, we go a little further by using PyObject_RichCompareBool() rather than PyObject_RichCompare() to implement the array-wise comparisons in addition to containment:

[~] |8> a == nan
array([False, False, True], dtype=bool)
[~] |9> [x == nan for x in [1, 2, nan]]
[False, False, False]

But for dtype=float arrays (which contain C doubles, not Python objects) we use C semantics. Literally, we use whatever C's == operator gives us for the two double values. Since there is no concept of identity for this case, there is no cognate behavior of Python to match.

[~] |10> b = np.array([1.0, 2.0, nan], dtype=float)
[~] |11> b == nan
array([False, False, False], dtype=bool)
[~] |12> nan in b
False

-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Wed, Apr 27, 2011 at 8:42 PM, Robert Kern
On 2011-04-27 22:16 , Guido van Rossum wrote:
So does NumPy also follow Python's behavior about ignoring the NaN special-casing when doing array ops?
By "ignoring the NaN special-casing", do you mean that identity is checked first? When we use dtype=object arrays (arrays that contain Python objects as their data), yes:
[~] |1> nan = float('nan')
[~] |2> import numpy as np
[~] |3> a = np.array([1, 2, nan], dtype=object)
[~] |4> nan in a
True
[~] |5> float('nan') in a
False
Just like lists:
[~] |6> nan in [1, 2, nan]
True
[~] |7> float('nan') in [1, 2, nan]
False
Actually, we go a little further by using PyObject_RichCompareBool() rather than PyObject_RichCompare() to implement the array-wise comparisons in addition to containment:
[~] |8> a == nan
array([False, False, True], dtype=bool)
Hm, this sounds like NumPy always considers a NaN equal to *itself* as long as objects are concerned.
[~] |9> [x == nan for x in [1, 2, nan]]
[False, False, False]
But for dtype=float arrays (which contain C doubles, not Python objects) we use C semantics. Literally, we use whatever C's == operator gives us for the two double values. Since there is no concept of identity for this case, there is no cognate behavior of Python to match.
[~] |10> b = np.array([1.0, 2.0, nan], dtype=float)
[~] |11> b == nan
array([False, False, False], dtype=bool)
[~] |12> nan in b
False
And I wouldn't want to change that. It sounds like NumPy wouldn't be much affected if we were to change this (which I'm not saying we would). Thanks! -- --Guido van Rossum (python.org/~guido)
On 2011-04-27 23:01 , Guido van Rossum wrote:
On Wed, Apr 27, 2011 at 8:42 PM, Robert Kern
wrote:
But for dtype=float arrays (which contain C doubles, not Python objects) we use C semantics. Literally, we use whatever C's == operator gives us for the two double values. Since there is no concept of identity for this case, there is no cognate behavior of Python to match.
[~] |10> b = np.array([1.0, 2.0, nan], dtype=float)
[~] |11> b == nan
array([False, False, False], dtype=bool)
[~] |12> nan in b
False
And I wouldn't want to change that. It sounds like NumPy wouldn't be much affected if we were to change this (which I'm not saying we would).
Well, I didn't say that. If Python changed its behavior for (float('nan') == float('nan')), we'd have to seriously consider some changes. We do like to keep *some* amount of correspondence with Python semantics. In particular, we like our scalar types that match Python types to work as close to the Python type as possible. We have the np.float64 type, which represents a C double scalar and corresponds to a Python float. It is used when a single item is indexed out of a float64 array. We even subclass from the Python float type to help working with libraries that may not know about numpy:

[~] |5> import numpy as np
[~] |6> nan = np.array([1.0, 2.0, float('nan')])[2]
[~] |7> nan == nan
False
[~] |8> type(nan)
numpy.float64
[~] |9> type(nan).mro()
[numpy.float64, numpy.floating, numpy.inexact, numpy.number, numpy.generic, float, object]

If the Python float type changes behavior, we'd have to consider whether to keep that for np.float64 or change it to match the usual C semantics used elsewhere. So there *would* be a dilemma. Not necessarily the most nerve-wracking one, but a dilemma nonetheless.

-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Wed, Apr 27, 2011 at 9:25 PM, Robert Kern
On 2011-04-27 23:01 , Guido van Rossum wrote:
And I wouldn't want to change that. It sounds like NumPy wouldn't be much affected if we were to change this (which I'm not saying we would).
Well, I didn't say that. If Python changed its behavior for (float('nan') == float('nan')), we'd have to seriously consider some changes.
Ah, but I'm not proposing anything of the sort! float('nan') returns a new object each time and two NaNs that are not the same *object* will still follow the IEEE std. It's just when comparing a NaN-valued *object* to *itself* (i.e. the *same* object) that I would consider following the lead of Python's collections.
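As a rough Python model of the shortcut under discussion (illustrative only; the real check lives in C inside PyObject_RichCompareBool, and the function name here is made up):

```python
import operator

def rich_compare_bool(v, w, op):
    # Identity wins for == and != before the objects' own
    # comparison methods are consulted; other operators fall through.
    if v is w:
        if op is operator.eq:
            return True
        if op is operator.ne:
            return False
    return bool(op(v, w))

nan = float('nan')
assert rich_compare_bool(nan, nan, operator.eq)               # same object: True
assert not rich_compare_bool(nan, float('nan'), operator.eq)  # distinct NaN objects: False
```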
We do like to keep *some* amount of correspondence with Python semantics. In particular, we like our scalar types that match Python types to work as close to the Python type as possible. We have the np.float64 type, which represents a C double scalar and corresponds to a Python float. It is used when a single item is indexed out of a float64 array. We even subclass from the Python float type to help working with libraries that may not know about numpy:
[~] |5> import numpy as np
[~] |6> nan = np.array([1.0, 2.0, float('nan')])[2]
[~] |7> nan == nan
False
Yeah, this is where things might change, because it is the same *object* left and right.
[~] |8> type(nan)
numpy.float64
[~] |9> type(nan).mro()
[numpy.float64, numpy.floating, numpy.inexact, numpy.number, numpy.generic, float, object]
If the Python float type changes behavior, we'd have to consider whether to keep that for np.float64 or change it to match the usual C semantics used elsewhere. So there *would* be a dilemma. Not necessarily the most nerve-wracking one, but a dilemma nonetheless.
Given what I just said, would it still be a dilemma? Maybe a smaller one? -- --Guido van Rossum (python.org/~guido)
On Thu, Apr 28, 2011 at 2:54 PM, Guido van Rossum
Well, I didn't say that. If Python changed its behavior for (float('nan') == float('nan')), we'd have to seriously consider some changes.
Ah, but I'm not proposing anything of the sort! float('nan') returns a new object each time and two NaNs that are not the same *object* will still follow the IEEE std. It's just when comparing a NaN-valued *object* to *itself* (i.e. the *same* object) that I would consider following the lead of Python's collections.
The reason this possibility bothers me is that it doesn't mesh well with the "implementations are free to cache and reuse immutable objects" rule. Although, if the updated NaN semantics were explicit that identity was now considered part of the value of NaN objects (thus ruling out caching them at the implementation layer), I guess that objection would go away. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 4/27/11 11:54 PM, Guido van Rossum wrote:
On Wed, Apr 27, 2011 at 9:25 PM, Robert Kern
wrote: On 2011-04-27 23:01 , Guido van Rossum wrote:
And I wouldn't want to change that. It sounds like NumPy wouldn't be much affected if we were to change this (which I'm not saying we would).
Well, I didn't say that. If Python changed its behavior for (float('nan') == float('nan')), we'd have to seriously consider some changes.
Ah, but I'm not proposing anything of the sort! float('nan') returns a new object each time and two NaNs that are not the same *object* will still follow the IEEE std. It's just when comparing a NaN-valued *object* to *itself* (i.e. the *same* object) that I would consider following the lead of Python's collections.
Ah, I see!
We do like to keep *some* amount of correspondence with Python semantics. In particular, we like our scalar types that match Python types to work as close to the Python type as possible. We have the np.float64 type, which represents a C double scalar and corresponds to a Python float. It is used when a single item is indexed out of a float64 array. We even subclass from the Python float type to help working with libraries that may not know about numpy:
[~] |5> import numpy as np
[~] |6> nan = np.array([1.0, 2.0, float('nan')])[2]
[~] |7> nan == nan
False
Yeah, this is where things might change, because it is the same *object* left and right.
[~] |8> type(nan)
numpy.float64
[~] |9> type(nan).mro()
[numpy.float64, numpy.floating, numpy.inexact, numpy.number, numpy.generic, float, object]
If the Python float type changes behavior, we'd have to consider whether to keep that for np.float64 or change it to match the usual C semantics used elsewhere. So there *would* be a dilemma. Not necessarily the most nerve-wracking one, but a dilemma nonetheless.
Given what I just said, would it still be a dilemma? Maybe a smaller one?
Smaller, certainly. But now it's a trilemma. :-)

1. Have just np.float64 and np.complex128 scalars follow the Python float semantics, since they subclass Python float and complex, respectively.
2. Have all np.float* and np.complex* scalars follow the Python float semantics.
3. Keep the current IEEE-754 semantics for all float scalar types.

-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Wed, Apr 27, 2011 at 2:48 PM, Robert Kern
I suspect most of us would oppose changing it on general backwards-compatibility grounds rather than actually *liking* the current behavior. If the behavior changed with Python floats, we'd have to mull over whether we try to match that behavior with our scalar types (one of which subclasses from float) and our arrays. We would be either incompatible with Python or C, and we'd probably end up choosing Python to diverge from. It would make a mess, honestly. We already have to explain why equality is funky for arrays (arr1 == arr2 is a rich comparison that gives an array, not a bool, so we can't do containment tests for lists of arrays), so NaN is pretty easy to explain afterward.
Most NumPy applications are actually not exposed to NaN problems because it is recommended that NaNs be avoided in computations, and when missing or undefined values are necessary, the recommended solution is to use ma.array, or masked array, which is a drop-in replacement for the numpy array type and carries a boolean "mask" value with every element. This makes it possible to have undefined elements in arrays of any type: float, integer or even boolean. Masked values propagate through all computations, including comparisons.
On Wed, Apr 27, 2011 at 9:15 PM, Alexander Belopolsky
On Wed, Apr 27, 2011 at 2:48 PM, Robert Kern
wrote: .. I suspect most of us would oppose changing it on general backwards-compatibility grounds rather than actually *liking* the current behavior. If the behavior changed with Python floats, we'd have to mull over whether we try to match that behavior with our scalar types (one of which subclasses from float) and our arrays. We would be either incompatible with Python or C, and we'd probably end up choosing Python to diverge from. It would make a mess, honestly. We already have to explain why equality is funky for arrays (arr1 == arr2 is a rich comparison that gives an array, not a bool, so we can't do containment tests for lists of arrays), so NaN is pretty easy to explain afterward.
Most NumPy applications are actually not exposed to NaN problems because it is recommended that NaNs be avoided in computations, and when missing or undefined values are necessary, the recommended solution is to use ma.array, or masked array, which is a drop-in replacement for the numpy array type and carries a boolean "mask" value with every element. This makes it possible to have undefined elements in arrays of any type: float, integer or even boolean. Masked values propagate through all computations, including comparisons.
So do new masks get created when the outcome of an elementwise operation is a NaN? Because that's the only reason why one should have NaNs in one's data in the first place -- not to indicate missing values! -- --Guido van Rossum (python.org/~guido)
On 2011-04-27 23:24 , Guido van Rossum wrote:
On Wed, Apr 27, 2011 at 9:15 PM, Alexander Belopolsky
wrote: On Wed, Apr 27, 2011 at 2:48 PM, Robert Kern
wrote: .. I suspect most of us would oppose changing it on general backwards-compatibility grounds rather than actually *liking* the current behavior. If the behavior changed with Python floats, we'd have to mull over whether we try to match that behavior with our scalar types (one of which subclasses from float) and our arrays. We would be either incompatible with Python or C, and we'd probably end up choosing Python to diverge from. It would make a mess, honestly. We already have to explain why equality is funky for arrays (arr1 == arr2 is a rich comparison that gives an array, not a bool, so we can't do containment tests for lists of arrays), so NaN is pretty easy to explain afterward.
Most NumPy applications are actually not exposed to NaN problems because it is recommended that NaNs be avoided in computations, and when missing or undefined values are necessary, the recommended solution is to use ma.array, or masked array, which is a drop-in replacement for the numpy array type and carries a boolean "mask" value with every element. This makes it possible to have undefined elements in arrays of any type: float, integer or even boolean. Masked values propagate through all computations, including comparisons.
So do new masks get created when the outcome of an elementwise operation is a NaN?
No.
Because that's the only reason why one should have NaNs in one's data in the first place -- not to indicate missing values!
Yes. I'm not sure that Alexander was being entirely clear. Masked arrays are intended to solve just the missing data problem and not the occurrence of NaNs from computations. There is still a persistent part of the community that really does like to use NaNs for missing data, though. I don't think that's entirely relevant to this discussion[1]. I wouldn't say that numpy applications aren't exposed to NaN problems. They are just as exposed to computational NaNs as you would expect any application that does that many flops to be. [1] Okay, that's a lie. I'm sure that persistent minority would *love* to have NaN == NaN, because that would make their (ab)use of NaNs easier to work with. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Wed, Apr 27, 2011 at 9:33 PM, Robert Kern
[1] Okay, that's a lie. I'm sure that persistent minority would *love* to have NaN == NaN, because that would make their (ab)use of NaNs easier to work with.
Too bad, because that won't change. :-) I agree that this is abuse of NaNs and shouldn't be encouraged. -- --Guido van Rossum (python.org/~guido)
On Thu, Apr 28, 2011 at 12:33 AM, Robert Kern
On 2011-04-27 23:24 , Guido van Rossum wrote: ..
So do new masks get created when the outcome of an elementwise operation is a NaN?
No.
Yes.
>>> from MA import array
>>> print array([0])/array([0])
[-- ]
(I don't have numpy on this laptop, so the example is using Numeric, but I hope you guys did not change that while I was not looking:-)
On 4/28/11 12:37 AM, Alexander Belopolsky wrote:
On Thu, Apr 28, 2011 at 12:33 AM, Robert Kern
wrote: On 2011-04-27 23:24 , Guido van Rossum wrote: ..
So do new masks get created when the outcome of an elementwise operation is a NaN?
No.
Yes.
>>> from MA import array
>>> print array([0])/array([0])
[-- ]
(I don't have numpy on this laptop, so the example is using Numeric, but I hope you guys did not change that while I was not looking:-)
This behavior is not what you think it is. Rather, some binary operations have been augmented with a domain of validity, and the results will be masked out when the domain is violated. Division is one of them, and division by zero will cause the result to be masked. You can produce NaNs in other ways that will not be masked in both numpy and old Numeric:

[~] |4> minf = np.ma.array([1e300]) * np.ma.array([1e300])
Warning: overflow encountered in multiply
[~] |5> minf
masked_array(data = [ inf], mask = False, fill_value = 1e+20)
[~] |6> minf - minf
masked_array(data = [ nan], mask = False, fill_value = 1e+20)

[~] |14> import MA
[~] |15> minf = MA.array([1e300]) * MA.array([1e300])
[~] |16> minf
array([ inf,])
[~] |17> (minf - minf)[0]
nan
[~] |25> (minf - minf)._mask is None
True

Numeric has a bug where it cannot print arrays with NaNs, so I just grabbed the element out instead of showing it. But I guarantee you that it is not masked. Masked arrays are not a way to avoid NaNs arising from computations. NaN handling is an important part of computing with numpy.

-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
[This is a mega-reply, combining responses to several messages in this
thread. I may be repeating myself a bit, but I think I am being
consistent. :-)]
On Wed, Apr 27, 2011 at 10:12 PM, Nick Coghlan
On Thu, Apr 28, 2011 at 2:54 PM, Guido van Rossum
wrote: Well, I didn't say that. If Python changed its behavior for (float('nan') == float('nan')), we'd have to seriously consider some changes.
Ah, but I'm not proposing anything of the sort! float('nan') returns a new object each time and two NaNs that are not the same *object* will still follow the IEEE std. It's just when comparing a NaN-valued *object* to *itself* (i.e. the *same* object) that I would consider following the lead of Python's collections.
The reason this possibility bothers me is that it doesn't mesh well with the "implementations are free to cache and reuse immutable objects" rule. Although, if the updated NaN semantics were explicit that identity was now considered part of the value of NaN objects (thus ruling out caching them at the implementation layer), I guess that objection would go away.
The rules for float could be expanded to disallow NaN caching.
But even if we didn't change any rules, reusing immutable objects
could currently make computations undefined, because container
comparisons use the "identity wins" rule. E.g. if we didn't change the
rule for nan==nan, but we did change float("nan") to always return a
specific singleton, comparisons like [float("nan")] == [float("nan")]
would change in outcome. (Note that not all NaNs could be the same
object, since there are multiple bit patterns meaning NaN; IIUC this
is different from Inf.)
All this makes me realize that there would be another issue, one that
I wouldn't know how to deal with: a JITting interpreter could
translate code involving floats into machine code, at which point
object identity would be lost (presumably the machine code would use
IEEE value semantics for NaN).
This also reminds me that the current "identity wins" rules for
containers, combined with the "NaN==NaN is always False" for
non-container contexts, theoretically also might pose constraints on
the correctness of certain JIT optimizations. I don't know if PyPy
optimizes any code involving tuples or lists of floats, so I don't
know if it is a problem in practice, but it does seem to pose a
complex constraint in theory.
TBH Whatever Raymond may say, I have never been a fan of the "identity
wins" rules for containers given that we don't have a corresponding
rule requiring __eq__ to return True for x.__eq__(x).
On Wed, Apr 27, 2011 at 10:27 PM, Alexander Belopolsky
Note that ctypes' floats already behave this way:
>>> x = c_double(float('nan'))
>>> x == x
True
But ctypes floats are not numbers. I don't think this provides any evidence (except possibly of a shortcut in the ctypes implementation for == :-).
Before we go down this path, I would like to discuss another peculiarity of NaNs:
>>> float('nan') < 0
False
>>> float('nan') > 0
False
This property in my experience causes much more trouble than nan == nan being false. The problem is that common sorting or binary search algorithms may degenerate into infinite loops in the presence of nans. This may even happen when searching for a finite value in a large array that contains a single nan. Errors like this do happen in the wild, and after chasing a bug like this programmers tend to avoid nans at all costs. Oftentimes this leads to using "magic" placeholders such as 1e300 for missing data.
Since py3k has already made None < 0 an error, it may be reasonable for float('nan') < 0 to raise an error as well (probably ValueError rather than TypeError). This will not make lists with nans sortable or searchable using binary search, but will make associated bugs easier to find.
Hmm... It feels like a much bigger can of worms and I'm not at all
sure that it is going to work out any better than the current behavior
(which can be coarsely characterized as "tough shit, float + {NaN} do
not form a total ordering" :-). Remember when some string comparisons
would raise exceptions if "uncomparable" Unicode and non-Unicode
values were involved? That was a major pain and we gladly killed that
in Py3k. (Though it was for ==/!=, not for < etc.)
Basically I think the IEEE std has probably done a decent job of
defining how NaNs should behave, with the exception of object identity
-- because the IEEE std does not deal with objects, only with values.
The only other thing that could perhaps work would be to disallow NaN
from ever being created, instead always raising an exception if NaN
would be produced. Like we do with division by zero. But that would be
a *huge* incompatible change to Python's floating point capabilities
and I'm not interested in going there. The *only* point where I think
we might have a real problem is the discrepancy between individual NaN
comparisons and container comparisons involving NaN (which take
identity into account in a way that individual comparisons don't).
On Wed, Apr 27, 2011 at 10:53 PM, Alexander Belopolsky
On Thu, Apr 28, 2011 at 12:24 AM, Guido van Rossum
wrote: So do new masks get created when the outcome of an elementwise operation is a NaN? Because that's the only reason why one should have NaNs in one's data in the first place.
If this is the case, why Python almost never produces NaNs as IEEE standard prescribes?
>>> 0.0/0.0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division
Even the IEEE std, AFAIK, lets you separately control what happens on zero division and on NaN-producing operations. Python has chosen to always raise an exception on zero division, and I don't think this violates the IEEE std.
-- not to indicate missing values!
Sometimes you don't have a choice. For example when you data comes from a database that uses NaNs for missing values.
I would choose to call that a bug in the database. It should use None, not NaN.
On Wed, Apr 27, 2011 at 11:07 PM, Greg Ewing
Guido van Rossum wrote:
Currently NaN is not violating any language rules -- it is just violating users' intuition, in a much worse way than Inf does.
If it's to be an official language non-rule (by which I mean that types are officially allowed to compare non-reflexively) then any code assuming that identity implies equality for arbitrary objects is broken and should be fixed.
Only if there's a use case for passing it NaNs.
On Wed, Apr 27, 2011 at 11:51 PM, Alexander Belopolsky
On Thu, Apr 28, 2011 at 2:20 AM, Glenn Linderman
wrote: .. In that bug, Nick, you mention that reflexive equality is something that container classes rely on in their implementation. Such reliance seems to me to be a bug, or an inappropriate optimization, ..
An alternative interpretation would be that it is a bug to use NaN values in lists.
This would be bad; the list shouldn't care what kind of objects can be stored in it.
It is certainly nonsensical to use NaNs as keys in dictionaries
But somehow it works, if you consider each NaN *object* as a different value. :-)
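How it "works" can be shown directly (plain CPython; each NaN object acts as its own dictionary key):

```python
nan = float('nan')

# Lookup checks identity before equality, so the same NaN object
# round-trips through a dict:
d = {nan: 'missing'}
assert d[nan] == 'missing'

# A *different* NaN object compares unequal to the existing key,
# so it is treated as a distinct key:
d[float('nan')] = 'other'
assert len(d) == 2
```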
and that reportedly led Java designers to forgo the nonreflexivity of nans:
""" A "NaN" value is not equal to itself. However, a "NaN" Java "Float" object is equal to itself. The semantic is defined this way, because otherwise "NaN" Java "Float" objects cannot be retrieved from a hash table. """ - http://www.concentric.net/~ttwang/tech/javafloat.htm
That is exactly the change I am proposing (currently with a strength of +0) for Python, because Python's containers (at least the built-in ones) have already decided to follow this rule even if the float type itself has not yet.
With the status quo in Python, it may only make sense to store NaNs in array.array, but not in a list.
I do not see how this follows.
On Thu, Apr 28, 2011 at 12:57 AM, Nick Coghlan
Because this assertion is an assertion about the behaviour of comparisons that violates IEEE754, while the assertions I list are all assertions about the behaviour of containers that can be made true *regardless* of IEEE754 by checking identity explicitly.
The correct assertion under Python's current container semantics is:
if list(c1) == list(c2):  # Make ordering assumption explicit
    assert all(x is y or x == y for x, y in zip(c1, c2))  # Enforce reflexivity
That does not apply to all containers, and does not make much sense for any containers except those we call sequences (although there are different but similar rules for other categories of containers). And I think you meant it backwards: the second line is actually the (current) *definition* of sequence equality; it does not just follow from sequence equality.

However, Python *used* to define sequence equality as plain elementwise equality, meaning that if nan == nan is always False, [nan] == [nan] would likewise be False. Raymond strongly believes that containers must be allowed to use the modified definition, I believe purely for performance reasons. (Without this rule, a list or tuple could not even cut short being compared to *itself*.) It seems you are in that camp too.

I think that if the rule for containers is really that important, we should take the logical consequence and make a rule that a well-behaved type defines __eq__ and __ne__ to let object identity overrule whatever definition of value equality it has, and we should change float and decimal to follow this rule. (The "well-behaved" qualifier is intended to clarify that the language doesn't actually try to enforce this rule, similar to the existing rule about the correspondence between __hash__ and __eq__.)
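The rule proposed here — identity overruling value equality — could be sketched as a hypothetical float subclass (ReflexiveFloat is illustrative only; this is not how float behaves today):

```python
import math

class ReflexiveFloat(float):
    """Hypothetical sketch of the proposed rule: object identity
    overrules IEEE 754 value equality."""
    def __eq__(self, other):
        if self is other:
            return True  # identity implies equality, per the proposed rule
        return float.__eq__(self, other)
    def __ne__(self, other):
        result = self.__eq__(other)
        if result is NotImplemented:
            return result
        return not result
    __hash__ = float.__hash__  # keep hashing consistent with float

nan = ReflexiveFloat('nan')
assert nan == nan                    # reflexive, unlike a plain float NaN
assert nan != ReflexiveFloat('nan')  # distinct NaN objects remain unequal
assert math.isnan(nan)               # still a NaN in every other respect
```

Under this definition, containers would no longer need any identity special-casing of their own to get reflexive behaviour for NaNs.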
Meyer is a purist - sticking with the mathematical definition of equality is the sort of thing that fits his view of the world and what Eiffel should be, even if it hinders interoperability with other languages and tools. Python tends to be a bit more pragmatic about things, in particular when it comes to interoperability, so it makes sense to follow IEEE754 and the decimal specification at the individual comparison level.
So what *does* Eiffel do when comparing two NaNs from different sources? I would say that in this case, Python's approach started out as naive, not pragmatic -- I was (and still mostly am) clueless about all issues numeric. Augmenting float/decimal equality to let object identity win would be an example of pragmatic.
However, we can contain the damage to some degree by specifying that containers should enforce reflexivity where they need it. This is already the case at the implementation level (collections.Sequence aside), it just needs to be pushed up to the language definition level.
I think that when objects are involved, the word reflexivity does not convey the right intuition.
Can you give examples of algorithms that would break if one of your invariants is violated, but would still work if the data contains NaNs?
Sure, anything that cares more about objects than it does about values. The invariants are about making containers behave like containers as far as possible, even in the face of recalcitrant types like IEEE754 floating point.
TBH I think it's more about being allowed to take various shortcuts in
the implementation than about some abstract behavioral property. The
abstract behavioral property doesn't matter that much, but assuming it
enables the optimization, and the optimization does matter. Another
example of pragmatics.
On Thu, Apr 28, 2011 at 8:52 AM, Robert Kern
Smaller, certainly. But now it's a trilemma. :-)
1. Have just np.float64 and np.complex128 scalars follow the Python float semantics, since they subclass Python float and complex, respectively.
2. Have all np.float* and np.complex* scalars follow the Python float semantics.
3. Keep the current IEEE-754 semantics for all float scalar types.
*If* my proposal gets accepted, there will be a blanket rule that no matter how exotic a type's __eq__ is defined, self.__eq__(self) (i.e., __eq__ called with the same *object* argument) must return True if the type's __eq__ is to be considered well-behaved; and Python containers may assume (for the purpose of optimizing their own comparison operations) that their elements have a well-behaved __eq__. -- --Guido van Rossum (python.org/~guido)
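The container-level assumption described here is already observable in CPython today; a minimal demonstration:

```python
nan = float('nan')

# Container comparisons treat identical objects as equal
# without consulting their __eq__.
assert not (nan == nan)      # float.__eq__ follows IEEE 754
assert [nan] == [nan]        # but list equality assumes reflexivity
assert (nan,) == (nan,)      # tuples behave the same way
assert {0: nan} == {0: nan}  # and so do dict value comparisons
```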
Guido van Rossum wrote:
*If* my proposal gets accepted, there will be a blanket rule that no matter how exotic a type's __eq__ is defined, self.__eq__(self) (i.e., __eq__ called with the same *object* argument) must return True if the type's __eq__ is to be considered well-behaved; and Python containers may assume (for the purpose of optimizing their own comparison operations) that their elements have a well-behaved __eq__.
I think that so long as "badly defined" objects are explicitly still permitted (with the understanding that they may behave badly in containers), and so long as NANs continue to be "badly behaved" in this sense, then I could live with that. It's really just formalizing the status quo as deliberate policy rather than an accident:

nan == nan will still return False
[nan] == [nan] will still return True

Purists on both sides will hate it :) -- Steven
On 4/28/2011 12:55 PM, Guido van Rossum wrote:
*If* my proposal gets accepted, there will be a blanket rule that no matter how exotic a type's __eq__ is defined, self.__eq__(self) (i.e., __eq__ called with the same *object* argument) must return True if the type's __eq__ is to be considered well-behaved;
This, to me, is a statement of the obvious ;-), but it should be stated in the docs. Do you also propose to make NaNs at least this well-behaved or leave them ill-behaved?
and Python containers may assume (for the purpose of optimizing their own comparison operations) that their elements have a well-behaved __eq__.
This almost states the status quo of the implementation, and the doc needs to be updated correspondingly. I do not think we should let object ill-behavior infect containers, so that they also become ill-behaved (not equal to themselves). -- Terry Jan Reedy
On Thu, Apr 28, 2011 at 1:48 PM, Terry Reedy
On 4/28/2011 12:55 PM, Guido van Rossum wrote:
*If* my proposal gets accepted, there will be a blanket rule that no matter how exotic a type's __eq__ is defined, self.__eq__(self) (i.e., __eq__ called with the same *object* argument) must return True if the type's __eq__ is to be considered well-behaved;
This, to me, is a statement of the obvious ;-), but it should be stated in the docs.
Do you also propose to make NaNs at least this well-behaved or leave them ill-behaved?
As I said, my proposal is to consider this a bug of the same severity as __hash__ and __eq__ disagreeing, and it would require float and Decimal to be changed.

The more conservative folks are in favor of not changing anything (except the abstract Sequence class), and solving things by documentation only. In that case the current exotic behavior of NaN should not be considered a bug but merely unusual, and the behavior of collections (assuming an object is always equal to itself, never mind what its __eq__ says) documented as just that. There would not be any mention of well-behaved, nor a judgment that NaN is not well-behaved.

If my proposal is accepted, the definition of sequence comparison etc. would actually become simpler, since it should not have to mention the special-casing of object identity; instead it could mention the assumption of items being well-behaved. Again, the relationship between __eq__ and __hash__ would be the model here; and in fact a "well-behaved" type would have both properties (__eq__ returns true -> same __hash__, object identity -> __eq__ returns true). A type that is not well-behaved has a bug. I do not want to declare the behavior of NaN a bug.
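The __eq__/__hash__ correspondence used as the model here can be illustrated with a deliberately broken type (BadHash is hypothetical, for illustration only):

```python
class BadHash:
    """Hypothetical type violating the __eq__/__hash__ contract:
    equal objects do not hash equal, so dict lookup silently fails."""
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value
    def __hash__(self):
        return id(self)  # bug: should be derived from self.value

a, b = BadHash(1), BadHash(1)
assert a == b              # the objects claim to be equal...
assert hash(a) != hash(b)  # ...but hash to different values,
d = {a: 'found'}           # so looking up b in a dict keyed by a
                           # will (almost certainly) miss
```

The proposal treats "x == x returns False" as a bug of exactly this kind: a violation of a contract that containers rely on.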
and Python
containers may assume (for the purpose of optimizing their own comparison operations) that their elements have a well-behaved __eq__.
This almost states the status quo of the implementation, and the doc needs to be updated correspondingly. I do not think we should let object ill-behavior infect containers, so that they also become ill-behaved (not equal to themselves).
There are other kinds of bad behavior that will still affect containers. So we have no choice about containers containing ill-behaved objects being (potentially) ill-behaved.

In some sense the primary issue at hand is whether "x == x returns False" indicates that x has a bug, or not. If it is a bug, the current float and Decimal types have that bug, and need to be fixed; and then the current behavior of containers is "merely" an optimization which may fail if there is a buggy item. The alternative is that we continue to say that it is not a bug, merely exotic, and that containers should test for identity before equality, not just as an optimization, but as the very essence of their semantics.

The third option would be to say that the optimization is wrong. But nobody wants that, as it would require a container's __eq__ method to always compare all items before returning True, even when comparing a container to *itself*. -- --Guido van Rossum (python.org/~guido)
On 4/28/11 11:55 AM, Guido van Rossum wrote:
On Thu, Apr 28, 2011 at 8:52 AM, Robert Kern
wrote: Smaller, certainly. But now it's a trilemma. :-)
1. Have just np.float64 and np.complex128 scalars follow the Python float semantics, since they subclass Python float and complex, respectively.
2. Have all np.float* and np.complex* scalars follow the Python float semantics.
3. Keep the current IEEE-754 semantics for all float scalar types.
*If* my proposal gets accepted, there will be a blanket rule that no matter how exotic a type's __eq__ is defined, self.__eq__(self) (i.e., __eq__ called with the same *object* argument) must return True if the type's __eq__ is to be considered well-behaved; and Python containers may assume (for the purpose of optimizing their own comparison operations) that their elements have a well-behaved __eq__.
*If* so, then we would then just have to decide between #2 and #3. With respect to this proposal, how does that interact with types that do not return bools for rich comparisons? For example, numpy arrays return bool arrays for comparisons. SQLAlchemy expressions return other SQLAlchemy expressions, etc. I realize that by being "not well-behaved" in this respect, we give up our rights to be proper elements in sortable, containment-checking containers. But in this and your followup message, you seem to be making a stronger statement that self.__eq__(self) not returning anything but True would be a bug in all contexts. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Thu, Apr 28, 2011 at 3:22 PM, Robert Kern
On 4/28/11 11:55 AM, Guido van Rossum wrote:
On Thu, Apr 28, 2011 at 8:52 AM, Robert Kern
wrote: Smaller, certainly. But now it's a trilemma. :-)
1. Have just np.float64 and np.complex128 scalars follow the Python float semantics, since they subclass Python float and complex, respectively.
2. Have all np.float* and np.complex* scalars follow the Python float semantics.
3. Keep the current IEEE-754 semantics for all float scalar types.
*If* my proposal gets accepted, there will be a blanket rule that no matter how exotic a type's __eq__ is defined, self.__eq__(self) (i.e., __eq__ called with the same *object* argument) must return True if the type's __eq__ is to be considered well-behaved; and Python containers may assume (for the purpose of optimizing their own comparison operations) that their elements have a well-behaved __eq__.
*If* so, then we would then just have to decide between #2 and #3.
With respect to this proposal, how does that interact with types that do not return bools for rich comparisons? For example, numpy arrays return bool arrays for comparisons. SQLAlchemy expressions return other SQLAlchemy expressions, etc. I realize that by being "not well-behaved" in this respect, we give up our rights to be proper elements in sortable, containment-checking containers. But in this and your followup message, you seem to be making a stronger statement that self.__eq__(self) not returning anything but True would be a bug in all contexts.
Sorry, we'll have to make an exception for those of course. This will somewhat complicate the interpretation of well-behaved, because those are *not* well-behaved as far as containers go (both dict key lookup and list membership are affected) but it is not a bug -- however it is a bug to put these in containers and then use container comparisons on the container. -- --Guido van Rossum (python.org/~guido)
On Fri, Apr 29, 2011 at 8:55 AM, Guido van Rossum
Sorry, we'll have to make an exception for those of course. This will somewhat complicate the interpretation of well-behaved, because those are *not* well-behaved as far as containers go (both dict key lookup and list membership are affected) but it is not a bug -- however it is a bug to put these in containers and then use container comparisons on the container.
That's a point in favour of the status quo, though - with the burden of enforcing reflexivity placed on the containers, types are free to make use of rich comparisons to return more than just simple True/False results. I hadn't really thought about it that way before this discussion - it is the identity checking behaviour of the builtin containers that lets us sensibly handle cases like sets of NumPy arrays. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Thu, Apr 28, 2011 at 4:04 PM, Nick Coghlan
On Fri, Apr 29, 2011 at 8:55 AM, Guido van Rossum
wrote: Sorry, we'll have to make an exception for those of course. This will somewhat complicate the interpretation of well-behaved, because those are *not* well-behaved as far as containers go (both dict key lookup and list membership are affected) but it is not a bug -- however it is a bug to put these in containers and then use container comparisons on the container.
That's a point in favour of the status quo, though - with the burden of enforcing reflexivity placed on the containers, types are free to make use of rich comparisons to return more than just simple True/False results.
Possibly. Though for types that *do* return True/False, NaN's behavior can still be infuriating.
I hadn't really thought about it that way before this discussion - it is the identity checking behaviour of the builtin containers that lets us sensibly handle cases like sets of NumPy arrays.
But do they? For non-empty arrays, __eq__ will always return something that is considered true, so any hash collisions will cause false positives. And look at this simple example:
>>> class C(list):
...     def __eq__(self, other):
...         if isinstance(other, C):
...             return [x == y for x, y in zip(self, other)]
...
>>> a = C([1,2,3])
>>> b = C([2,1,3])
>>> a == b
[False, False, True]
>>> x = [a, a]
>>> b in x
True
-- --Guido van Rossum (python.org/~guido)
On Fri, Apr 29, 2011 at 9:13 AM, Guido van Rossum
I hadn't really thought about it that way before this discussion - it is the identity checking behaviour of the builtin containers that lets us sensibly handle cases like sets of NumPy arrays.
But do they? For non-empty arrays, __eq__ will always return something that is considered true, so any hash collisions will cause false positives. And look at this simple example:
>>> class C(list):
...     def __eq__(self, other):
...         if isinstance(other, C):
...             return [x == y for x, y in zip(self, other)]
...
>>> a = C([1,2,3])
>>> b = C([2,1,3])
>>> a == b
[False, False, True]
>>> x = [a, a]
>>> b in x
True
Hmm, true. And things like count() and index() would still be thoroughly broken for sequences. OK, so scratch that idea - there's simply no sane way to handle such objects without using an identity-based container that ignores equality definitions altogether.

Pondering the NaN problem further, I think we can relatively easily argue that reflexive behaviour at the object level fits within the scope of IEEE754.

1. IEEE754 is a value-based system, with a finite number of distinct NaN payloads.
2. Python is an object-based system. In addition to their payload, NaN objects are further distinguished by their identity (infinite in theory, in practice limited by available memory).
3. We can still technically be conformant with IEEE754 even if we say that a given NaN object is equivalent to itself, but not to other NaN objects with the same payload.

Unfortunately, this still doesn't play well with serialisation, which assumes that the identity of float objects doesn't matter:
>>> import pickle
>>> nan = float('nan')
>>> x = [nan, nan]
>>> x[0] is x[1]
True
>>> y = pickle.loads(pickle.dumps(x))
>>> y
[nan, nan]
>>> y[0] is y[1]
False
Contrast that with the handling of lists, where identity is known to be significant:
>>> x = [[]]*2
>>> x[0] is x[1]
True
>>> y = pickle.loads(pickle.dumps(x))
>>> y
[[], []]
>>> y[0] is y[1]
True
I'd say I've definitely come around to being +0 on the idea of making the float() and decimal.Decimal() __eq__ definitions reflexive, but doing so does have implications when it comes to the ability to accurately save and restore application state. It isn't as simple as just adding "if self is other: return True" to the respective __eq__ implementations. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Thu, Apr 28, 2011 at 4:40 PM, Nick Coghlan
Pondering the NaN problem further, I think we can relatively easily argue that reflexive behaviour at the object level fits within the scope of IEEE754.
Now we're talking. :-)
1. IEEE754 is a value-based system, with a finite number of distinct NaN payloads.
2. Python is an object-based system. In addition to their payload, NaN objects are further distinguished by their identity (infinite in theory, in practice limited by available memory).
3. We can still technically be conformant with IEEE754 even if we say that a given NaN object is equivalent to itself, but not to other NaN objects with the same payload.
Unfortunately, this still doesn't play well with serialisation, which assumes that the identity of float objects doesn't matter:
>>> import pickle
>>> nan = float('nan')
>>> x = [nan, nan]
>>> x[0] is x[1]
True
>>> y = pickle.loads(pickle.dumps(x))
>>> y
[nan, nan]
>>> y[0] is y[1]
False
Contrast that with the handling of lists, where identity is known to be significant:
>>> x = [[]]*2
>>> x[0] is x[1]
True
>>> y = pickle.loads(pickle.dumps(x))
>>> y
[[], []]
>>> y[0] is y[1]
True
Probably wouldn't kill us if fixed pickle to take object identity into account for floats whose value is nan. (Fortunately for 3rd party types pickle always preserves identity.)
I'd say I've definitely come around to being +0 on the idea of making the float() and decimal.Decimal() __eq__ definitions reflexive, but doing so does have implications when it comes to the ability to accurately save and restore application state. It isn't as simple as just adding "if self is other: return True" to the respective __eq__ implementations.
But it seems pickle is *already* broken, so that can't really be an argument against the proposed change, right? -- --Guido van Rossum (python.org/~guido)
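The fix suggested here — making pickle respect the identity of NaN floats — could be prototyped with pickle's documented persistent-ID hooks. This is a sketch under the assumption that routing NaNs through persistent_id is acceptable; it is not how CPython's pickle actually behaves:

```python
import io
import math
import pickle

class NaNPickler(pickle.Pickler):
    """Hypothetical pickler that preserves NaN object identity by
    using each NaN object's id() as its persistent ID."""
    def persistent_id(self, obj):
        if isinstance(obj, float) and math.isnan(obj):
            return id(obj)  # valid while the object is alive during pickling
        return None  # everything else is pickled normally

class NaNUnpickler(pickle.Unpickler):
    def __init__(self, file):
        super().__init__(file)
        self._nans = {}  # one recreated NaN object per original identity
    def persistent_load(self, pid):
        if pid not in self._nans:
            self._nans[pid] = float('nan')
        return self._nans[pid]

nan = float('nan')
buf = io.BytesIO()
NaNPickler(buf).dump([nan, nan])
buf.seek(0)
y = NaNUnpickler(buf).load()
assert y[0] is y[1]  # identity relationship survives, unlike plain pickle
```

This only preserves identity relationships within one pickling pass, which is exactly the property the [nan, nan] example above loses.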
On Thu, Apr 28, 2011 at 7:47 PM, Guido van Rossum
On Thu, Apr 28, 2011 at 4:40 PM, Nick Coghlan
wrote: Pondering the NaN problem further, I think we can relatively easily argue that reflexive behaviour at the object level fits within the scope of IEEE754.
Now we're talking. :-)
Note that Kahan is very critical of Java's approach, but NaN objects' comparison is not on his list of Java warts: http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf
On 4/28/2011 4:40 PM, Nick Coghlan wrote:
Hmm, true. And things like count() and index() would still be thoroughly broken for sequences. OK, so scratch that idea - there's simply no sane way to handle such objects without using an identity-based container that ignores equality definitions altogether.
And the problem with that is that not all values are interned, to share a single identity per value, correct? On the other hand, proliferation of float objects containing NaN "works", thus so would proliferation of non-float objects of the same value... but "works" would have a different meaning when there could be multiple identities of 6,981,433 in the same set.

But this does bring up an interesting enough point to cause me to rejoin the conversation: Would it be reasonable to implement 3 types of containers:

1) using __eq__ (would not use identity comparison optimization)
2) using is (the case you describe above)
3) the status quo: is or __eq__

The first two would require an explicit constructor call because the syntax would be retained for case 3 for backward compatibility. Heavy users of NaN and other similar values might find case 1 useful, although they would need to be careful with mappings and sets. Heavy users of NumPy and other similar structures might find case 2 useful. Offering the choice, and documenting the alternatives, may make a lot more programmers choose the proper comparison operations, and less likely to overlook or pooh-pooh the issue with the thought that it won't happen to their program anyway...
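Option 2 above — a container that uses only `is` — could be sketched as a small wrapper class (IdentityList is hypothetical, not a stdlib type):

```python
class IdentityList:
    """Sketch of an identity-only container: membership and index
    use 'is', ignoring __eq__ entirely."""
    def __init__(self, items=()):
        self._items = list(items)
    def append(self, item):
        self._items.append(item)
    def __contains__(self, item):
        return any(item is x for x in self._items)
    def index(self, item):
        for i, x in enumerate(self._items):
            if item is x:
                return i
        raise ValueError("item not in IdentityList")

nan = float('nan')
il = IdentityList([1.0, nan])
assert nan in il                 # the same NaN object is found
assert il.index(nan) == 1
assert float('nan') not in il    # a different NaN object is not
```

Such a container would also sidestep the non-bool __eq__ problem of NumPy arrays, since __eq__ is never called.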
On Thu, Apr 28, 2011 at 5:24 PM, Glenn Linderman
Would it be reasonable to implement 3 types of containers:
That's something for python-ideas. Occasionally containers that use custom comparisons come in handy (e.g. case-insensitive dicts). -- --Guido van Rossum (python.org/~guido)
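A case-insensitive dict of the kind mentioned here can be sketched by folding keys on the way in (a minimal illustration; a production version would also handle the constructor and update()):

```python
class CaseInsensitiveDict(dict):
    """Minimal sketch of a container with a custom comparison rule:
    string keys are matched case-insensitively by folding them."""
    @staticmethod
    def _fold(key):
        return key.casefold() if isinstance(key, str) else key
    def __setitem__(self, key, value):
        dict.__setitem__(self, self._fold(key), value)
    def __getitem__(self, key):
        return dict.__getitem__(self, self._fold(key))
    def __contains__(self, key):
        return dict.__contains__(self, self._fold(key))

d = CaseInsensitiveDict()
d['Content-Type'] = 'text/plain'
assert d['content-type'] == 'text/plain'
assert 'CONTENT-TYPE' in d
```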
Nick Coghlan wrote:
1. IEEE754 is a value-based system, with a finite number of distinct NaN payloads.
2. Python is an object-based system. In addition to their payload, NaN objects are further distinguished by their identity (infinite in theory, in practice limited by available memory).
I argue that's an implementation detail that makes no difference. NANs should compare unequal, including to themselves. That's the clear intention of IEEE-754. There's no exception made for "unless y is another name for x". If there was, NANs would be reflexive, and we wouldn't be having this discussion, but the non-reflexivity of NANs is intended behaviour. The clear equivalent to object identity in value-languages is memory location. If you compare variable x to the same x, IEEE754 says you should get False. Consider:

# Two distinct NANs are calculated somewhere...
x = float('nan')
y = float('nan')

# They find themselves in some data in some arbitrary place
seq = [1, 2, x, y]
random.shuffle(seq)

# And later x is compared to some arbitrary element in the data
if math.isnan(x):
    if x == seq[0]:
        print("Definitely not a NAN")

nan != x is an important invariant; breaking it just makes NANs more complicated and less useful. Tests will need to be written "if x == y and not math.isnan(x)" to avoid getting the wrong result for NANs. I don't see what the problem is that we're trying to fix. If containers wish to define container equality as taking identity into account, good for the container. Document it and move on, but please don't touch floats. -- Steven
On 4/28/11 6:13 PM, Guido van Rossum wrote:
On Thu, Apr 28, 2011 at 4:04 PM, Nick Coghlan
wrote: On Fri, Apr 29, 2011 at 8:55 AM, Guido van Rossum
wrote: Sorry, we'll have to make an exception for those of course. This will somewhat complicate the interpretation of well-behaved, because those are *not* well-behaved as far as containers go (both dict key lookup and list membership are affected) but it is not a bug -- however it is a bug to put these in containers and then use container comparisons on the container.
That's a point in favour of the status quo, though - with the burden of enforcing reflexivity placed on the containers, types are free to make use of rich comparisons to return more than just simple True/False results.
Possibly. Though for types that *do* return True/False, NaN's behavior can still be infuriating.
I hadn't really thought about it that way before this discussion - it is the identity checking behaviour of the builtin containers that lets us sensibly handle cases like sets of NumPy arrays.
But do they? For non-empty arrays, __eq__ will always return something that is considered true,
Actually, numpy.ndarray.__nonzero__() raises an exception. We've decided that there are no good conventions for deciding whether an array should be considered True or False that won't mislead people. It's quite astonishing how many people will just test "if x == y:" or "if x != y:" for a single set of inputs and "confirm" their guess as to the general rule from that. But your point stands, numpy arrays cannot be members of sets or keys of dicts or orderable/"in-able" elements of lists. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Nick Coghlan wrote:
I hadn't really thought about it that way before this discussion - it is the identity checking behaviour of the builtin containers that lets us sensibly handle cases like sets of NumPy arrays.
Except that it doesn't:
>>> from numpy import array
>>> a1 = array([1,2])
>>> a2 = array([3,4])
>>> s = set([a1, a2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'
Lists aren't trouble-free either:
>>> lst = [a1, a2]
>>> a2 in lst
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
-- Greg
On Fri, Apr 29, 2011 at 2:55 AM, Guido van Rossum
Raymond strongly believes that containers must be allowed to use the modified definition, I believe purely for performance reasons. (Without this rule, a list or tuple could not even cut short being compared to *itself*.) It seems you are in that camp too.
I'm a fan of the status quo, but not just for performance reasons - there is quite a bit of set theory that breaks once you allow non-reflexive equality*, so it makes sense to me to make it official that containers should ignore any non-reflexivity they come across. *To all the mathematicians in the audience yelling at their screens that the very idea of "non-reflexive equality" is an oxymoron... yes, I know :P Cheers, Nick. P.S. It's hard to explain the slightly odd point of view that seeing standard arithmetic constructed from Peano's Axioms and set theory can give you on discussions like this. It's a seriously different (and strange) way of thinking about the basic arithmetic constructs we normally take for granted, though :) -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Thu, Apr 28, 2011 at 12:24 AM, Guido van Rossum
So do new masks get created when the outcome of an elementwise operation is a NaN? Because that's the only reason why one should have NaNs in one's data in the first place.
If this is the case, why does Python almost never produce NaNs, as the IEEE standard prescribes?
>>> 0.0/0.0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division
-- not to indicate missing values!
Sometimes you don't have a choice. For example when you data comes from a database that uses NaNs for missing values.
Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior?
That's probably as good an idea as anything. The weirdness of NaNs is supposed to ensure that they propagate through a computation as a kind of exception signal. But to make that work properly, comparing two NaNs should really give you a NaB (Not a Boolean). As long as we're not doing that, we might as well treat NaNs sanely as Python objects. -- Greg
Greg Ewing wrote:
Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior?
That's probably as good an idea as anything.
The weirdness of NaNs is supposed to ensure that they propagate through a computation as a kind of exception signal. But to make that work properly, comparing two NaNs should really give you a NaB (Not a Boolean). As long as we're not doing that, we might as well treat NaNs sanely as Python objects.
That doesn't follow. You can compare NANs, and the result of the comparisons is perfectly well defined as either True or False. There's no need for a NAB comparison flag. -- Steven
Steven D'Aprano wrote:
You can compare NANs, and the result of the comparisons is perfectly well defined as either True or False.
But it's *arbitrarily* defined, and it's far from clear that the definition chosen is useful in any way. If you perform a computation and get a NaN as the result, you know that something went wrong at some point. But if you subject that NaN to a comparison, your code takes some arbitrarily-chosen branch and produces a result that may look plausible but is almost certainly wrong. The Pythonic thing to do (in the Python 3 world at least) would be to regard NaNs as non-comparable and raise an exception. -- Greg
On Thu, Apr 28, 2011 at 1:40 AM, Greg Ewing
The Pythonic thing to do (in the Python 3 world at least) would be to regard NaNs as non-comparable and raise an exception.
As I mentioned in a previous post, I agree in case of <, <=, >, or >= comparisons, but == and != are a harder case because you don't want, for example:
>>> [1, 2, float('nan'), 3].index(3)
3
to raise an exception.
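The breakage being avoided can be made concrete with a hypothetical NaN-like object whose == raises: list.index compares each element to the target, so the raising element is hit before the match is ever found.

```python
class RaisingNaN:
    """Hypothetical stand-in for a NaN whose equality comparison raises."""
    def __eq__(self, other):
        raise ArithmeticError("equality comparison involving NaN")
    __hash__ = object.__hash__

data = [1, 2, RaisingNaN(), 3]
try:
    data.index(3)  # compares 1 == 3, 2 == 3, then RaisingNaN() == 3 ...
except ArithmeticError:
    print("index(3) never reached the 3")  # ... which raises before the match
```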
Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior?
This doesn't solve the broader problem that *any* type might deliberately define non-reflexive equality, and therefore people will still be surprised by
>>> x = SomeObject()
>>> x == x
False
>>> [x] == [x]
True
The "problem" (if it is a problem) here is list, not NANs. Please don't break NANs to not-fix a problem with list. Since we can't (can we?) prohibit non-reflexivity, and even if we can, we shouldn't, reasonable solutions are:

(1) live with the fact that lists and other built-in containers will short-cut equality with identity for speed, ignoring __eq__;

(2) slow containers down by guaranteeing that they will use __eq__ (but how much will it actually hurt performance in real-world cases? and this will have the side-effect that non-reflexivity will propagate to containers);

(3) allow types to register that they are non-reflexive, allowing containers to skip the identity shortcut when necessary (but it is not clear to me that the extra complexity will be worth the cost).

My vote is the status quo, (1). -- Steven
On 4/27/2011 5:05 PM, Steven D'Aprano wrote:
(2) slow containers down by guaranteeing that they will use __eq__;
(but how much will it actually hurt performance for real-world cases? and this will have the side-effect that non-reflexivity will propagate to containers)
I think it is perfectly reasonable that containers containing items with non-reflexive equality should sometimes have non-reflexive equality also (depending on the placement of the item in the container, and the values of other items, the non-reflexive equality of an internal item may or may not actually affect the equality of the container in practice).

I quoted the docs for tuple and list comparisons in a different part of this thread, and for those types the docs are very clear that the items must compare equal for the lists or tuples to compare equal. For other built-in types, the docs are less clear:

* Mappings (dictionaries) compare equal if and only if they have the same (key, value) pairs. Order comparisons ('<', '<=', '>=', '>') raise TypeError.

So we can immediately conclude that mappings do not provide an ordering for sorts. But the language "same (key, value) pairs" implies identity comparisons, rather than equality comparisons. In practice, equality is used sometimes, and identity sometimes:
>>> nan = float('NaN')
>>> d1 = {'a': 1, nan: 2}
>>> d2 = {'a': 1, nan: 2.0}
>>> d1 == d2
True
>>> 2 is 2.0
False
The nan keys are being compared using identity, 2 and 2.0 by equality. While that may be clear to those of you who know the implementation (and have even described it somewhat in this thread), it is certainly not clear in the docs. And I think it should read much more like lists and tuples... "if all the (key, value) pairs, considered as tuples, are equal".

* Sets and frozensets define comparison operators to mean subset and superset tests. Those relations do not define total orderings (the two sets {1,2} and {2,3} are not equal, nor subsets of one another, nor supersets of one another). Accordingly, sets are not appropriate arguments for functions which depend on total ordering. For example, min(), max(), and sorted() produce undefined results given a list of sets as inputs.

This clearly talks about sets and subsets, but it doesn't define those concepts well in this section. It should refer to where that concept is defined, perhaps. The intuitive definition of "subset" to me is: if, for every item in set A, an equal item is found in set B, then set A is a subset of set B. That's what I learned back in math classes. Since NaN is not equal to NaN, however, I would not expect a set containing NaN to compare equal to any other set.
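That intuitive, equality-based subset definition can be contrasted with what CPython's sets actually do when the same NaN object appears on both sides:

```python
nan = float('nan')
s1 = {nan}
s2 = {nan}  # the *same* NaN object in both sets

assert s1 == s2                       # identity shortcut: the sets are equal...
assert s1 <= s2 and s1 >= s2          # ...and each is a subset of the other
assert not any(x == nan for x in s1)  # even though no element equals nan
```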
On 4/27/2011 6:15 PM, Glenn Linderman wrote:
I think it is perfectly reasonable that containers containing items with non-reflexive equality should sometimes have non-reflexive equality also (whether the non-reflexive equality of an internal item actually affects the equality of the container in practice depends on the placement of the item in the container and on the values of the other items).
Pardon me, please ignore the parenthetical statement... it was really inspired by inequality comparisons, not equality comparisons.
I am not a specialist in this area (although I call myself a mathematician). But they say that sometimes the outsider sees most of the game, or, more likely, that sometimes the idiot's point of view is useful.

To me the idea of non-reflexive equality (an object not being equal to itself) is abhorrent. Nothing is more likely to put off new Python users if they happen to run into it. And I bet even very experienced programmers will be tripped up by it a good proportion of the time they hit it.

Basically it's deferring to a wart, of dubious value, in floating point calculations and/or the IEEE 754 standard, and allowing it to become a monstrous carbuncle disfiguring the whole language. I think implementations of equal/not-equal which make equality non-reflexive (and thus break "identity implies equality") should be considered broken.

On 27/04/2011 15:53, Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior? Right on, Guido. (A pity that a lot of people don't seem to be listening.)
On 27/04/2011 17:05, Isaac Morland wrote:
Python could also provide IEEE-754 equality as a function (perhaps in "math"), something like:
def ieee_equal(a, b):
    return a == b and not isnan(a) and not isnan(b)
Of course, the definition of math.isnan cannot then be by checking its argument by comparison with itself - it would have to check the appropriate bits of the float representation.

Quite. If atypical behaviour is required in specialised areas, it can be coded for. (Same goes for specialised functions for comparing lists, dictionaries etc. in non-standard ways. Forced explicit is better than well-hidden implicit.)

Damn right - a really dirty trick if ever I saw one (not even proof against the introduction of new objects which also have the same perverse non-reflexive equality). So it should.
On 28/04/2011 11:11, Nick Coghlan wrote:

After all, why discard centuries of mathematical experience based on a decision that the IEEE 754 committee can't clearly recall the rationale for, and didn't clearly document?

Sorry Nick, I have quoted you out of context - you WEREN'T arguing for the same point of view. But you express it much better than I could.

It occurred to me that the very length of this thread [so far!] perfectly illustrates how controversial non-reflexive "equality" is. (BTW I have read, if not understood, every post to this thread and will continue to read them all.) And then I came across:

On 28/04/2011 09:43, Alexander Belopolsky wrote:
If nothing else, annual reoccurrence of long threads on this topic is a reason enough to reconsider which standard to follow. Aha, so this is a regular occurrence, is it? 'Nuff said!
Best wishes Rob Cliffe
Rob Cliffe wrote:
To me the idea of non-reflexive equality (an object not being equal to itself) is abhorrent. Nothing is more likely to put off new Python users if they happen to run into it.
I believe that's a gross exaggeration. In any case, that's just your opinion, and Python is hardly the only language that supports (at least partially) NANs. Besides, floats have all sorts of unintuitive properties that go against properties of real numbers, and new users manage to cope. With floats, even ignoring NANs, you can't assume:

a*(b+c) == a*b + a*c
a+b+c == c+b+a
1.0/x*x == 1
x+y-x == y
x+1 > x

or many other properties of real numbers. In real code, the lack of reflexivity for NANs is just not that important. You can program for *years* without once accidentally stumbling over one, whereas you can't do the simplest floating point calculation without stubbing your toes on things like this:
>>> 1.0/10
0.10000000000000001
Search the archives of the python-list@python.org mailing list. You will find regular questions from newbies similar to "Why doesn't Python calculate 1/10 correctly, is it broken?" (Except that most of the time they don't *ask* if it's broken, they just declare that it is.) Compared to that, which is concrete and obvious and frequent, NANs are usually rare and mild. The fact is, NANs are useful. Less useful in Python, which goes out of the way to avoid generating them (a pity, in my opinion), but still useful.
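The 1.0/10 surprise can be made precise with the standard library: the stored double is not exactly 1/10 (a minimal sketch):

```python
from decimal import Decimal
from fractions import Fraction

# The double nearest to 0.1 is slightly larger than 1/10:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# The stored value is a dyadic rational, not 1/10:
print(Fraction(0.1) == Fraction(1, 10))   # False
```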
Basically it's deferring to a wart, of dubious value, in floating point calculations and/or the IEEE754 standard, and allowing it to become a monstrous carbuncle disfiguring the whole language.
A ridiculous over-reaction. How long have you been programming in Python? Months? Years? If the language was "disfigured" by a "monstrous carbuncle", you haven't noticed until now.
I think implementations of equal/not-equal which make equality non-reflexive (and thus break "identity implies equality") should be considered broken.
Then Python is broken by design, because by design *all* rich comparison methods can do anything.
On 27/04/2011 15:53, Guido van Rossum wrote:
Maybe we should just call off the odd NaN comparison behavior? Right on, Guido. (A pity that a lot of people don't seem to be listening.)
Oh we're listening. Some of us are just *disagreeing*. -- Steven
On 28/04/2011 18:26, Steven D'Aprano wrote:
Rob Cliffe wrote:
To me the idea of non-reflexive equality (an object not being equal to itself) is abhorrent. Nothing is more likely to put off new Python users if they happen to run into it.
I believe that's a gross exaggeration. In any case, that's just your opinion, and Python is hardly the only language that supports (at least partially) NANs.
Besides, floats have all sorts of unintuitive properties that go against properties of real numbers, and new users manage to cope.
With floats, even ignoring NANs, you can't assume:
a*(b+c) == a*b + a*c
a+b+c == c+b+a
1.0/x*x == 1
x+y-x == y
x+1 > x
or many other properties of real numbers. In real code, the lack of reflexivity for NANs is just not that important. You can program for *years* without once accidentally stumbling over one, whereas you can't do the simplest floating point calculation without stubbing your toes on things like this:
>>> 1.0/10
0.10000000000000001
Of course, these are inevitable consequences of floating-point representation. Inevitable in just about *any* language.
The fact is, NANs are useful. Less useful in Python, which goes out of the way to avoid generating them (a pity, in my opinion), but still useful.
I am not arguing against the use of NANs. Or even against different NANs not being equal to each other. What I was arguing about was the behaviour of Python objects that represent NANs, specifically in allowing x == x to be False, something which is *not* inevitable but a choice of language design or usage. Rob Cliffe
On Wed, Apr 27, 2011 at 10:37 AM, Hrvoje Niksic wrote:
The other day I was surprised to learn this:
>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]
True    # also True in tuples, dicts, etc.
That one surprises me a bit too: I knew we were using identity-then-equality checks for containment (nan in [nan]), but I hadn't realised identity-then-equality was also used for the item-by-item comparisons when comparing two lists. It's defensible, though: [nan] == [nan] should presumably produce the same result as {nan} == {nan}, and the latter is a test that's arguably based on containment (for sets s and t, s == t if each element of s is in t, and vice versa). I don't think any of this should change. It seems to me that we've currently got something approaching the best approximation to consistency and sanity achievable, given the fundamental incompatibility of (1) nan breaking reflexivity of equality and (2) containment being based on equality. That incompatibility is bound to create inconsistencies somewhere along the line. Declaring that 'nan == nan' should be True seems attractive in theory, but I agree that it doesn't really seem like a realistic option in terms of backwards compatibility and compatibility with other mainstream languages. Mark
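Mark's consistency argument (that list equality, set equality and containment should agree with each other) can be checked interactively; a minimal sketch of current CPython behaviour:

```python
nan = float('nan')

print(nan == nan)      # False: reflexivity is broken at the float level
print([nan] == [nan])  # True: the identity shortcut restores it in lists
print({nan} == {nan})  # True: consistent with the set/containment view
print(nan in [nan])    # True
print(nan in {nan})    # True
```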
On 4/27/2011 2:04 PM, Mark Dickinson wrote:
On Wed, Apr 27, 2011 at 10:37 AM, Hrvoje Niksic wrote:

The other day I was surprised to learn this:

>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]
True    # also True in tuples, dicts, etc.

That one surprises me a bit too: I knew we were using identity-then-equality checks for containment (nan in [nan]), but I hadn't realised identity-then-equality was also used for the item-by-item comparisons when comparing two lists. It's defensible, though: [nan] == [nan] should presumably produce the same result as {nan} == {nan}, and the latter is a test that's arguably based on containment (for sets s and t, s == t if each element of s is in t, and vice versa).

I don't think any of this should change. It seems to me that we've currently got something approaching the best approximation to consistency and sanity achievable, given the fundamental incompatibility of (1) nan breaking reflexivity of equality and (2) containment being based on equality. That incompatibility is bound to create inconsistencies somewhere along the line.
Declaring that 'nan == nan' should be True seems attractive in theory, but I agree that it doesn't really seem like a realistic option in terms of backwards compatibility and compatibility with other mainstream languages.
I think it should change. Inserting a NaN, even the same instance of NaN, into a list shouldn't suddenly make it compare equal to itself, especially since the docs (section 5.9. Comparisons) say:

* Tuples and lists are compared lexicographically using comparison of corresponding elements. This means that to compare equal, each element must compare equal and the two sequences must be of the same type and have the same length. If not equal, the sequences are ordered the same as their first differing elements. For example, [1,2,x] <= [1,2,y] has the same value as x <= y. If the corresponding element does not exist, the shorter sequence is ordered first (for example, [1,2] < [1,2,3]).

The principle of least surprise says that if two unequal items are inserted into otherwise equal lists, the lists should be unequal. NaN is unequal to itself.
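The mismatch between the documented element-wise rule and the implemented behaviour is easy to demonstrate (a sketch; `elementwise` below is just a transcription of the quoted documentation, not a real API):

```python
nan = float('nan')
a, b = [1, 2, nan], [1, 2, nan]

# The documented rule: equal iff corresponding elements compare equal.
elementwise = len(a) == len(b) and all(x == y for x, y in zip(a, b))

print(elementwise)   # False, because nan == nan is False
print(a == b)        # True, because of the identity-then-equality shortcut
```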
Mark Dickinson wrote:
On Wed, Apr 27, 2011 at 10:37 AM, Hrvoje Niksic wrote:

The other day I was surprised to learn this:
>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]
True    # also True in tuples, dicts, etc.
That one surprises me a bit too: I knew we were using identity-then-equality checks for containment (nan in [nan]), but I hadn't realised identity-then-equality was also used for the item-by-item comparisons when comparing two lists. It's defensible, though: [nan] == [nan] should presumably produce the same result as {nan} == {nan}, and the latter is a test that's arguably based on containment (for sets s and t, s == t if each element of s is in t, and vice versa).
I don't think any of this should change. It seems to me that we've currently got something approaching the best approximation to consistency and sanity achievable, given the fundamental incompatibility of (1) nan breaking reflexivity of equality and (2) containment being based on equality. That incompatibility is bound to create inconsistencies somewhere along the line.
Declaring that 'nan == nan' should be True seems attractive in theory, but I agree that it doesn't really seem like a realistic option in terms of backwards compatibility and compatibility with other mainstream languages.
Totally out of my depth, but what if a NaN object was allowed to compare equal to itself, but different NaN objects still compared unequal? If NaN were a singleton then the current behavior would make more sense, but since we get a new NaN with each instance creation, is there really a good reason why the same NaN can't be equal to itself? ~Ethan~
On 4/27/2011 6:11 PM, Ethan Furman wrote:
Mark Dickinson wrote:
On Wed, Apr 27, 2011 at 10:37 AM, Hrvoje Niksic wrote:

The other day I was surprised to learn this:
>>> nan = float('nan')
>>> nan == nan
False
>>> [nan] == [nan]
True    # also True in tuples, dicts, etc.
That one surprises me a bit too: I knew we were using identity-then-equality checks for containment (nan in [nan]), but I hadn't realised identity-then-equality was also used for the item-by-item comparisons when comparing two lists. It's defensible, though: [nan] == [nan] should presumably produce the same result as {nan} == {nan}, and the latter is a test that's arguably based on containment (for sets s and t, s == t if each element of s is in t, and vice versa).
I don't think any of this should change. It seems to me that we've currently got something approaching the best approximation to consistency and sanity achievable, given the fundamental incompatibility of (1) nan breaking reflexivity of equality and (2) containment being based on equality. That incompatibility is bound to create inconsistencies somewhere along the line.
Declaring that 'nan == nan' should be True seems attractive in theory, but I agree that it doesn't really seem like a realistic option in terms of backwards compatibility and compatibility with other mainstream languages.
Totally out of my depth, but what if a NaN object was allowed to compare equal to itself, but different NaN objects still compared unequal? If NaN were a singleton then the current behavior would make more sense, but since we get a new NaN with each instance creation, is there really a good reason why the same NaN can't be equal to itself?
>>> n1 = float('NaN')
>>> n2 = float('NaN')
>>> n3 = n1
>>> n1
nan
>>> n2
nan
>>> n3
nan
>>> [n1] == [n2]
False
>>> [n1] == [n3]
True
This is the current situation: some NaNs compare equal sometimes, and some don't. And unless you are particularly aware of the identity of the object containing the NaN (not the list, but the particular NaN value) it is surprising and confusing, because the mathematical definition of NaN is that it should not be equal to itself.
Glenn Linderman writes:
On 4/27/2011 6:11 PM, Ethan Furman wrote:
Totally out of my depth, but what if a NaN object was allowed to compare equal to itself, but different NaN objects still compared unequal? If NaN were a singleton then the current behavior would make more sense, but since we get a new NaN with each instance creation, is there really a good reason why the same NaN can't be equal to itself?
Yes. A NaN is a special object that means "the computation that produced this object is undefined." For example, consider the computation 1/x at x = 0. If you approach from the left, 1/0 "obviously" means minus infinity, while if you approach from the right just as obviously it means plus infinity. So what does the 1/0 that occurs in [1/x for x in range(-5, 6)] mean? In what sense is it "equal to itself"? How can something which is not a number be compared for numerical equality?
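For what it's worth, Python itself refuses to produce the NaN in this example; you normally only get one by asking for it explicitly (a minimal sketch):

```python
try:
    [1 / x for x in range(-5, 6)]
except ZeroDivisionError as exc:
    # Python raises here instead of silently yielding inf or NaN.
    print("refused:", exc)

# NaN arises from explicit construction or from operations on infinities:
inf = float('inf')
nan = inf - inf
print(nan, nan == nan)   # nan False
```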
>>> n1 = float('NaN')
>>> n2 = float('NaN')
>>> n3 = n1
>>> n1
nan
>>> n2
nan
>>> n3
nan
>>> [n1] == [n2]
False
>>> [n1] == [n3]
True
This is the current situation: some NaNs compare equal sometimes, and some don't.
No, Ethan is asking for "n1 == n3" => True. As Mark points out, "[n1] == [n3]" can be interpreted as a containment question, rather than an equality question, with respect to the NaNs themselves. In standard set theory, these are the same question, but that's not necessarily so in other set-like toposes. In particular, getting equality and set membership to behave reasonably with respect to each other is one of the problems faced in developing a workable theory of fuzzy sets. I don't think it matters what behavior you choose for NaNs; somebody is going to be unhappy sometimes.
On 4/27/2011 8:06 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:
On 4/27/2011 6:11 PM, Ethan Furman wrote:
Totally out of my depth, but what if a NaN object was allowed to compare equal to itself, but different NaN objects still compared unequal? If NaN were a singleton then the current behavior would make more sense, but since we get a new NaN with each instance creation, is there really a good reason why the same NaN can't be equal to itself?
Yes. A NaN is a special object that means "the computation that produced this object is undefined." For example, consider the computation 1/x at x = 0. If you approach from the left, 1/0 "obviously" means minus infinity, while if you approach from the right just as obviously it means plus infinity. So what does the 1/0 that occurs in [1/x for x in range(-5, 6)] mean? In what sense is it "equal to itself"? How can something which is not a number be compared for numerical equality?
>>> n1 = float('NaN')
>>> n2 = float('NaN')
>>> n3 = n1
>>> n1
nan
>>> n2
nan
>>> n3
nan
>>> [n1] == [n2]
False
>>> [n1] == [n3]
True
This is the current situation: some NaNs compare equal sometimes, and some don't.
No, Ethan is asking for "n1 == n3" => True. As Mark points out, "[n1] == [n3]" can be interpreted as a containment question, rather than an equality question, with respect to the NaNs themselves.
It _can_ be interpreted as a containment question, but doing so is contrary to the documentation of Python list comparison, which presently doesn't match the implementation. The intuitive definition of equality of lists is that each member is equal. The presence of NaN destroys intuition of people that don't expect them to be as different from numbers as they actually are, but for people that understand NaNs and expect them to behave according to their definition, then the presence of a NaN in a list would be expected to cause the list to not be equal to itself, because a NaN is not equal to itself.
In standard set theory, these are the same question, but that's not necessarily so in other set-like toposes. In particular, getting equality and set membership to behave reasonably with respect to each other is one of the problems faced in developing a workable theory of fuzzy sets.
I don't think it matters what behavior you choose for NaNs; somebody is going to be unhappy sometimes.
Some people will be unhappy just because they exist in the language, so I agree :)
Stephen J. Turnbull wrote:
So what does the 1/0 that occurs in [1/x for x in range(-5, 6)] mean? In what sense is it "equal to itself"? How can something which is not a number be compared for numerical equality?
I would say it *can't* be compared for *numerical* equality. It might make sense to compare it using some other notion of equality.

One of the problems here, I think, is that Python only lets you define one notion of equality for each type, and that notion is the one that gets used when you compare collections of that type. (Or at least it's supposed to be; the identity-implies-equality shortcut that gets taken in some places interferes with that.)

So if you're going to decide that it doesn't make sense to compare undefined numeric quantities, then it doesn't make sense to compare lists containing them either. -- Greg
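One explicit workaround for the "one notion of equality per type" limitation is to pass the desired notion in as a function (a sketch of a hypothetical helper, not an existing API):

```python
import math

def list_equal(a, b, eq=lambda x, y: x == y):
    # Compare element-wise using a caller-chosen notion of equality.
    return len(a) == len(b) and all(eq(x, y) for x, y in zip(a, b))

nan = float('nan')
ieee = lambda x, y: x == y                                   # non-reflexive for NaN
bitwise = lambda x, y: x == y or (math.isnan(x) and math.isnan(y))

print(list_equal([nan], [nan], ieee))     # False
print(list_equal([nan], [nan], bitwise))  # True
```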
Mark Dickinson writes:
Declaring that 'nan == nan' should be True seems attractive in theory,
No, it's intuitively attractive, but that's because humans like nice continuous behavior. In *theory*, it's true that some singularities are removable, and the NaN that occurs when evaluating at that point is actually definable in a broader context, but the point of NaN is that some singularities are *not* removable. This is somewhat Pythonic: "In the presence of ambiguity, refuse to guess."
On Thu, Apr 28, 2011 at 12:42 PM, Stephen J. Turnbull wrote:
Mark Dickinson writes:
Declaring that 'nan == nan' should be True seems attractive in theory,
No, it's intuitively attractive, but that's because humans like nice continuous behavior. In *theory*, it's true that some singularities are removable, and the NaN that occurs when evaluating at that point is actually definable in a broader context, but the point of NaN is that some singularities are *not* removable. This is somewhat Pythonic: "In the presence of ambiguity, refuse to guess."
Refusing to guess in this case would be to treat all NaNs as signalling NaNs, and that wouldn't be good, either :) I like Terry's suggestion for a glossary entry, and have created an updated proposal at http://bugs.python.org/issue11945 (I also noted that array.array is like collections.Sequence in failing to enforce the container invariants in the presence of NaN values) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Wed, Apr 27, 2011 at 8:43 PM, Nick Coghlan wrote:
(I also noted that array.array is like collections.Sequence in failing to enforce the container invariants in the presence of NaN values)
Regardless of whether we go any further it would indeed be good to be explicit about the rules in the language reference and fix the behavior of collections.Sequence. I'm not sure about array.array -- it doesn't hold objects so I don't think there's anything to enforce. It seems to behave the same way as NumPy arrays when they don't contain objects. -- --Guido van Rossum (python.org/~guido)
On Thu, Apr 28, 2011 at 2:07 PM, Guido van Rossum wrote:
I'm not sure about array.array -- it doesn't hold objects so I don't think there's anything to enforce. It seems to behave the same way as NumPy arrays when they don't contain objects.
Yep, after reading Robert's post I realised the point about native arrays in NumPy (and the lack of "object identity" in those cases) applied equally well to the array module. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 4/27/2011 8:43 PM, Nick Coghlan wrote:
On Thu, Apr 28, 2011 at 12:42 PM, Stephen J. Turnbull wrote:

Mark Dickinson writes:
Declaring that 'nan == nan' should be True seems attractive in theory,
No, it's intuitively attractive, but that's because humans like nice continuous behavior. In *theory*, it's true that some singularities are removable, and the NaN that occurs when evaluating at that point is actually definable in a broader context, but the point of NaN is that some singularities are *not* removable. This is somewhat Pythonic: "In the presence of ambiguity, refuse to guess."

Refusing to guess in this case would be to treat all NaNs as signalling NaNs, and that wouldn't be good, either :)
I like Terry's suggestion for a glossary entry, and have created an updated proposal at http://bugs.python.org/issue11945
(I also noted that array.array is like collections.Sequence in failing to enforce the container invariants in the presence of NaN values)
In that bug, Nick, you mention that reflexive equality is something that container classes rely on in their implementation. Such reliance seems to me to be a bug, or an inappropriate optimization, rather than a necessity. I realize that classes that do not define equality use identity as their default equality operator, and that is acceptable for items that do not or cannot have any better equality operator. It does lead to the situation where two objects that are bit-for-bit clones get separate entries in a set... exactly the same as how NaNs of different identity work... the situation with a NaN of the same identity not being added to the set multiple times seems to simply be a bug because of conflating identity and equality, and should not be relied on in container implementations.
On Thu, Apr 28, 2011 at 2:20 AM, Glenn Linderman wrote:
In that bug, Nick, you mention that reflexive equality is something that container classes rely on in their implementation. Such reliance seems to me to be a bug, or an inappropriate optimization, ..
An alternative interpretation would be that it is a bug to use NaN values in lists. It is certainly nonsensical to use NaNs as keys in dictionaries and that reportedly led Java designers to forgo the nonreflexivity of nans: """ A "NaN" value is not equal to itself. However, a "NaN" Java "Float" object is equal to itself. The semantic is defined this way, because otherwise "NaN" Java "Float" objects cannot be retrieved from a hash table. """ - http://www.concentric.net/~ttwang/tech/javafloat.htm With the status quo in Python, it may only make sense to store NaNs in array.array, but not in a list.
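The hash-table problem mentioned in the Java quote exists in Python too, and the identity check is what papers over it for the "same object" case (a minimal sketch):

```python
nan = float('nan')
d = {nan: 1}

# Lookup with the identical NaN object succeeds via the identity check:
print(d[nan])   # 1

# A distinct NaN object is neither identical nor equal to the stored key,
# so the entry can never be retrieved through it:
other = float('nan')
print(other in d)   # False
```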
Alexander Belopolsky wrote:
With the status quo in Python, it may only make sense to store NaNs in array.array, but not in a list.
That's a bit extreme. It only gets you into trouble if you reason like this:
>>> a = b = [1, 2, 3, float('nan')]
>>> if a == b:
...     for x, y in zip(a, b):
...         assert x == y
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
AssertionError
But it's perfectly fine to do this:
>>> sum(a)
nan
exactly as expected. Prohibiting NANs from lists is massive overkill for a small (alleged) problem. I know thousands of words have been spilled on this, including many by myself, but I really believe this discussion is mostly bike-shedding. Given the vehemence of some replies, and the volume of talk, anyone would think that you could hardly write a line of Python code without badly tripping over problems caused by NANs. The truth is, I think, that most people will never see one in real world code, and those who are least likely to come across them are the most likely to be categorically against them. (I grant that Alexander is an exception -- I understand that he does do a lot of numeric work, and does come across NANs, and still doesn't like them one bit.) -- Steven
Steven D'Aprano wrote:
I know thousands of words have been spilled on this, including many by myself, but I really believe this discussion is mostly bike-shedding.
Hmmm... on reflection, I think I may have been a bit unfair. In particular, I don't mean any slight on any of the people who have made intelligent, insightful posts, even if I disagree with them. -- Steven
On Thu, Apr 28, 2011 at 1:25 PM, Steven D'Aprano wrote:
But it's perfectly fine to do this:
>>> sum(a)
nan
This use case reminded me Kahan's """ Were there no way to get rid of NaNs, they would be as useless as Indefinites on CRAYs; as soon as one were encountered, computation would be best stopped rather than continued for an indefinite time to an Indefinite conclusion. """ http://www.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps More often than not, you would want to sum non-NaN values instead. ..
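Summing only the non-NaN values, as Kahan's remark suggests, takes an explicit filter (a minimal sketch):

```python
import math

data = [1.0, float('nan'), 2.0]
print(sum(data))                                   # nan: one NaN poisons the total
print(sum(x for x in data if not math.isnan(x)))   # 3.0: skip NaNs explicitly
```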
(I grant that Alexander is an exception -- I understand that he does do a lot of numeric work, and does come across NANs, and still doesn't like them one bit.)
I like NaNs for high-performance calculations, but once you wrap floats individually in Python objects, performance is killed and you are better off using None instead of NaN. Python lists don't support element-wise operations, and therefore there is little gain from being able to write x + y in loops over list elements instead of ieee_add(x, y) or add_or_none(x, y) with proper definitions of these functions. On the other hand, __eq__ gets invoked implicitly in cases where you don't have access to the loop. Your only choice is to filter your data before invoking such operations.
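The add_or_none above is named but not defined in the post; a plausible definition (purely hypothetical) would propagate None the way NaN propagates through float arithmetic:

```python
def add_or_none(x, y):
    # Missing data poisons the result, as NaN does for floats.
    if x is None or y is None:
        return None
    return x + y

print(add_or_none(1.0, 2.0))   # 3.0
print(add_or_none(1.0, None))  # None
```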
Steven D'Aprano writes:
(I grant that Alexander is an exception -- I understand that he does do a lot of numeric work, and does come across NANs, and still doesn't like them one bit.)
I don't think I'd want anybody who *likes* NaNs to have a commit bit at python.org.<shiver/>
On Thu, Apr 28, 2011 at 4:20 PM, Glenn Linderman wrote:
In that bug, Nick, you mention that reflexive equality is something that container classes rely on in their implementation. Such reliance seems to me to be a bug, or an inappropriate optimization, rather than a necessity. I realize that classes that do not define equality use identity as their default equality operator, and that is acceptable for items that do not or cannot have any better equality operator. It does lead to the situation where two objects that are bit-for-bit clones get separate entries in a set... exactly the same as how NaNs of different identity work... the situation with a NaN of the same identity not being added to the set multiple times seems to simply be a bug because of conflating identity and equality, and should not be relied on in container implementations.
No, as Raymond has articulated a number of times over the years, it's a property of the equivalence relation that is needed in order to present sane invariants to users of the container. I included in the bug report the critical invariants I am currently aware of that should hold, even when the container may hold types with a non-reflexive definition of equality:

assert [x] == [x]      # Generalised to all container types
assert not [x] != [x]  # Generalised to all container types
for x in c:
    assert x in c
    assert c.count(x) > 0               # If applicable
    assert 0 <= c.index(x) < len(c)     # If applicable

The builtin types all already work this way, and that's a deliberate choice - my proposal is simply to document the behaviour as intentional, and fix the one case I know of in the standard library where we don't implement these semantics correctly (i.e. collections.Sequence).

The question of whether or not float and decimal.Decimal should be modified to have reflexive definitions of equality (even for NaN values) is actually orthogonal to the question of clarifying and documenting the expected semantics of containers in the face of non-reflexive definitions of equality.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
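The invariants listed above can be exercised directly against a list holding a NaN (a sketch of current CPython behaviour):

```python
x = float('nan')   # non-reflexive: x != x
c = [x, 1.0, x]

assert [x] == [x]
assert not [x] != [x]
for item in c:
    assert item in c
    assert c.count(item) > 0
    assert 0 <= c.index(item) < len(c)

print("invariants hold even though x != x is", x != x)
```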
On 4/27/2011 11:54 PM, Nick Coghlan wrote:
On Thu, Apr 28, 2011 at 4:20 PM, Glenn Linderman wrote:

In that bug, Nick, you mention that reflexive equality is something that container classes rely on in their implementation. Such reliance seems to me to be a bug, or an inappropriate optimization, rather than a necessity. I realize that classes that do not define equality use identity as their default equality operator, and that is acceptable for items that do not or cannot have any better equality operator. It does lead to the situation where two objects that are bit-for-bit clones get separate entries in a set... exactly the same as how NaNs of different identity work... the situation with a NaN of the same identity not being added to the set multiple times seems to simply be a bug because of conflating identity and equality, and should not be relied on in container implementations.

No, as Raymond has articulated a number of times over the years, it's a property of the equivalence relation that is needed in order to present sane invariants to users of the container.
I probably wasn't around when Raymond did his articulation :) Sorry for whatever amount of rehashing I'm doing here -- pointers to some of the articulation would be welcome, but perhaps the summary below is intended to recap the results of such discussions. If my comments below seem to be grasping the essence of those discussions, then no need for the pointers... if I'm way off, I'd like to read a thread or two.
I included in the bug report the critical invariants I am currently aware of that should hold, even when the container may hold types with a non-reflexive definition of equality:
assert [x] == [x]      # Generalised to all container types
assert not [x] != [x]  # Generalised to all container types
for x in c:
    assert x in c
    assert c.count(x) > 0               # If applicable
    assert 0 <= c.index(x) < len(c)     # If applicable
The builtin types all already work this way, and that's a deliberate choice - my proposal is simply to document the behaviour as intentional, and fix the one case I know of in the standard library where we don't implement these semantics correctly (i.e. collections.Sequence).
The question of whether or not float and decimal.Decimal should be modified to have reflexive definitions of equality (even for NaN values) is actually orthogonal to the question of clarifying and documenting the expected semantics of containers in the face of non-reflexive definitions of equality.
Yes, I agree they are orthogonal questions... separate answers and choices can be made for specific classes. Just as some classes implement equality using identity, it would also be possible to implement identity using equality, and it is possible to conflate the two, as has apparently been deliberately done for Python containers, without reflecting that in the documentation. If the containers have been deliberately implemented that way, and it is not appropriate to change them, then more work is needed in the documentation than just your proposed Glossary definition, as the very intuitive descriptions in the Comparisons section are quite at odds with the current implementation.

Without having read the original articulations by Raymond or any discussions of the pros and cons, it would appear that the above list of invariants, which you refer to as "sane", is derived from a "pre-NaN" or "reflexive equality" perspective; while some folk perhaps think the concept of NaN is a particular brand of insanity, it is a standard brand, and therefore worthy of understanding and discussion. And clearly, if the NaN perspective is intentionally corralled in Python, then the documentation needs to be clarified.

On the other hand, the SQL language has embraced the same concept as NaN in its concept of NULL, and has pushed that concept (they call it three-valued logic, I think) clear through the language. NULL == NULL is not True, and it is not False, but it is NULL. Of course, the language is different in other ways than Python: values are not objects and have no identity, but there are collections of values called tuples, columns, and tables, which are similar to lists and lists of lists. And there are mappings called indexes. And they've made it all work with the concept of NULL and three-valued logic. And sane people work with database systems built around such concepts. So I guess I reject the argument that the above invariants are required for sanity.
On the other hand, not having much Python internals knowledge as yet, I'm in no position to know how seriously things would break internally if a different set of invariants, embracing and extending the concept of non-reflexive equality, were invented to replace the above, nor whether there is a compatible migration path to achieve it in a reasonable manner... from __future__ import NaNsanity ... :)
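[Editorial note: the SQL three-valued logic described above can be observed from Python itself via the standard-library sqlite3 module. A minimal sketch; any SQL engine would behave similarly:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL evaluates to NULL (surfaced in Python as None), not True or False
row_eq = conn.execute("SELECT NULL = NULL").fetchone()
print(row_eq)  # (None,)

# IS NULL is SQL's explicit predicate for detecting NULL
row_is = conn.execute("SELECT NULL IS NULL").fetchone()
print(row_is)  # (1,)

conn.close()
```

Note how SQL needs a dedicated `IS NULL` operator for exactly the same reason Python needs `math.isnan()`: an equality test against the special value can never succeed.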
On Thu, Apr 28, 2011 at 5:27 PM, Glenn Linderman
Without having read the original articulations by Raymond or any discussions of the pros and cons,
In my first post to this thread, I pointed out the bug tracker item (http://bugs.python.org/issue4296) that included the discussion of restoring this behaviour to the 3.x branch, after it was inadvertently removed. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Related to the discussion on "Not a Number", can I point out a few things that have not been explicitly addressed so far. The IEEE standard is about hardware and bit patterns, rather than types and values, so it may not be entirely appropriate for a high-level language like Python. NaN is *not* a number (the clue is in the name). Python treats it as if it were a number:
>>> import numbers
>>> isinstance(nan, numbers.Number)
True
Can be read as "'Not a Number' is a Number" ;)

NaN does not have to be a float or a Decimal. Perhaps it should have its own class. The default comparisons will then work as expected for collections. (No doubt, making NaN a new class will cause a whole new set of problems.)

As pointed out by Meyer: NaN == NaN is False is no more logical than NaN != NaN is False. Although both NaN == NaN and NaN != NaN could arguably be a "maybe" value, the all-important reflexivity (x == x is True) is effectively part of the language. All collections rely on it, and Python wouldn't be much use without dicts, tuples and lists.

To summarise:
- NaN is required so that floating point operations on arrays and lists do not raise unwanted exceptions.
- NaN is Not a Number (therefore it should be neither a float nor a Decimal). Making it a new class would solve some of the problems discussed, but would create new problems instead.
- Correct behaviour of collections is more important than IEEE conformance of NaN comparisons.

Mark.
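[Editorial note: the way collections rely on reflexivity can be made concrete by comparing an identical NaN object with a distinct but equally NaN-valued one. A small sketch of the behaviour under discussion:]

```python
nan1 = float('nan')
nan2 = float('nan')  # a distinct NaN object

l = [nan1]

# Identity shortcut: the very same object is "found" without __eq__ ever agreeing
print(nan1 in l)      # True
print(l.count(nan1))  # 1

# A different NaN object falls back to ==, which is always False for NaN
print(nan2 in l)      # False
print(l.count(nan2))  # 0
```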
Mark Shannon wrote:
NaN does not have to be a float or a Decimal. Perhaps it should have its own class.
Perhaps, but that wouldn't solve anything on its own. If this new class compares reflexively, then it still violates IEEE 754. Conversely, existing NaNs could be made to compare reflexively without making them a new class. -- Greg
On Thu, Apr 28, 2011 at 7:17 PM, Greg Ewing
Mark Shannon wrote:
NaN does not have to be a float or a Decimal. Perhaps it should have its own class.
Perhaps, but that wouldn't solve anything on its own. If this new class compares reflexively, then it still violates IEEE 754. Conversely, existing NaNs could be made to compare reflexively without making them a new class.
And 3rd party NaNs can still do whatever the heck they want :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Mark Shannon wrote:
Related to the discussion on "Not a Number", can I point out a few things that have not been explicitly addressed so far.
The IEEE standard is about hardware and bit patterns, rather than types and values so may not be entirely appropriate for high-level language like Python.
I would argue that the implementation of NANs is irrelevant. If NANs are useful in hardware floats -- and I think they are -- then they're just as useful as objects, or as strings in languages like REXX or Hypertalk where all data is stored as strings, or as quantum wave functions in some future quantum computer.
NaN is *not* a number (the clue is in the name). Python treats it as if it were a number:
>>> import numbers
>>> isinstance(nan, numbers.Number)
True
Can be read as "'Not a Number' is a Number" ;)
I see your wink, but what do you make of these?

>>> class NotAnObject(object):
...     pass
...
>>> nao = NotAnObject()
>>> assert isinstance(nao, object)

>>> class NotAType(object):
...     pass
...
>>> assert type(NotAType) is type
NaN does not have to be a float or a Decimal. Perhaps it should have its own class.
Others have already pointed out this won't make any difference. Fundamentally, the problem is that some containers bypass equality tests for identity tests. There may be good reasons for that shortcut, but it leads to problems with *any* object that does not define equality to be reflexive, not just NANs.
>>> class Null:
...     def __eq__(self, other):
...         return False
...
>>> null = Null()
>>> null == null
False
>>> [null] == [null]
True
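[Editorial note: the C shortcut quoted at the top of the thread can be rendered in pure Python. This is a hypothetical sketch with invented names; the real logic lives in PyObject_RichCompareBool in Objects/object.c:]

```python
def rich_compare_bool(v, w, op):
    """Sketch of CPython's identity-implies-equality shortcut."""
    if v is w:               # the quick identity check, taken before __eq__
        if op == '==':
            return True
        if op == '!=':
            return False
    # otherwise fall back to a real rich comparison
    if op == '==':
        return bool(v == w)
    if op == '!=':
        return bool(v != w)
    raise ValueError("unsupported op")

nan = float('nan')
print(rich_compare_bool(nan, nan, '=='))  # True: identity shortcut wins
print(nan == nan)                         # False: the real comparison
```

This is exactly the inconsistency Hrvoje reported: the shortcut version and the plain `==` disagree for any non-reflexive object.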
The default comparisons will then work as expected for collections. (No doubt, making NaN a new class will cause a whole new set of problems)
As pointed out by Meyer: NaN == NaN is False is no more logical than NaN != NaN is False
I don't agree with this argument. I think Meyer is completely mistaken there. The question of NAN equality is that of a vacuous truth, quite similar to the Present King of France: http://en.wikipedia.org/wiki/Present_King_of_France Meyer would have us accept that: The present King of France is a talking horse and The present King of France is not a talking horse are equally (pun not intended) valid. No, no they're not. I don't know much about who the King of France would be if France had a king, but I do know that he wouldn't be a talking horse. Once you accept that NANs aren't equal to anything else, it becomes a matter of *practicality beats purity* to accept that they can't be equal to themselves either. A NAN doesn't represent a specific thing. It's a signal that your calculation has generated an indefinite, undefined, undetermined value. NANs aren't equal to anything. The fact that a NAN happens to have an existence as a bit-pattern at some location, or as a distinct object, is an implementation detail that is irrelevant. If you just happen by some fluke to compare a NAN to "itself", that shouldn't change the result of the comparison: The present King of France is the current male sovereign who rules France is still false, even if you happen to write it like this: The present King of France is the present King of France This might seem surprising to those who are used to reflexivity. Oh well. Just because reflexivity holds for actual things, doesn't mean it holds for, er, things that aren't things. NANs are things that aren't things.
Although both NaN == NaN and NaN != NaN could arguably be a "maybe" value, the all important reflexivity (x == x is True) is effectively part of the language. All collections rely on it and Python wouldn't be much use without dicts, tuples and lists.
Perhaps they shouldn't rely on it. Identity tests are an implementation detail. But in any case, reflexivity is *not* a guarantee of Python. With rich comparisons, you can define __eq__ to do anything you like. -- Steven
Steven D'Aprano wrote:
Mark Shannon wrote:
Related to the discussion on "Not a Number", can I point out a few things that have not been explicitly addressed so far.
The IEEE standard is about hardware and bit patterns, rather than types and values so may not be entirely appropriate for high-level language like Python.
I would argue that the implementation of NANs is irrelevant. If NANs are useful in hardware floats -- and I think they are -- then they're just as equally useful as objects, or as strings in languages like REXX or Hypertalk where all data is stored as strings, or as quantum wave functions in some future quantum computer.
So, indeed, is it OK if type(NaN) != type(0.0)?
NaN is *not* a number (the clue is in the name). Python treats it as if it were a number:
>>> import numbers
>>> isinstance(nan, numbers.Number)
True
Can be read as "'Not a Number' is a Number" ;)
I see your wink, but what do you make of these?
>>> class NotAnObject(object):
...     pass
...
>>> nao = NotAnObject()
>>> assert isinstance(nao, object)
Trying to make something not an object in a language where everything is an object is bound to be problematic.
>>> class NotAType(object):
...     pass
...
>>> assert type(NotAType) is type
NaN does not have to be a float or a Decimal. Perhaps it should have its own class.
Others have already pointed out this won't make any difference.
Fundamentally, the problem is that some containers bypass equality tests for identity tests. There may be good reasons for that shortcut, but it leads to problems with *any* object that does not define equality to be reflexive, not just NANs.
>>> class Null:
...     def __eq__(self, other):
...         return False
...
>>> null = Null()
>>> null == null
False
>>> [null] == [null]
True
Just because you can do that doesn't mean you should. Equality should be reflexive; without that fundamental assumption, many non-numeric algorithms fall apart.
The default comparisons will then work as expected for collections. (No doubt, making NaN a new class will cause a whole new set of problems)
As pointed out by Meyer: NaN == NaN is False is no more logical than NaN != NaN is False
I don't agree with this argument. I think Meyer is completely mistaken there. The question of NAN equality is that of a vacuous truth, quite similar to the Present King of France:
http://en.wikipedia.org/wiki/Present_King_of_France
Meyer would have us accept that:
The present King of France is a talking horse
and
The present King of France is not a talking horse
are equally (pun not intended) valid. No, no they're not. I don't know much about who the King of France would be if France had a king, but I do know that he wouldn't be a talking horse.
Once you accept that NANs aren't equal to anything else, it becomes a matter of *practicality beats purity* to accept that they can't be equal
Not breaking a whole bunch of collections and algorithms has a certain practical appeal as well ;)
to themselves either. A NAN doesn't represent a specific thing. It's a signal that your calculation has generated an indefinite, undefined, undetermined value. NANs aren't equal to anything. The fact that a NAN happens to have an existence as a bit-pattern at some location, or as a distinct object, is an implementation detail that is irrelevant. If you just happen by some fluke to compare a NAN to "itself", that shouldn't change the result of the comparison:
The present King of France is the current male sovereign who rules France
is still false, even if you happen to write it like this:
The present King of France is the present King of France
The problem with this argument is the present King of France does not exist, whereas NaN (as a Python object) does exist. The present King of France argument only applies to non-existent things. Python objects do exist (as much as any computer language entity exists). So the expression "The present King of France" either raises an exception (non-existence) or evaluates to an object (existence). In this case "the present King of France" doesn't exist and should raise a FifthRepublicException :) inf / inf does not raise an exception, but evaluates to NaN, so NaN exists. For objects (that exist): (x is x) is True. The present President of France is the present President of France, regardless of who he or she may be.
This might seem surprising to those who are used to reflexivity. Oh well. Just because reflexivity holds for actual things, doesn't mean it holds for, er, things that aren't things. NANs are things that aren't things.
A NaN is a thing that *is* a thing; it exists: object.__repr__(float('nan'))
Of course if inf - inf, inf/inf raised exceptions, then NaN wouldn't exist (as a Python object) and the problem would just go away :) After all 0.0/0.0 already raises an exception, but the IEEE defines 0.0/0.0 as NaN.
Although both NaN == NaN and NaN != NaN could arguably be a "maybe" value, the all important reflexivity (x == x is True) is effectively part of the language. All collections rely on it and Python wouldn't be much use without dicts, tuples and lists.
Perhaps they shouldn't rely on it. Identity tests are an implementation detail. But in any case, reflexivity is *not* a guarantee of Python. With rich comparisons, you can define __eq__ to do anything you like.
And if you do define __eq__ to be non-reflexive then things will break. Should an object that breaks so much (ie NaN in its current form) be in the standard library? Perhaps we should just get rid of it?
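[Editorial note: what "breaks" with a non-reflexive __eq__ is in practice rescued by the identity shortcut rather than by __eq__ itself. A sketch with a deliberately non-reflexive class:]

```python
class NonReflexive:
    """An object that, like NaN, never compares equal to anything."""
    def __eq__(self, other):
        return False
    def __hash__(self):
        return 0

x = NonReflexive()

# Dict lookup and list membership still work -- but only via identity
d = {x: 'found'}
print(d[x])                    # 'found', despite x != x
print(x in [x])                # True, via the identity shortcut
print(NonReflexive() in [x])   # False: a distinct instance is never equal
```

Without the identity shortcut in containers, `d[x]` would raise KeyError: the object would be lost inside its own dict.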
Mark Shannon wrote:
Steven D'Aprano wrote:
Mark Shannon wrote:
Related to the discussion on "Not a Number" can I point out a few things that have not be explicitly addressed so far.
The IEEE standard is about hardware and bit patterns, rather than types and values so may not be entirely appropriate for high-level language like Python.
I would argue that the implementation of NANs is irrelevant. If NANs are useful in hardware floats -- and I think they are -- then they're just as equally useful as objects, or as strings in languages like REXX or Hypertalk where all data is stored as strings, or as quantum wave functions in some future quantum computer.
So, indeed, is it OK if type(NaN) != type(0.0)?
Sure. But that just adds complexity without actually resolving anything.
Fundamentally, the problem is that some containers bypass equality tests for identity tests. There may be good reasons for that shortcut, but it leads to problems with *any* object that does not define equality to be reflexive, not just NANs. [...] Just because you can do that, doesn't mean you should. Equality should be reflexive, without that fundamental assumption many non-numeric algorithms fall apart.
So what? If I have a need for non-reflexivity in my application, why should I care that some other algorithm, which I'm not using, will fail? Python supports non-reflexivity. If I take advantage of that feature, I can't guarantee that *other objects* will be smart enough to understand this. This is no different from any other property of my objects.
The default comparisons will then work as expected for collections. (No doubt, making NaN a new class will cause a whole new set of problems)
As pointed out by Meyer: NaN == NaN is False is no more logical than NaN != NaN is False
I don't agree with this argument. I think Meyer is completely mistaken there. The question of NAN equality is that of a vacuous truth, quite similar to the Present King of France:
http://en.wikipedia.org/wiki/Present_King_of_France [...] The problem with this argument is the present King of France does not exist, whereas NaN (as a Python object) does exist.
NANs (as Python objects) exist in the same way as the present King of France exists as words. It's an implementation detail: we can't talk about the non-existent present King of France without using words, and we can't do calculations on non-existent/indeterminate values in Python without objects. Words can represent things that don't exist, and so can bit-patterns or objects or any other symbol. We must be careful to avoid mistaking the symbol (the NAN bit-pattern or object) for the thing (the result of whatever calculation generated that NAN). The idea of equality we care about is equality of what the symbol represents, not the symbol itself. The meaning of "spam and eggs" should not differ according to the typeface we write the words in. Likewise the number 42 should not differ according to how the int object is laid out, or whether the bit-pattern is little-endian or big-endian. What matters is the "thing" itself, 42, not the symbol: it will still be 42 even if we decided to write it in Roman numerals or base 13. Likewise, what matters is the non-thingness of NANs, not the fact that the symbol for them has an existence as an object or a bit-pattern. -- Steven
On 28/04/2011 15:58, Steven D'Aprano wrote:
Fundamentally, the problem is that some containers bypass equality tests for identity tests. There may be good reasons for that shortcut, but it leads to problems with *any* object that does not define equality to be reflexive, not just NANs. I say you have that backwards. It is a legitimate shortcut, and any object that (perversely) doesn't define equality to be reflexive leads (unsurprisingly) to problems with it (and with *anything else* that - very reasonably - assumes that identity implies equality).
Mark Shannon wrote:
Although both NaN == NaN and NaN != NaN could arguably be a "maybe" value, the all important reflexivity (x == x is True) is effectively part of the language. All collections rely on it and Python wouldn't be much use without dicts, tuples and lists.
Perhaps they shouldn't rely on it. Identity tests are an implementation detail. But in any case, reflexivity is *not* a guarantee of Python. With rich comparisons, you can define __eq__ to do anything you like.
And you can write True = False (at least in older versions of Python you could). No language stops you from writing stupid programs. In fact I would propose that the language should DEFINE the meaning of "==" to be True if its operands are identical, and only if they are not would it use the comparison operators, thus enforcing reflexivity. (Nothing stops you from writing your own non-reflexive __eq__ and calling it explicitly, and I think it is right that you should have to work harder and be more explicit if you want that behaviour.) Please, please, can we have a bit of common sense and perspective here. No-one (not even a mathematician) except someone from Wonderland would seriously want an object not equal to itself. Regards Rob Cliffe
On 4/28/2011 4:40 AM, Mark Shannon wrote:
NaN is *not* a number (the clue is in the name).
The problem is that the committee itself did not believe or stay consistent with that. In the text of the draft, they apparently refer to NaN as an indefinite, unspecified *number*. Sort of like a random variable with a uniform pseudo* distribution over the reals (* 0 everywhere with integral 1). Or a quantum particle present but smeared out over all space. And that apparently is their rationale for NaN != NaN: an unspecified number will equal another unspecified number with probability 0. The rationale for bool(NaN) == True is that an unspecified *number* will be 0 with probability 0. If NaN truly indicated an *absence* (like 0 and '') then bool(NaN) should be False.

I think the committee goofed -- badly. Statisticians used missing value indicators long before the committee existed. They had no problem thinking that the indicator, as an object, equaled itself. So one could write (and I often did through the 1980s) the equivalent of

for i, x in enumerate(datavec):
    if x == XMIS:  # singleton missing value indicator for BMDP
        datavec[i] = default

(Statistics packages have no concept of identity distinct from equality.) If statisticians had made XMIS != XMIS, that obvious code would not have worked, as it will not today for Python. Instead, the special case circumlocution of "if isXMIS(x):" would have been required, adding one more unnecessary function to the list of builtins.

NaN is, in its domain, the equivalent of None (== Not a Value), which also serves as an alternative to immediately raising an exception. But like XMIS, None == None. Also, bool(None) is correctly False for something that indicates absence.
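[Editorial note: in today's Python the "special case circumlocution" Terry describes takes the form of math.isnan(), precisely because an == test against NaN can never match. A sketch of the BMDP-style loop above, with hypothetical data:]

```python
import math

default = 0.0
datavec = [1.5, float('nan'), 3.0, float('nan')]

# The `x == XMIS` test from the BMDP example cannot work for NaN,
# so an explicit predicate is required instead:
for i, x in enumerate(datavec):
    if math.isnan(x):
        datavec[i] = default

print(datavec)  # [1.5, 0.0, 3.0, 0.0]
```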
Python treats it as if it were a number:
As I said, so did the committee, and that was its mistake that we are more or less stuck with.
NaN does not have to be a float or a Decimal. Perhaps it should have its own class.
Like None
As pointed out by Meyer: NaN == NaN is False is no more logical than NaN != NaN is False
This is wrong if False/True are interpreted as probabilities 0 and 1.
To summarise:
NaN is required so that floating point operations on arrays and lists do not raise unwanted exceptions.
Like None.
NaN is Not a Number (therefore should be neither a float nor a Decimal). Making it a new class would solve some of the problems discussed, but would create new problems instead.
Agreed, if we were starting fresh.
Correct behaviour of collections is more important than IEEE conformance of NaN comparisons.
Also agreed. -- Terry Jan Reedy
Terry Reedy wrote:
I think the committee goofed -- badly. Statisticians used missing value indicators long before the committee existed. They had no problem thinking that the indicator, as an object, equaled itself. So one could write (and I often did through the 1980s) the equivalent of
for i, x in enumerate(datavec):
    if x == XMIS:  # singleton missing value indicator for BMDP
        datavec[i] = default
But NANs aren't missing values (although some people use them as such, that can be considered abuse of the concept). R distinguishes NANs from missing values: they have a built-in value NaN, and a separate built-in value NA which is the canonical missing value. R also treats comparisons of both special values as a missing value:
> NA == NA
[1] NA
> NaN == NaN
[1] NA
including reflexivity:
> x = NA
> x == x
[1] NA
which strikes me as the worst of both worlds, guaranteed to annoy those who want the IEEE behaviour where NANs compare unequal, those like Terry who expect missing values to compare equal to other missing values, and those who want reflexivity to be treated as an invariant no matter what.
NaN is Not a Number (therefore should be neither a float nor a Decimal). Making it a new class would solve some of the problems discussed, but would create new problems instead.
Agreed, if we were starting fresh.
I don't see that making NANs a separate class would make any practical difference what-so-ever, but the point is moot since we're not starting fresh :)
Correct behaviour of collections is more important than IEEE conformance of NaN comparisons.
Also agreed.
To be pedantic, the IEEE standard doesn't have anything to say about comparisons of lists of floats that might contain NANs. Given the current *documented* behaviour that list equality is based on object equality, the actual behaviour is surprising, but I don't think there is anything wrong with the idea of containers assuming that their elements are reflexive. -- Steven
Terry Reedy
On 4/28/2011 4:40 AM, Mark Shannon wrote:
NaN does not have to be a float or a Decimal. Perhaps it should have its own class.
Like None
Would it make sense for ‘NaN’ to be another instance of ‘NoneType’? -- \ “I am too firm in my consciousness of the marvelous to be ever | `\ fascinated by the mere supernatural …” —Joseph Conrad, _The | _o__) Shadow-Line_ | Ben Finney
Taking a step back from all this, why does Python allow NaNs to arise from computations *at all*? +Inf and -Inf are arguably useful elements of the algebra, yet Python insists on raising an exception for 1.0/0.0 instead of returning an infinity. Why do this but not raise an exception for any operation that produces a NaN? -- Greg
Greg Ewing wrote:
Taking a step back from all this, why does Python allow NaNs to arise from computations *at all*?
The real question should be, why does Python treat all NANs as signalling NANs instead of quiet NANs? I don't believe this helps anyone.
+Inf and -Inf are arguably useful elements of the algebra, yet Python insists on raising an exception for 1.0./0.0 instead of returning an infinity.
I would argue that Python is wrong to do so. As I've mentioned a couple of times now, 20 years ago Apple felt that NANs and INFs weren't too complicated for non-programmers using Hypercard. There's no sign that Apple were wrong to expose NANs and INFs to users, no flood of Hypercard users confused by NAN inequality. -- Steven
On 4/28/11 8:44 PM, Steven D'Aprano wrote:
Greg Ewing wrote:
Taking a step back from all this, why does Python allow NaNs to arise from computations *at all*?
The real question should be, why does Python treat all NANs as signalling NANs instead of quiet NANs? I don't believe this helps anyone.
Actually, Python treats all NaNs as quiet NaNs and never signalling NaNs. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Robert Kern wrote:
On 4/28/11 8:44 PM, Steven D'Aprano wrote:
Greg Ewing wrote:
Taking a step back from all this, why does Python allow NaNs to arise from computations *at all*?
The real question should be, why does Python treat all NANs as signalling NANs instead of quiet NANs? I don't believe this helps anyone.
Actually, Python treats all NaNs as quiet NaNs and never signalling NaNs.
Sorry, did I get that backwards? I thought it was signalling NANs that cause a signal (in Python terms, an exception)? If I do x = 0.0/0 I get an exception instead of a NAN. Hence a signalling NAN. -- Steven
Steven D'Aprano
Robert Kern wrote:
On 4/28/11 8:44 PM, Steven D'Aprano wrote:
The real question should be, why does Python treat all NANs as signalling NANs instead of quiet NANs? I don't believe this helps anyone.
Actually, Python treats all NaNs as quiet NaNs and never signalling NaNs.
Sorry, did I get that backwards? I thought it was signalling NANs that cause a signal (in Python terms, an exception)?
If I do x = 0.0/0 I get an exception instead of a NAN. Hence a signalling NAN.
Robert has interpreted your “treats all NaNs as signalling NaNs” to mean “treats all objects that Python calls a NaN as signalling NaNs”, and is pointing out that no, the objects that Python calls “NaN” are all quiet NaNs. You might be clearer if you distinguish between what Python calls a NaN and what you call a NaN. It seems you're saying that some Python exception objects (e.g. ZeroDivisionError objects) are what you call NaNs, despite the fact that they're not what Python calls a NaN. -- \ “We can't depend for the long run on distinguishing one | `\ bitstream from another in order to figure out which rules | _o__) apply.” —Eben Moglen, _Anarchism Triumphant_, 1999 | Ben Finney
Ben Finney wrote:
Steven D'Aprano
writes: Robert Kern wrote:
On 4/28/11 8:44 PM, Steven D'Aprano wrote:
The real question should be, why does Python treat all NANs as signalling NANs instead of quiet NANs? I don't believe this helps anyone.

Actually, Python treats all NaNs as quiet NaNs and never signalling NaNs.

Sorry, did I get that backwards? I thought it was signalling NANs that cause a signal (in Python terms, an exception)?
If I do x = 0.0/0 I get an exception instead of a NAN. Hence a signalling NAN.
Robert has interpreted your “treats all NaNs as signalling NaNs” to mean “treats all objects that Python calls a NaN as signalling NaNs”, and is pointing out that no, the objects that Python calls “NaN” are all quiet NaNs.
I'm sorry for my lack of clarity. I'm referring to functions which potentially produce NANs, not the exceptions themselves. A calculation which might have produced a (quiet) NAN as the result instead raises an exception (which I'm treating as equivalent to a signal). -- Steven
Steven D'Aprano
I'm sorry for my lack of clarity. I'm referring to functions which potentially produce NANs, not the exceptions themselves. A calculation which might have produced a (quiet) NAN as the result instead raises an exception (which I'm treating as equivalent to a signal).
Yes, it produces a Python exception, which is not a Python NaN. If you want to talk about “signalling NaNs”, you'll have to distinguish that (every time!) so you're not misunderstood as referring to a Python NaN object. -- \ “It's my belief we developed language because of our deep inner | `\ need to complain.” —Jane Wagner, via Lily Tomlin | _o__) | Ben Finney
On Fri, Apr 29, 2011 at 3:28 PM, Steven D'Aprano
Robert Kern wrote:
Actually, Python treats all NaNs as quiet NaNs and never signalling NaNs.
Sorry, did I get that backwards? I thought it was signalling NANs that cause a signal (in Python terms, an exception)?
If I do x = 0.0/0 I get an exception instead of a NAN. Hence a signalling NAN.
Aside from the divide-by-zero case, we treat NaNs as quiet NaNs. This is largely due to the fact that float operations are delegated to the underlying CPU, and SIGFPE is ignored by default. You can fiddle with it either by building and using the fpectl module, or else by switching to decimal.Decimal() instead (which offers much finer control over signalling through its thread-local context information). The latter is by far the preferable course, unless you're targeting specific hardware with well-defined FPE behaviour. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
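[Editorial note: the finer control offered by decimal can be sketched like this. With the InvalidOperation trap enabled (the default) 0/0 raises; with it disabled, the same operation quietly yields a NaN:]

```python
from decimal import Decimal, localcontext, InvalidOperation

# Default context: InvalidOperation is trapped, so 0/0 raises an exception
try:
    Decimal(0) / Decimal(0)
except InvalidOperation:
    print("trapped: signalling-style behaviour")

# Disable the trap: the same operation now returns a quiet NaN instead
with localcontext() as ctx:
    ctx.traps[InvalidOperation] = False
    result = Decimal(0) / Decimal(0)

print(result)           # NaN
print(result.is_nan())  # True
```

This per-context trap switch is exactly the signalling-vs-quiet choice that binary floats in Python do not expose.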
On 4/29/11 1:35 AM, Nick Coghlan wrote:
On Fri, Apr 29, 2011 at 3:28 PM, Steven D'Aprano
wrote: Robert Kern wrote:
Actually, Python treats all NaNs as quiet NaNs and never signalling NaNs.
Sorry, did I get that backwards? I thought it was signalling NANs that cause a signal (in Python terms, an exception)?
If I do x = 0.0/0 I get an exception instead of a NAN. Hence a signalling NAN.
Aside from the divide-by-zero case, we treat NaNs as quiet NaNs.
And in fact, 0.0/0.0 is covered by the more general rule that x/0.0 raises ZeroDivisionError, not a rule that converts IEEE-754 INVALID exceptions into Python exceptions. Other operations that produce a NaN and issue an IEEE-754 INVALID signal do not raise a Python exception. But that's not the difference between signalling NaNs and quiet NaNs. A signalling NaN is one that when it is used as an *input* to an operation, it issues an INVALID signal, not whether a signal is issued when it is the *output* of an operation. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
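[Editorial note: Robert's input-vs-output distinction in concrete float terms. Python's NaNs are quiet as inputs: they propagate silently through arithmetic and comparisons rather than raising. A small sketch:]

```python
import math

# An IEEE-754 INVALID operation that Python does *not* turn into an exception:
nan = float('inf') - float('inf')
print(math.isnan(nan))  # True

# Using the NaN as an *input* to further arithmetic also raises nothing;
# it simply propagates -- the defining property of a quiet NaN.
result = (nan + 1.0) * 2.0
print(math.isnan(result))  # True

# Comparisons involving NaN (also INVALID per IEEE-754) just return False
print(nan < 0.0, nan > 0.0, nan == 0.0)  # False False False
```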
On Fri, Apr 29, 2011 at 11:31 AM, Robert Kern
And in fact, 0.0/0.0 is covered by the more general rule that x/0.0 raises ZeroDivisionError, not a rule that converts IEEE-754 INVALID exceptions into Python exceptions.
It is unfortunate that the official text of IEEE-754 is not freely available, and as a result a lot of discussion in this thread is based on imperfect information. I find Kahan's "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" [1] a reasonable reference in the absence of the official text. According to Kahan's notes, the INVALID operation is defined as follows:

"""
Exception: INVALID operation. Signaled by the raising of the INVALID flag whenever an operation's operands lie outside its domain, this exception's default, delivered only because any other real or infinite value would most likely cause worse confusion, is NaN, which means "Not a Number." IEEE 754 specifies that seven invalid arithmetic operations shall deliver a NaN unless they are trapped: real √(Negative), 0*∞, 0.0/0.0, ∞/∞, REMAINDER(Anything, 0.0), REMAINDER(∞, Anything), ∞ - ∞ when signs agree (but ∞ + ∞ = ∞ when signs agree). Conversion from floating-point to other formats can be INVALID too, if their limits are violated, even if no NaN can be delivered.
"""

In contrast, Kahan describes the DIVIDE by ZERO exception as "a misnomer perpetrated for historical reasons. A better name for this exception is 'Infinite result computed Exactly from Finite operands.'"
Other operations that produce a NaN and issue an IEEE-754 INVALID signal do not raise a Python exception.
Some do:
>>> math.sqrt(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: math domain error
I think the only exceptions are the operations involving infinity. The likely rationale is that since infinity is not produced by Python arithmetic, those who use inf are likely to expect inf*0 etc. to produce nan. The following seems to be an oversight:
>>> 1e300 * 1e300
inf
compared to
>>> 1e300 ** 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: (34, 'Result too large')
[1] http://www.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps
On Fri, Apr 29, 2011 at 11:35, Alexander Belopolsky
On Fri, Apr 29, 2011 at 11:31 AM, Robert Kern
wrote: .. And in fact, 0.0/0.0 is covered by the more general rule that x/0.0 raises ZeroDivisionError, not a rule that converts IEEE-754 INVALID exceptions into Python exceptions.
It is unfortunate that official text of IEEE-754 is not freely available and as a result a lot of discussion in this thread is based on imperfect information.
I find Kahan's "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" [1] a reasonable reference in the absence of the official text. According to Kahan's notes, INVALID operation is defined as follows:
""" Exception: INVALID operation.
Signaled by the raising of the INVALID flag whenever an operation's operands lie outside its domain, this exception's default, delivered only because any other real or infinite value would most likely cause worse confusion, is NaN , which means “ Not a Number.” IEEE 754 specifies that seven invalid arithmetic operations shall deliver a NaN unless they are trapped:
real √(Negative) , 0*∞ , 0.0/0.0 , ∞/∞, REMAINDER(Anything, 0.0) , REMAINDER( ∞, Anything) , ∞ - ∞ when signs agree (but ∞ + ∞ = ∞ when signs agree).
Conversion from floating-point to other formats can be INVALID too, if their limits are violated, even if no NaN can be delivered. """
In contrast, Kahan describes DIVIDE by ZERO exception as "a misnomer perpetrated for historical reasons. A better name for this exception is 'Infinite result computed Exactly from Finite operands.'"
Nonetheless, the reason that *Python* raises a ZeroDivisionError is because it checks that the divisor is 0.0, not because 0.0/0.0 would issue an INVALID signal. I didn't mean that 0.0/0.0 is a "Division by Zero" error as defined in IEEE-754. This is another area where Python ignores the INVALID signal and does its own thing.
Other operations that produce a NaN and issue an IEEE-754 INVALID signal do not raise a Python exception.
Some do:
math.sqrt(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: math domain error
Right. Elsewhere I gave a more exhaustive list including this one. The other is int(nan), though that becomes a Python exception for a more fundamental reason (there is no integer value that can represent it) than that the IEEE-754 standard specifies that the operation should signal INVALID. Arithmetic operations on signalling NaNs don't raise an exception either. These are the minority *exceptions* to the majority of cases where operations on Python floats that would issue an INVALID signal do not raise Python exceptions. If you want to lump all of the inf-related cases together, that's fine, but arithmetic operations on signalling NaNs and comparisons with NaNs form two more groups of INVALID operations that do not raise Python exceptions. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
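Robert's point that comparisons with NaN are themselves INVALID operations that Python leaves quiet is easy to demonstrate (illustrative snippet, not part of the original message):

```python
nan = float('nan')

# IEEE-754 classes ordering comparisons involving NaN as INVALID
# operations, but Python simply returns an unordered result
# rather than raising an exception:
assert (nan < 1.0) is False
assert (nan > 1.0) is False
assert (nan <= nan) is False
assert (nan == nan) is False
assert (nan != nan) is True
```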
On Fri, Apr 29, 2011 at 2:18 AM, Greg Ewing
Taking a step back from all this, why does Python allow NaNs to arise from computations *at all*?
History, I think. There's a c.l.p. message from Tim Peters somewhere saying something along the lines that he'd love to make (e.g.,) 1e300 * 1e300 raise an exception instead of producing an infinity, but dare not for fear of the resulting outcry from people who use the current behaviour. Apologies if I've misrepresented what he actually said---I'm failing to find the exact message at the moment. If it weren't for backwards compatibility, I'd love to see Python raise exceptions instead of producing IEEE special values: IOW, to act as though the divide-by-zero, overflow and invalid_operation FP signals all produce an exception. As a bonus, perhaps there could be a mode that allowed 'nonstop' arithmetic, under which infinities and nans were produced as per IEEE 754: with math.non_stop_arithmetic(): ... But this is python-ideas territory. Mark
On Sat, 30 Apr 2011 08:02:33 +0100
Mark Dickinson
On Fri, Apr 29, 2011 at 2:18 AM, Greg Ewing
wrote: Taking a step back from all this, why does Python allow NaNs to arise from computations *at all*?
History, I think. There's a c.l.p. message from Tim Peters somewhere saying something along the lines that he'd love to make (e.g.,) 1e300 * 1e300 raise an exception instead of producing an infinity, but dare not for fear of the resulting outcry from people who use the current behaviour. Apologies if I've misrepresented what he actually said---I'm failing to find the exact message at the moment.
If it weren't for backwards compatibility, I'd love to see Python raise exceptions instead of producing IEEE special values: IOW, to act as though the divide-by-zero, overflow and invalid_operation FP signals all produce an exception. As a bonus, perhaps there could be a mode that allowed 'nonstop' arithmetic, under which infinities and nans were produced as per IEEE 754:
with math.non_stop_arithmetic(): ...
But this is python-ideas territory.
I would much prefer this idea than the idea of making NaNs non-orderable. It would break code, but at least it would break in less unexpected and annoying ways. Regards Antoine.
[Greg Ewing]
Taking a step back from all this, why does Python allow NaNs to arise from computations *at all*?
[Mark Dickinson]
History, I think. There's a c.l.p. message from Tim Peters somewhere saying something along the lines that he'd love to make (e.g.,) 1e300 * 1e300 raise an exception instead of producing an infinity, but dare not for fear of the resulting outcry from people who use the current behaviour. Apologies if I've misrepresented what he actually said---I'm failing to find the exact message at the moment.
If it weren't for backwards compatibility, I'd love to see Python raise exceptions instead of producing IEEE special values: IOW, to act as though the divide-by-zero, overflow and invalid_operation FP signals all produce an exception.
Exactly. It's impossible to create a NaN from "normal" inputs without triggering div-by-0 or invalid_operation, and if overflow were also enabled it would likewise be impossible to create an infinity from normal inputs. So, 20 years ago, that's how I arranged Kendall Square Research's default numeric environment: enabled those three exception traps by default, and left the underflow and inexact exception traps disabled by default. It's not just "naive" users initially baffled by NaNs and infinities; most of KSR's customers were heavy duty number crunchers, and they didn't know what to make of them at first either. But experts do find them very useful (after climbing the 754 learning curve), so there was also a simple function call (from all the languages we supported - C, C++, FORTRAN and Pascal), to establish the 754 default all-traps-disabled mode:
As a bonus, perhaps there could be a mode that allowed 'nonstop' arithmetic, under which infinities and nans were produced as per IEEE 754:
with math.non_stop_arithmetic(): ...
But this is python-ideas territory.
All of which is just moving toward the numeric environment 754 was aiming for from the start: complete user control over which exception traps are and aren't currently enabled. The only quibble I had with that vision was its baffle-99%-of-users requirement that they _all_ be disabled by default. As Kahan wrote, it's called "an exception" because no matter _what_ you do, someone will take exception to your policy ;-) That's why user control is crucial in a 754 environment.

He wanted even more control than 754 recommends (in particular, he wanted the user to be able to specify _which_ value was returned when an exception triggered; e.g., in some apps it may well be more useful for overflow to produce a NaN than an infinity, or to return the largest normal value with the correct sign).

Unfortunately, the hardware and academic types who created 754 had no grasp of how difficult it is to materialize their vision in software, and especially not of how very difficult it is to backstitch a pleasant wholly conforming environment into an existing language. As a result, I'm afraid the bulk of 754's features are still viewed as "a nuisance" by a vast majority of users :-(
On Fri, Apr 29, 2011 at 11:11 AM, Ben Finney
Would it make sense for ‘NaN’ to be another instance of ‘NoneType’?
This is fine IMHO as I (personally) find myself doing things like:

if x is None:
    ...

cheers James -- -- James Mills -- -- "Problems are solved by method"
Terry Reedy writes:
Python treats it as if it were a number:
As I said, so did the committee, and that was its mistake that we are more or less stuck with.
The committee didn't really have a choice. You could ask that they call NaNs something else, but some bit pattern is going to appear in the result register after each computation, and further operations may (try to) use that bit pattern. Seems reasonable to me to apply duck-typing and call those patterns "numbers" for the purpose of IEEE 754, and to define them in such a way that operating on them produces a non-NaN only when *all* numbers (including infinity) produce the same non-NaN.

The alternative is to raise an exception whenever a NaN would be generated (but something is still going to appear in the register; I don't know any number that should be put there, do you?) That is excessively punishing to Python users and programmers, though, since Python handles exceptions by terminating the computation. (Kahan points out that signaling NaNs are essentially never used for this reason.)

Other aspects of NaN behavior may be a mistake. But it's not clear to me, even after all the discussion in this thread.
On Fri, Apr 29, 2011 at 12:10 AM, Stephen J. Turnbull
Other aspects of NaN behavior may be a mistake. But it's not clear to me, even after all the discussion in this thread.
ISTM that the current behavior of NaN (never mind the identity issue) helps numeric experts write better code. For naive users, however, it causes puzzlement if they ever run into it. Decimal, for that reason, has a context that lets one specify different behaviors when a NaN is produced. Would it make sense to add a float context that also lets one specify what should happen? That could include returning Inf for 1.0/0.0 (for experts), or raising exceptions when NaNs are produced (for the numerically naive like myself). I could see a downside too, e.g. the correctness of code that passingly uses floats might be affected by the context settings. There's also the question of whether the float context should affect int operations; floats vs. ints is another can of worms since (in Python 3) we attempt to tie them together through 1/2 == 0.5, but ints have a much larger range than floats. -- --Guido van Rossum (python.org/~guido)
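The decimal context Guido refers to already offers exactly this kind of control today, which gives a concrete picture of what a float context might look like:

```python
from decimal import Decimal, localcontext, InvalidOperation, DivisionByZero

# With the default context, invalid operations raise immediately:
try:
    Decimal(0) / Decimal(0)
except InvalidOperation:
    print('InvalidOperation trapped')

# Disabling the traps gives IEEE-style "non-stop" results instead:
with localcontext() as ctx:
    ctx.traps[InvalidOperation] = False
    ctx.traps[DivisionByZero] = False
    print(Decimal(0) / Decimal(0))   # NaN
    print(Decimal(1) / Decimal(0))   # Infinity
```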
On Fri, Apr 29, 2011 at 1:11 PM, Guido van Rossum
… Would it make sense to add a float context that also lets one specify what should happen? That could include returning Inf for 1.0/0.0 (for experts), or raising exceptions when NaNs are produced (for the numerically naive like myself).
ISTM, this is approaching py4k territory. Adding contexts will not solve the backward compatibility problem unless you introduce a "quirks" context that would preserve current warts and make it the default. For what it's worth, I think the next major version of Python should use decimal as its main floating point type and leave binary floats to numerical experts.
On Sat, Apr 30, 2011 at 3:11 AM, Guido van Rossum
Decimal, for that reason, has a context that lets one specify different behaviors when a NaN is produced. Would it make sense to add a float context that also lets one specify what should happen? That could include returning Inf for 1.0/0.0 (for experts), or raising exceptions when NaNs are produced (for the numerically naive like myself).
I could see a downside too, e.g. the correctness of code that passingly uses floats might be affected by the context settings. There's also the question of whether the float context should affect int operations; floats vs. ints is another can of worms since (in Python 3) we attempt to tie them together through 1/2 == 0.5, but ints have a much larger range than floats.
Given that we delegate most float() behaviour to the underlying CPU and C libraries (and then the math module tries to cope with any cross-platform discrepancies), introducing context handling isn't easy, and would likely harm the current speed advantage that floats hold over the decimal module. We decided that losing the speed advantage of native integers was worthwhile in order to better unify the semantics of int and long for Py3k, but both the speed differential and the semantic gap between float() and decimal.Decimal() are significantly larger. However, I did find Terry's suggestion of using the warnings module to report some of the floating point corner cases that currently silently produce unexpected results to be an interesting one. If those operations issued a FloatWarning, then users could either silence them or turn them into errors as desired. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 5/1/2011 7:27 AM, Nick Coghlan wrote:
However, I did find Terry's suggestion of using the warnings module to report some of the floating point corner cases that currently silently produce unexpected results to be an interesting one. If those operations issued a FloatWarning, then users could either silence them or turn them into errors as desired.
I would like to take credit for that, but I was actually seconding Alexander's insight and idea. I may have added the specific name after looking at the current list and seeing UnicodeWarning and BytesWarning, so why not a FloatWarning. I did read the warnings doc more carefully to verify that it would really put the user in control, which was apparently the intent of the committee. I am not sure whether FloatWarnings should be ignored or printed by default. Ignored would, I guess, match current behavior, unless something else is changed as part of a more extensive overhaul. -f and -ff are available to turn ignored FloatWarning into print or raise exception, as with BytesWarning. I suspect that these would get at least as much usage as -b and -bb.

So I see 4 questions:
1. Add FloatWarning?
2. If yes, default disposition?
3. Add command line options?
4. Use the addition of FloatWarning as an opportunity to change other defaults, given that user will have more options?

-- Terry Jan Reedy
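The user control Terry describes falls out of the existing warnings machinery. A sketch of the proposal, where both `FloatWarning` and the division helper are invented for illustration:

```python
import warnings

class FloatWarning(Warning):
    """Hypothetical warning category from the proposal (not in CPython)."""

def warned_div(x, y):
    # Sketch: issue a FloatWarning instead of silently producing a
    # special value (sign handling omitted for brevity).
    if y == 0.0:
        warnings.warn("float division by zero", FloatWarning)
        return float("nan") if x == 0.0 else float("inf")
    return x / y

# The warnings filters already give users the control being discussed:
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FloatWarning)   # today's silent behaviour
    print(warned_div(1.0, 0.0))                     # inf, no complaint

with warnings.catch_warnings():
    warnings.simplefilter("error", FloatWarning)    # the hypothetical -ff mode
    try:
        warned_div(1.0, 0.0)
    except FloatWarning:
        print("raised FloatWarning")
```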
On 4/28/2011 12:32 AM, Nick Coghlan wrote:
On Thu, Apr 28, 2011 at 5:27 PM, Glenn Linderman
wrote: Without having read the original articulations by Raymond or any discussions of the pros and cons, In my first post to this thread, I pointed out the bug tracker item (http://bugs.python.org/issue4296) that included the discussion of restoring this behaviour to the 3.x branch, after it was inadvertently removed.
Sure. I had read that. It was mostly discussing it from a backward compatibility perspective, although it mentioned some invariants as well, etc. But mentioning the invariants is different than reading discussion about the pros and cons of such, or what reasoning lead to wanting them to be invariants. Raymond does make a comment about necessary for correctly reasoning about programs, but that is just a tautological statement based on previous agreement, rather than being the discussion itself, which must have happened significantly earlier. One of your replies to Alexander seems to say the same thing I was saying, though.... On 4/28/2011 12:57 AM, Nick Coghlan wrote:
On Thu, Apr 28, 2011 at 5:30 PM, Alexander Belopolsky wrote:
Can you give examples of algorithms that would break if one of your invariants is violated, but would still work if the data contains NaNs?

Sure, anything that cares more about objects than it does about values. The invariants are about making containers behave like containers as far as possible, even in the face of recalcitrant types like IEEE754 floating point.
That reinforces the idea that the discussion about containers was to try to make them like containers in pre-NaN languages such as Eiffel, rather than in post-NaN languages such as SQL. It is not that one cannot reason about containers in either case, but rather that one cannot borrow all the reasoning from pre-NaN concepts and apply it to post-NaN concepts. So if one's experience is with pre-NaN container concepts, one pushes that philosophy and reasoning instead of embracing and extending post-NaN concepts. That's not all bad, except when the documentation says one thing and the implementation does something else. Your comment in that same message "we can contain the damage to some degree" speaks to that philosophy. Based on my current limited knowledge of Python internals, and available time to pursue figuring out whether the compatibility issues would preclude extending Python containers to embrace post-NaN concepts, I'll probably just learn your list of invariants, and just be aware that if I need a post-NaN container, I'll have to implement it myself. I suspect doing sequences would be quite straightforward, other containers less so, unless the application of concern is sufficiently value-based to permit the trick of creating a new NaN each time it is inserted into a different container.
On Thu, Apr 28, 2011 at 2:54 AM, Nick Coghlan
No, as Raymond has articulated a number of times over the years, it's a property of the equivalence relation that is needed in order to present sane invariants to users of the container. I included in the bug report the critical invariants I am currently aware of that should hold, even when the container may hold types with a non-reflexive definition of equality:
assert [x] == [x]                    # Generalised to all container types
assert not [x] != [x]                # Generalised to all container types
for x in c:
    assert x in c
    assert c.count(x) > 0            # If applicable
    assert 0 <= c.index(x) < len(c)  # If applicable
It is an interesting question of what "sane invariants" are. Why do you consider the invariants that you listed essential while, say,

if c1 == c2:
    assert all(x == y for x,y in zip(c1, c2))

optional? Can you give examples of algorithms that would break if one of your invariants is violated, but would still work if the data contains NaNs?
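The invariants under discussion do in fact hold in current CPython even when the data contains NaN, which is easy to verify (illustrative check, not part of the original message):

```python
nan = float('nan')
c = [nan, 1.0]

x = nan
assert [x] == [x]          # holds via the identity shortcut
assert not [x] != [x]

for x in c:
    assert x in c
    assert c.count(x) > 0
    assert 0 <= c.index(x) < len(c)

# ...even though the element itself is not equal to itself:
assert not (nan == nan)
```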
On Thu, Apr 28, 2011 at 5:30 PM, Alexander Belopolsky
On Thu, Apr 28, 2011 at 2:54 AM, Nick Coghlan
wrote: .. No, as Raymond has articulated a number of times over the years, it's a property of the equivalence relation that is needed in order to present sane invariants to users of the container. I included in the bug report the critical invariants I am currently aware of that should hold, even when the container may hold types with a non-reflexive definition of equality:
assert [x] == [x]                    # Generalised to all container types
assert not [x] != [x]                # Generalised to all container types
for x in c:
    assert x in c
    assert c.count(x) > 0            # If applicable
    assert 0 <= c.index(x) < len(c)  # If applicable
It is an interesting question of what "sane invariants" are. Why you consider the invariants that you listed essential while say
if c1 == c2: assert all(x == y for x,y in zip(c1, c2))
optional?
Because this assertion is an assertion about the behaviour of comparisons that violates IEEE754, while the assertions I list are all assertions about the behaviour of containers that can be made true *regardless* of IEEE754 by checking identity explicitly. The correct assertion under Python's current container semantics is:

if list(c1) == list(c2):  # Make ordering assumption explicit
    assert all(x is y or x == y for x,y in zip(c1, c2))  # Enforce reflexivity

Meyer is a purist - sticking with the mathematical definition of equality is the sort of thing that fits his view of the world and what Eiffel should be, even if it hinders interoperability with other languages and tools. Python tends to be a bit more pragmatic about things, in particular when it comes to interoperability, so it makes sense to follow IEEE754 and the decimal specification at the individual comparison level. However, we can contain the damage to some degree by specifying that containers should enforce reflexivity where they need it. This is already the case at the implementation level (collections.Sequence aside), it just needs to be pushed up to the language definition level.
Can you give examples of algorithms that would break if one of your invariants is violated, but would still work if the data contains NaNs?
Sure, anything that cares more about objects than it does about values. The invariants are about making containers behave like containers as far as possible, even in the face of recalcitrant types like IEEE754 floating point. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
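The contrast between the naive assertion and the reflexivity-enforcing one is easy to demonstrate (illustrative snippet, not part of the original message):

```python
nan = float('nan')
c1, c2 = [nan], [nan]   # two lists sharing the same NaN object

# The list comparison succeeds thanks to the identity shortcut:
assert list(c1) == list(c2)

# But the naive element-wise assertion fails, since nan != nan:
assert not all(x == y for x, y in zip(c1, c2))

# Nick's corrected form, with reflexivity enforced explicitly, holds:
assert all(x is y or x == y for x, y in zip(c1, c2))
```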
On Thu, Apr 28, 2011 at 3:57 AM, Nick Coghlan
It is an interesting question of what "sane invariants" are. Why you consider the invariants that you listed essential while say
if c1 == c2: assert all(x == y for x,y in zip(c1, c2))
optional?
Because this assertion is an assertion about the behaviour of comparisons that violates IEEE754, while the assertions I list are all assertions about the behaviour of containers that can be made true *regardless* of IEEE754 by checking identity explicitly.
AFAIK, IEEE754 says nothing about comparison of containers, so my invariant cannot violate it. What you probably wanted to say is that my invariant cannot be achieved in the presence of IEEE754 conforming floats, but this observation by itself does not make my invariant less important than yours. It just makes yours easier to maintain.
The correct assertion under Python's current container semantics is:
if list(c1) == list(c2):  # Make ordering assumption explicit
    assert all(x is y or x == y for x,y in zip(c1, c2))  # Enforce reflexivity
Being correct is different from being important. What practical applications of lists containing NaNs do this and your other invariants enable? I think even with these invariants in place one should either filter out NaNs from their lists or replace them with None before applying container operations.
On Thu, Apr 28, 2011 at 6:30 PM, Alexander Belopolsky
On Thu, Apr 28, 2011 at 3:57 AM, Nick Coghlan
wrote: .. It is an interesting question of what "sane invariants" are. Why you consider the invariants that you listed essential while say
if c1 == c2: assert all(x == y for x,y in zip(c1, c2))
optional?
Because this assertion is an assertion about the behaviour of comparisons that violates IEEE754, while the assertions I list are all assertions about the behaviour of containers that can be made true *regardless* of IEEE754 by checking identity explicitly.
AFAIK, IEEE754 says nothing about comparison of containers, so my invariant cannot violate it. What you probably wanted to say is that my invariant cannot be achieved in the presence of IEEE754 conforming floats, but this observation by itself does not make my invariant less important than yours. It just makes yours easier to maintain.
No, I meant what I said. Your assertion includes a direct comparison between values (the "x == y" part) which means that IEEE754 has a bearing on whether or not it is a valid assertion. Every single one of my stated invariants consists solely of relationships between containers, or between a container and its contents. This keeps them all out of the domain of IEEE754 since the *container implementers* get to decide whether or not to factor object identity into the management of the container contents.

The core containment invariant is really only this one:

for x in c:
    assert x in c

That is, if we iterate over a container, all entries returned should be in the container. Hopefully it is non-controversial that this is a sane and reasonable invariant for a container *user* to expect.

The comparison invariants follow from the definition of set equivalence as:

set1 == set2 iff all(x in set2 for x in set1) and all(y in set1 for y in set2)

Again, notice that there is no comparison of items here - merely a consideration of the way items relate to containers. The rationale behind the count() and index() assertions is harder to define in implementation neutral terms, but their behaviour does follow naturally from the internal enforcement of reflexivity needed to guarantee that core invariant.

In mathematics, this is all quite straightforward and non-controversial, since it can be taken for granted that equality is reflexive (as it's part of the definition of what equality *means* - equivalence relations *are* relations that are symmetric, transitive and reflexive. Lose any one of those three properties and it isn't an equivalence relation any more). However, when we confront the practical reality of IEEE754 floating point values and the lack of reflexivity in the presence of NaN, we're faced with a choice of (at least) 4 alternatives:

1. Deny it.
Say equality is reflexive at the language level, and we don't care that it makes it impossible to fully implement IEEE754 semantics. This is what Eiffel does, and if you don't care about interoperability and the possibility of algorithmic equivalence with hardware implementations, it's probably not a bad idea. After all, why discard centuries of mathematical experience based on a decision that the IEEE754 committee can't clearly recall the rationale for, and didn't clearly document?

2. Tolerate it, but attempt to confine the breakage of mathematical guarantees to the arithmetic operations actually covered by the relevant standards. This is what CPython currently does by enforcing the container invariants at an implementation level, and, as I think it's a good way to handle the situation, this is what I am advocating lifting up to the language level through appropriate updates to the library and language reference. (Note that even changing the behaviour of float() leaves Python in this situation, since third party types will still be free to follow IEEE754. Given that, it seems relatively pointless to change the behaviour of builtin floats after all the effort that has gone into bringing them ever closer to IEEE754.)

3. Signal it. We already do this in some cases (e.g. for ZeroDivisionError), and I'm personally quite happy with the idea of raising ValueError in other cases, such as when attempting to perform ordering comparisons on NaN values.

4. Embrace it. Promote NaN to a language level construct, define semantics allowing it to propagate through assorted comparison and other operations (including short-circuiting logic operators) without being coerced to True as it is now.

Documenting the status quo is the *only* necessary step in all of this (and Raymond has already adopted the relevant tracker issue).
There are tweaks to the current semantics that may be useful (specifically ValueError when attempting to order NaN), but changing the meaning of equality for floats probably isn't one of them (since that only fixes one type, while fixing the affected algorithms fixes *all* types). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 4/28/2011 6:11 AM, Nick Coghlan wrote:
On Thu, Apr 28, 2011 at 6:30 PM, Alexander Belopolsky
wrote: On Thu, Apr 28, 2011 at 3:57 AM, Nick Coghlan
wrote: .. It is an interesting question of what "sane invariants" are. Why you consider the invariants that you listed essential while say
if c1 == c2: assert all(x == y for x,y in zip(c1, c2))
optional?
Because this assertion is an assertion about the behaviour of comparisons that violates IEEE754, while the assertions I list are all assertions about the behaviour of containers that can be made true *regardless* of IEEE754 by checking identity explicitly.
AFAIK, IEEE754 says nothing about comparison of containers, so my invariant cannot violate it. What you probably wanted to say is that my invariant cannot be achieved in the presence of IEEE754 conforming floats, but this observation by itself does not make my invariant less important than yours. It just makes yours easier to maintain.
No, I meant what I said. Your assertion includes a direct comparison between values (the "x == y" part) which means that IEEE754 has a bearing on whether or not it is a valid assertion. Every single one of my stated invariants consists solely of relationships between containers, or between a container and its contents. This keeps them all out of the domain of IEEE754 since the *container implementers* get to decide whether or not to factor object identity into the management of the container contents.
The core containment invariant is really only this one:
for x in c: assert x in c
That is, if we iterate over a container, all entries returned should be in the container. Hopefully it is non-controversial that this is a sane and reasonable invariant for a container *user* to expect.
The comparison invariants follow from the definition of set equivalence as:
set1 == set2 iff all(x in set2 for x in set1) and all(y in set1 for y in set2)
Again, notice that there is no comparison of items here - merely a consideration of the way items relate to containers.
I agree that the container (author) gets to define container equality. The definition should also be correctly documented. 5.9. Comparisons says "Tuples and lists are compared lexicographically using comparison of corresponding elements. This means that to compare equal, each element must compare equal and the two sequences must be of the same type and have the same length.". This, I believe is the same as what Hrvoje said "I would expect l1 == l2, where l1 and l2 are both lists, to be semantically equivalent to len(l1) == len(l2) and all(imap(operator.eq, l1, l2))." But "Currently it isn't, and that was the motivation for this thread." In this case, I think the discrepancy should be fixed by changing the doc. Add 'be identical or ' before 'compare equal'. -- Terry Jan Reedy
Nick Coghlan wrote:
Because this assertion is an assertion about the behaviour of comparisons that violates IEEE754, while the assertions I list are all assertions about the behaviour of containers that can be made true *regardless* of IEEE754 by checking identity explicitly.
Aren't you making something of a circular argument here? You're saying that non-reflexive comparisons are okay because they don't interfere with certain critical invariants. But you're defining those invariants as the ones that don't happen to conflict with non-reflexive comparisons! -- Greg
On Thu, Apr 28, 2011 at 7:10 PM, Greg Ewing
Nick Coghlan wrote:
Because this assertion is an assertion about the behaviour of comparisons that violates IEEE754, while the assertions I list are all assertions about the behaviour of containers that can be made true *regardless* of IEEE754 by checking identity explicitly.
Aren't you making something of a circular argument here? You're saying that non-reflexive comparisons are okay because they don't interfere with certain critical invariants. But you're defining those invariants as the ones that don't happen to conflict with non-reflexive comparisons!
No, I'm taking the existence of non-reflexive comparisons as a given (despite agreeing with Meyer from a theoretical standpoint) because:

1. IEEE754 works that way
2. Even if float() is changed to not work that way, 3rd party types may still do so
3. Supporting rich comparisons makes it impossible for Python to enforce reflexivity at the language level (even if we wanted to)

However, as I detailed in my reply to Antoine, the critical container invariants I cite *don't include* direct object-object comparisons. Instead, they merely describe how objects relate to containers, and how containers relate to each other, using only the two basic rules that objects retrieved from a container should be in that container and that two sets are equivalent if they are each a subset of the other.

The question then becomes, how do we reconcile the container invariants with the existence of non-reflexive definitions of equality at the type level, and the answer is to officially adopt the approach already used in the standard container types: enforce reflexive equality at the container level, so that it doesn't matter that some types provide a non-reflexive version.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
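The container-level reflexivity enforcement Nick describes can be sketched in pure Python (the helper name is invented; CPython does the equivalent in C via the identity shortcut in PyObject_RichCompareBool):

```python
def seq_contains(seq, value):
    """Pure-Python sketch of the identity-then-equality test that
    CPython containers apply when searching for an element."""
    return any(item is value or item == value for item in seq)

nan = float('nan')
assert seq_contains([nan], nan)               # found via identity, despite nan != nan
assert not seq_contains([float('nan')], nan)  # a *different* NaN object is not found
```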
participants (22)
- Alexander Belopolsky
- Antoine Pitrou
- Ben Finney
- Ethan Furman
- Glenn Linderman
- Greg Ewing
- Guido van Rossum
- Hrvoje Niksic
- Hrvoje Niksic AVL HR
- Isaac Morland
- James Mills
- Mark Dickinson
- Mark Shannon
- Nick Coghlan
- Raymond Hettinger
- Rob Cliffe
- Robert Kern
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Tim Peters
- Łukasz Langa