[Python-Dev] nice()

Smith smiles at worksmail.net
Mon Feb 13 18:10:28 CET 2006


| From: Josiah Carlson <jcarlson at uci.edu>
| "Alan Gauld" <alan.gauld at freenet.co.uk> wrote:
|| However I do dislike the name nice() - there is already a nice() in
|| the 
|| os module with a fairly well understood function. 

perhaps trim(), nearly(), about(), or defer_the_pain_of() :-) I waited to think of names until after writing this; the reason for the last option may become apparent after reading the rest of this post.

|| But I'm sure some
|| time with a thesaurus can overcome that single mild objection. :-)
| 
| Presumably it would be located somewhere like the math module.

I would like to see it as accessible as round, int, float, and repr. I really think a round-from-the-left function is a nice tool to have. It's obviously very easy to build your own if you know which tools to use. Not everyone is going to be reading python-dev or similar lists, however, so having it handy would be nice.
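
For anyone who missed the earlier post, here is a bare-bones sketch of the sort of thing I mean. The examples below assume roughly this behavior: float(str(x)) by default, and an optional count of leading digits applied via '%.*e' (discussed further below). It is only a sketch, not a finished implementation:

###
>>> def nice(x, leadingDigits=None):
...     # Round x "from the left".  With no second argument, fall back on
...     # whatever str() shows (at most 12 significant digits in Python 2);
...     # with leadingDigits=n, keep the first n significant digits by
...     # formatting with '%.*e'.
...     if leadingDigits is None:
...         return float(str(x))
...     return float('%.*e' % (leadingDigits - 1, x))
...
>>> 3*0.1 == 3/10.
False
>>> nice(3*0.1) == nice(3/10.)
True
###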

| From: Greg Ewing <greg.ewing at canterbury.ac.nz>
| Smith wrote:
| 
||     When teaching some programming to total newbies, a common
||     frustration is how to explain why a==b is False when a and b are
||     floats computed by different routes which ``should'' give the
||     same results (if arithmetic had infinite precision).
| 
| This is just a special case of the problems inherent
| in the use of floating point. As with all of these,
| papering over this particular one isn't going to help
| in the long run -- another one will pop up in due
| course.
| 
| Seems to me it's better to educate said newbies not
| to use algorithms that require comparing floats for
| equality at all. 

I think that a helper function like nice() is a middle-ground solution to the problem: it falls short of using only decimal or rational values for numbers, but it does better than requiring a test of the error between floating values that should be equal but aren't because they were computed by different routes. Just as with the argument for making true division the default behavior of the computational environment, it seems a little unfriendly to expect the more casual user to have to worry that 3*0.1 is not the same as 3/10.0. I know--they really are different, and one should (eventually) understand why--but does anyone really want the warts of floating point representation popping up in their work if they could be avoided, or at least easily circumvented?

I know you know why the following numbers show up as not equal, but this is an example of the pain that crops up in a reasonably simple exercise: computing the bin boundaries for a histogram whose bins have a width of 0.1:

###
>>> for i in range(20):
...     if (i*.1 == i/10.) != (nice(i*.1) == nice(i/10.)):
...         print i, repr(i*.1), repr(i/10.), i*.1, i/10.
...
3 0.30000000000000004 0.29999999999999999 0.3 0.3
6 0.60000000000000009 0.59999999999999998 0.6 0.6
7 0.70000000000000007 0.69999999999999996 0.7 0.7
12 1.2000000000000002 1.2 1.2 1.2
14 1.4000000000000001 1.3999999999999999 1.4 1.4
17 1.7000000000000002 1.7 1.7 1.7
19 1.9000000000000001 1.8999999999999999 1.9 1.9
###

For garden-variety numbers that aren't full of garbage digits from fp computation, boundaries computed as 0.1*i are simply not going to agree with numbers as plain as 1.4 and 0.7.

Would anyone (and I truly don't know the answer) really mind if all floating point values were filtered through whatever lies behind the str() handling of floats before the comparison was made? I'm not saying that strings would be compared, but that float(str(x)) would be compared to float(str(y)) when x is compared to y, as in x<=y. If this could be done, wouldn't a lot of grief just go away, without requiring the use of decimal or rational types for many users?
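
To make that concrete (in Python 2.x, str() shows at most 12 significant digits while repr() shows 17):

###
>>> 7*0.1 == 7/10.
False
>>> float(str(7*0.1)) == float(str(7/10.))
True
###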

I understand that the above really is just a patch over the problem, but I'm wondering if it moves the problem far enough away that most users wouldn't have to worry about it. Here, for example, are the first values where the running sum doesn't equal the straight multiple of some step size:

###
>>> def go(x, n=1000):
...     # return the first point at which the running sum s and the
...     # direct multiple i*x disagree through nice()'s eyes
...     s = 0
...     i = 0
...     while s < n:
...         i += 1
...         s += x
...         if nice(s) != nice(i*x):
...             return i, s, i*x, repr(s), repr(i*x)
...
>>> for i in range(1, 100):
...     print i, go(i/1000.)
...     print
...
1 (60372 60.3719999999 60.372 60.371999999949999 60.372)

2 (49645 99.2899999999 99.29 99.289999999949998 99.290000000000006)
###

For the range given above, the earliest breakdown occurs at the 22496th multiple of 0.041. By the time someone needs to iterate that many times, they will be ready for the more sophisticated option of nice()--the one that makes it more versatile and less of a patch--namely rounding the answers to a given number of leading digits rather than to a given decimal precision the way round() does.

nice() gives a simple way to think about comparing floats: ask yourself at what "part per X" you no longer care whether the numbers differ. For approximately 1 part in 100, for example, use nice(x, 2) and nice(y, 2) to compare x and y (see the short example below). Replacing nice(*) with nice(*, 6) in the go() defined above produces no discrepancy between the values computed the two different ways.

Since the cost of str() and '%.*e' is nearly the same, perhaps leadingDigits=9 would be a good default, and the float(str()) option could be dropped from nice() entirely. Isn't nice() sort of a poor-man's decimal type without all the extra baggage?
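
To illustrate the part-per-X idea with the sketch given earlier (the numbers here are picked purely for illustration):

###
>>> x, y = 103.0, 103.9     # differ by roughly 1 part in 100
>>> nice(x, 2) == nice(y, 2)
True
>>> nice(x, 3) == nice(y, 3)
False
###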

| In my opinion, if you ever find
| yourself trying to do this, you're not thinking about
| the problem correctly, and your algorithm is simply
| wrong, even if you had infinitely precise floats.
|

As for real-world examples of when this would be nice, I will have to rely on others to justify it more heavily. Some quick examples that come to mind:

* Creating histograms of physical measurements with limited significant digits (i.e., not lots of digits from computation).
* Collecting sets of points within a certain range of a given value (e.g., all points within 10% of it).
* Stopping iterations when a computed error has fallen below a certain threshold. Here, getting the stopping condition exactly "right" is not critical, since doing one more iteration is usually harmless if the error happens to be a tiny bit larger than the required tolerance. Still, the leadingDigits option on nice() lets one pin down even this stopping condition to a limited precision, with something like:

###
tol = 1e-5
while 1:
    #do something and compute err
    if nice(err, 3) <= nice(tol, 3):
        break
###

By specifying a leadingDigits value of 3, the user is saying that it's fine to quit once err agrees with tol to its first three significant digits--in other words, even if err is a shade larger than tol. Since there is no additional cost in specifying more digits, a value of 9 could be used just as well.
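
For instance, a crude Newton iteration for sqrt(2) written with that pattern (a sketch only, using the nice() sketched earlier):

###
tol = 1e-5
old, new = 1.0, 1.5
while 1:
    old, new = new, (new + 2.0/new)/2   # one Newton step toward sqrt(2)
    err = abs(new - old)
    if nice(err, 3) <= nice(tol, 3):
        break
# new now approximates sqrt(2) to well within tol
###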

| Ismael at tutor wrote:
| How about overloading Float comparison? 

I'm not so adept at such things--how easy is this to do for all comparisons in a script? In an interactive session? For the latter, if it were easy, perhaps it could be part of a "newbie" mode that could be loaded. I realize that some (one respondent above has said so) would rather not have the issue pushed away at all: leave things as they are and learn to work around them, rather than be given a hand-holding device that will eventually let them down anyway. I'm wondering whether having to call a function to make the comparison would be like putting on your helmet before you cycle--a reminder that there may be hazards ahead and to proceed with caution. If so, then overloading the float comparison would be better left undone, and the "buckling up" would happen inside nice().
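
That said, if someone did want to experiment, I imagine the overloading would look something like the following -- a purely hypothetical sketch covering only == and !=:

###
>>> class NiceFloat(float):
...     # A float whose == and != compare str() forms, so that values
...     # which print the same compare equal.  Hypothetical sketch only.
...     def __eq__(self, other):
...         return str(self) == str(float(other))
...     def __ne__(self, other):
...         return not self == other
...
>>> NiceFloat(3*0.1) == 3/10.
True
>>> 3*0.1 == 3/10.
False
###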

| 
| If I have understood correctly, float to float comparison must be done
| comparing relative errors, so that when dealing with small but rightly
| represented numbers it won't tell "True" just because they're
| "close". I 
| think your/their solution only covers the case when dealing with "big"
| numbers.

Think of two small numbers that you suspect might fail the nice() test, then use the leadingDigits option (set to something like 6) and see whether the problem doesn't disappear. If I understand you correctly, this is such a case: x and y below are truly close; nice()'s default comparison says they are different, but nice(*,6) says they are the same--the same, that is, to the first 6 digits of the exponential representation:

###
>>> x=1.234567e-7
>>> y=1.234568e-7
>>> nice(x)==nice(y)
False
>>> nice(x,6)==nice(y,6)
True
###

| Chuck Allison wrote on edu-sig:
| There is a reliable way to compute the exact number of floating-point
| "intervals" (one less than the number of FP numbers) between any two
| FP numbers. It is a long-ago solved problem. I have attached a C++
| version. You can't define closeness by a "distance" in a FP system -
| you should use this measure instead (called "ulps" - units in the
| last place). The distance between large FP numbers may always be
| greater than the tolerance you prescribe. The spacing between
| adjacent FP numbers at the top of the scale for IEEE double precision
| numbers is 2^(972) (approx. 10^(293))! I doubt you're going to make
| your tolerance this big. I don't believe newbies can grasp this, but
| they can be taught to get a "feel" for floating-point number systems.
| You can't write reliable FP code without this understanding. See
| http://uvsc.freshsources.com/decimals.pdf.            
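
(Before replying: the ulps idea is easy enough to play with from Python. A rough sketch of my own -- not Chuck's attached C++ -- that assumes finite, positive IEEE doubles:)

###
>>> import struct
>>> def ulps_apart(a, b):
...     # Reinterpret each double's bits as a 64-bit integer.  For finite,
...     # positive IEEE doubles the integer ordering matches the float
...     # ordering, so the difference counts the representable doubles
...     # between a and b.
...     ia = struct.unpack('<q', struct.pack('<d', a))[0]
...     ib = struct.unpack('<q', struct.pack('<d', b))[0]
...     return abs(ia - ib)
...
>>> ulps_apart(3*0.1, 3/10.)
1
###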

A very readable 13-page introduction to some floating point issues. Thanks for the reference. The author concludes with:

"Computer science students don't need to be numerical analysts, but they may be called upon to write mathematical software. Indeed, scientists and engineers use tools like Matlab and Mathematica, but who implements these systems? It takes the expertise that only CS graduates have to write such sophisticated software. Without knowledge of the intricacies of floating-point computation, they will make a mess of things. In this paper I have surveyed the basics that every CS graduate should have mastered before they can be trusted in a workplace that does any kind of computing with real numbers."

So perhaps this brings us back to the original comment that "fp issues are a learning opportunity." They are. The question I have is: how soon do newbies need to run into them? Is decreasing the likelihood that they will see the problem (while not eliminating it) a good thing for the python community or not?

/c

