[Python-ideas] Python Numbers as Human Concept Decimal System
Terry Reedy
tjreedy at udel.edu
Sun Mar 9 23:22:29 CET 2014
On 3/9/2014 3:34 PM, Guido van Rossum wrote:
> I would dearly like to put this thread to rest, as it has strayed
> mightily from the topic of improvements to Python, and all points of
> view have been amply defended. I'm hoping to hear from Cowlishaw, but I
> expect he'll side with one of Mark Dickinson's proposals. I hope that
> somebody will take up the task to write a PEP about introducing a
> decimal literal; that sounds like an obtainable goal, less controversial
> than changing Decimal(<float>).
>
> I did do some more thinking about how the magic repr() affects the
> distribution of values, and came up with an example of sorts that might
> show what it does. We've mostly focused on simple values like 1.1, but to
> understand the distribution issue it's better to look at a very large value.
>
> I took 2**49 as an example and added a random fraction. When printed
> this always gives a single digit past the decimal point, e.g.
> 562949953421312.5. Then I measured the distribution of the last digit.
> What I found matched my prediction: the digits 0, 1, 2, 4, 5, 6, 8, 9
> occurred with roughly equal probability (1/8th). So 3 and 7 are
> completely missing.
>
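The experiment is easy to reproduce; here is a sketch (the sample size
is arbitrary):

import random
from collections import Counter

# At magnitude 2**49 a float keeps only 3 bits after the binary point,
# so repr() needs just one decimal digit to round-trip.
counts = Counter()
for _ in range(100000):
    x = 2**49 + random.random()
    counts[repr(x)[-1]] += 1    # the single digit after the point

print(sorted(counts.items()))
# Roughly equal counts for 0, 1, 2, 4, 5, 6, 8, 9; no 3s or 7s.
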
> The explanation is simple enough: using the (current) Decimal class it's
> easy to see that there are only 8 possible actual values, whose
> fractional part is a multiple of 1/8. IOW the exact values end in .000,
> .125, .250, .375, .500, .625, .750, .875. (*) The conclusion is that
> there are only 3 bits represented after the binary point, and repr()
> produces a single digit here, because that's all that's needed to
> correctly round back to the 8 possible values. So it picks the digit
> closest to each of the possible values, and when there are two
> possibilities it picks one. I don't know how it picks, but it is
> reproducible -- in this example it always chooses .2 to represent .250,
> and .8 to represent .750. The exact same thing happens later in the
> decimal expansion for smaller numbers.
It is doing 'round to even'.
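A sketch over the 8 representable fractions makes this visible (the
exact decimal expansions come straight from Decimal):

from decimal import Decimal

# The 8 exact values between 2**49 and 2**49 + 1, and what repr() picks.
for k in range(8):
    x = 2**49 + k / 8
    print(Decimal(x), '->', repr(x))
# The ties .250 and .750 go to the even digits .2 and .8.
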
> I think that the main objection to this distribution is that, while the
> exact distribution is evenly spaced (the possibilities are 1/8th away
> from each other), the rounded distribution has some gaps of 1/10th and
> some of 1/5th. I am not enough of a statistician to know whether this
> matters (the distribution of *digits* in the exact values is arguably
> less randomized :-) but at least this example clarified my understanding
> of the phenomenon we're talking about when we discuss distribution of
> values.
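Concretely, the digits that survive and the gaps between them (in
tenths):

digits = [0, 1, 2, 4, 5, 6, 8, 9, 10]   # final tenths that can occur
print([b - a for a, b in zip(digits, digits[1:])])
# [1, 1, 2, 1, 1, 2, 1, 1] -- gaps of 1/10 and 1/5
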
The fact that the rounding rule produces one round down and one round
up, with a distribution of non-0 final digits symmetric about 5,
ameliorates any statistical problems (this is the reason for 'round to
even'). If we add 0 as a final digit, we must remember (as you noted in
your footnote) that it is sometimes the result of rounding down to 0 and
sometimes the result of rounding up to 10 (with the 1 carried into the
preceding digit).
I think the statistical issue you framed is this. We have numbers
between 0 and 1 with a distribution that has a mean and variance. The
uniform [0,1] distribution (mean .5, variance 1/12) is the simplest,
though it may or may not be the most realistic in any particular
situation. We have two procedures.
1. Round each number to the nearest n/10, from 0/10 to 10/10.
2. Round each number to the nearest m/8 (0 <= m <= 8) and thence to some
n/10 as described above.
If we compare the mean and variance of the rounded numbers, how much
bias is introduced by the rounding process, and is the bias introduced
by the second procedure enough more than that of the first to worry about?
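A quick empirical check is a simulation (a sketch; the sample size is
arbitrary, and it assumes round-half-even at each step, as Python's
round() does and as the repr() behavior above suggests):

import random
from statistics import mean, pvariance

xs = [random.random() for _ in range(100000)]

# Procedure 1: straight to the nearest tenth.
r1 = [round(x, 1) for x in xs]
# Procedure 2: nearest eighth first, then the nearest tenth.
r2 = [round(round(x * 8) / 8, 1) for x in xs]

for r in (xs, r1, r2):
    print(mean(r), pvariance(r))
# Means all come out near .5; the variances approximate 1/12, .085,
# and .09625 (the exact values computed with p1 and p2 below).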
For a uniform distribution, there is no bias in the mean of the rounded
numbers. As I write this, I suspect there will be a small bias in the
variance, slightly more so in the second procedure, but not enough that
I would worry about it. Let's see.
>>> from statistics import pvariance as pvar
>>> # Rounded values (in tenths, scaled by 10) under procedure 1;
>>> # interior digits are twice as likely as the endpoints 0 and 10.
>>> p1 = (0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10)
>>> # Under procedure 2, 3s and 7s can never occur.
>>> p2 = (0, 1, 1, 2, 2, 4, 4, 5, 5, 6, 6, 8, 8, 9, 9, 10)
>>> print(1/12, pvar(p1)/100, pvar(p2)/100)  # /100 undoes the x10 scaling
0.08333333333333333 0.085 0.09625
Hmmm. The removal of the .3s and .7s biases the digits away from .5,
making the variance increase more than I expected. However, if I were
looking at numbers from 0.0 to 10.0 (instead of to 1.0), about half of
the .x double-rounded digits would be moved toward 5.0 rather than away
from it. Ditto for rounding numbers from 0.0 to 1.0 to 2 digits instead
of 1. Either way, I would expect the double-rounding bias to be much
reduced.
>>> def end37(n):
...     return (n % 10) in (3, 7)
...
>>> # The same comparison with 2 significant digits (values scaled by 100).
>>> p3 = list(range(100)) + list(range(1, 101))
>>> p4 = (list(i for i in range(100) if not end37(i))
...       + list(i for i in range(1, 101) if not end37(i)))
>>> print(1/12, pvar(p3)/10000, pvar(p4)/10000)
0.08333333333333333 0.08335 0.0834625
>>> p5 = list(range(1000)) + list(range(1, 1001))
>>> p6 = (list(i for i in range(1000) if not end37(i))
...       + list(i for i in range(1, 1001) if not end37(i)))
>>> print(1/12, pvar(p5)/1000000, pvar(p6)/1000000)
0.08333333333333333 0.0833335 0.083334625
For any realistic number of significant digits, variance bias for
uniformly distributed numbers is not an issue.
--
Terry Jan Reedy