[Python-ideas] Python Numbers as Human Concept Decimal System
Terry Reedy
tjreedy at udel.edu
Sun Mar 9 23:22:29 CET 2014
On 3/9/2014 3:34 PM, Guido van Rossum wrote:
> I would dearly like to put this thread to rest, as it has strayed
> mightily from the topic of improvements to Python, and all points of
> view have been amply defended. I'm hoping to hear from Cowlishaw, but I
> expect he'll side with one of Mark Dickinson's proposals. I hope that
> somebody will take up the task to write a PEP about introducing a
> decimal literal; that sounds like an obtainable goal, less controversial
> than changing Decimal(<float>).
>
> I did do some more thinking about how the magic repr() affects the
> distribution of values, and came up with an example of sorts that might
> show what it does. We've mostly focused on simple values like 1.1, but to
> understand the distribution issue it's better to look at a very large value.
>
> I took 2**49 as an example and added a random fraction. When printed
> this always gives a single digit past the decimal point, e.g.
> 562949953421312.5. Then I measured the distribution of the last digit.
> What I found matched my prediction: the digits 0, 1, 2, 4, 5, 6, 8, 9
> occurred with roughly equal probability (1/8th). So 3 and 7 are
> completely missing.
>
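The experiment is easy to reproduce; here is a sketch (the sample size
is arbitrary):

import random
from collections import Counter

# At magnitude 2**49 a float keeps only 3 bits after the binary point,
# so repr() needs just one decimal digit to round-trip.
counts = Counter()
for _ in range(100000):
    x = 2**49 + random.random()
    counts[repr(x)[-1]] += 1    # the single digit after the point

print(sorted(counts.items()))
# Roughly equal counts for 0, 1, 2, 4, 5, 6, 8, 9; no 3s or 7s.
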
> The explanation is simple enough: using the (current) Decimal class it's
> easy to see that there are only 8 possible actual values, whose
> fractional part is a multiple of 1/8. IOW the exact values end in .000,
> .125, .250, .375, .500, .625, .750, .875. (*) The conclusion is that
> there are only 3 bits represented after the binary point, and repr()
> produces a single digit here, because that's all that's needed to
> correctly round back to the 8 possible values. So it picks the digit
> closest to each of the possible values, and when there are two
> possibilities it picks one. I don't know how it picks, but it is
> reproducible -- in this example it always chooses .2 to represent .250,
> and .8 to represent .750. The exact same thing happens later in the
> decimal expansion for smaller numbers.
It is doing 'round to even'.
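A sketch over the 8 representable fractions makes this visible (the
exact decimal expansions come straight from Decimal):

from decimal import Decimal

# The 8 exact values between 2**49 and 2**49 + 1, and what repr() picks.
for k in range(8):
    x = 2**49 + k / 8
    print(Decimal(x), '->', repr(x))
# The ties .250 and .750 go to the even digits .2 and .8.
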
> I think that the main objection to this distribution is that, while the
> exact distribution is evenly spaced (the possibilities are 1/8th away
> from each other), the rounded distribution has some gaps of 1/10th and
> some of 1/5th. I am not enough of a statistician to know whether this
> matters (the distribution of *digits* in the exact values is arguably
> less randomized :-) but at least this example clarified my understanding
> of the phenomenon we're talking about when we discuss distribution of
> values.
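Concretely, the digits that survive and the gaps between them (in
tenths):

digits = [0, 1, 2, 4, 5, 6, 8, 9, 10]   # final tenths that can occur
print([b - a for a, b in zip(digits, digits[1:])])
# [1, 1, 2, 1, 1, 2, 1, 1] -- gaps of 1/10 and 1/5
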
The fact that the rounding rule produces one round down and one round
up, with a distribution of non-0 final digits symmetric about 5,
ameliorates any statistical problems (this is the reason for 'round to
even'). If we add 0 as a final digit, we must remember (as you noted in
your footnote) that it is sometimes the result of rounding down to 0 and
sometimes the result of rounding up to 10 (with the 1 carried into the
preceding digit).
I think the statistical issue you framed is this. We have numbers
between 0 and 1 with a distribution that has a mean and variance. The
uniform [0,1] distribution (mean .5, variance 1/12) is the simplest,
though it may or may not be the most realistic in any particular
situation. We have two procedures.
1. Round each number to the nearest n/10, from 0/10 to 10/10.
2. Round each number to the nearest m/8 (0 <= m <= 8) and thence to some
n/10 as described above.
If we compare the mean and variance of the rounded numbers, how much
bias is introduced by the rounding process, and is the bias introduced
by the second procedure enough more than that of the first to worry about?
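A quick empirical check is a simulation (a sketch; the sample size is
arbitrary, and it assumes round-half-even at each step, as Python's
round() does and as the repr() behavior above suggests):

import random
from statistics import mean, pvariance

xs = [random.random() for _ in range(100000)]

# Procedure 1: straight to the nearest tenth.
r1 = [round(x, 1) for x in xs]
# Procedure 2: nearest eighth first, then the nearest tenth.
r2 = [round(round(x * 8) / 8, 1) for x in xs]

for r in (xs, r1, r2):
    print(mean(r), pvariance(r))
# Means all come out near .5; the variances approximate 1/12, .085,
# and .09625 (the exact values computed with p1 and p2 below).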
For a uniform distribution, there is no bias in the mean of the rounded
numbers. As I write this, I suspect there will be a small bias in the
variance, slightly more so in the second procedure, but not enough that
I would worry about it. Let's see.
>>> from statistics import pvariance as pvar
>>> # Rounded values (in tenths, scaled by 10) under procedure 1;
>>> # interior digits are twice as likely as the endpoints 0 and 10.
>>> p1 = (0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10)
>>> # Under procedure 2, 3s and 7s can never occur.
>>> p2 = (0, 1, 1, 2, 2, 4, 4, 5, 5, 6, 6, 8, 8, 9, 9, 10)
>>> print(1/12, pvar(p1)/100, pvar(p2)/100)  # /100 undoes the x10 scaling
0.08333333333333333 0.085 0.09625
Hmmm. The removal of the .3s and .7s biases the digits away from .5,
making the variance increase more than I expected. However, if I were
looking at numbers from 0.0 to 10.0 (instead of to 1.0), about half of
the .x double-rounded digits would be moved toward 5.0 rather than away
from it. Ditto for rounding numbers from 0.0 to 1.0 to 2 digits instead
of 1. Either way, I would expect the double-rounding bias to be much
reduced.
>>> def end37(n):
...     return (n % 10) in (3, 7)
...
>>> # The same comparison with 2 significant digits (values scaled by 100).
>>> p3 = list(range(100)) + list(range(1, 101))
>>> p4 = (list(i for i in range(100) if not end37(i))
...       + list(i for i in range(1, 101) if not end37(i)))
>>> print(1/12, pvar(p3)/10000, pvar(p4)/10000)
0.08333333333333333 0.08335 0.0834625
>>> p5 = list(range(1000)) + list(range(1, 1001))
>>> p6 = (list(i for i in range(1000) if not end37(i))
...       + list(i for i in range(1, 1001) if not end37(i)))
>>> print(1/12, pvar(p5)/1000000, pvar(p6)/1000000)
0.08333333333333333 0.0833335 0.083334625
For any realistic number of significant digits, variance bias for
uniformly distributed numbers is not an issue.
--
Terry Jan Reedy