[New-bugs-announce] [issue44080] Bias in random.choices(long_sequence)
Dennis Sweeney
report at bugs.python.org
Sat May 8 17:10:38 EDT 2021
New submission from Dennis Sweeney <sweeney.dennis650 at gmail.com>:
Problem: Random.choices() never returns entries at odd indices of a
sufficiently long sequence. For example:
>>> import random
>>> [x % 2 for x in random.choices(range(2**53), k=20)]
[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
>>> [x % 2 for x in random.choices(range(2**54), k=20)]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
This might require a 64-bit build.
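For context, my understanding is that the unweighted path boils down to
indexing with floor(random() * n), and random() can only return multiples
of 2**-53. The snippet below is a minimal sketch of that mechanism (not
CPython's actual source) showing why every index comes out even once the
length reaches 2**54:
>>> from math import floor
>>> from random import random
>>> # random() is k / 2**53 for some integer 0 <= k < 2**53, so
>>> # floor(random() * 2**54) == 2*k -- always an even index.
>>> [floor(random() * 2**54) % 2 for _ in range(20)]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]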
Possible solutions:
1) Do nothing.
Document the quirk, and recommend using
[seq[randrange(len(seq))] for i in range(k)]
2) Use integer arithmetic rather than random().
Using _randbelow() in the unweighted case alone breaks
test_random.test_choices_algorithms(), but adding a case for
`if isinstance(total, int)` and using _randbelow() there as
well creates a difference between the reproducibility
of weights=[1]*n versus weights=[1.0]*n.
(A rough sketch of the unweighted case follows this list.)
3) Raise an Exception or Warning when loss of precision is likely to occur.
Something like `if n > 2.0 ** BPF: raise OverflowError(...)`.
All smaller integers are exactly representable, but that doesn't
quite eliminate the bias (see below).
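For illustration, here is a rough sketch of what option 2 could look like
for the unweighted case only (it leans on the private Random._randbelow()
helper, so treat it as a sketch rather than a proposed patch):
    from random import Random

    def choices_unweighted(rng, population, k=1):
        # Draw exact integer indices with the (private) _randbelow() helper
        # instead of floor(random() * n); every index in [0, n) is then
        # reachable with equal probability, no matter how large n is.
        n = len(population)
        return [population[rng._randbelow(n)] for _ in range(k)]
With this, [x % 2 for x in choices_unweighted(Random(), range(2**54), k=20)]
should give a mix of 0s and 1s again.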
-----------------------
There is bias in random.choices() even when n <= 2**53:
>>> from collections import Counter
>>> from math import floor
>>> from random import choices, randrange
>>> Counter(val % 3 for val in
...         choices(range(2**53 - 2**51), k=1_000_000))
Counter({1: 374976, 0: 333613, 2: 291411})
>>> Counter(val % 3 for val in
...         choices(range(2**53 - 2**51), k=1_000_000)
...         if val < 2**51)
Counter({0: 166669, 1: 83664, 2: 83293})
For some explanation, imagine floats have
only 3 bits of precision. Then random.choices() would do something
like this on a sequence of length 7:
# Random 3-significant-bit floats
def random():
    return randrange(8) / 8
# selecting from a pool of size 2**BPF - 1
>>> Counter(floor(random() * 7) for _ in range(1_000_000))
Counter({0: 249566, 5: 125388, 6: 125251, 2: 125202, 1: 125149, 3: 124994, 4: 124450})
Issue: (0/8)*7 and (1/8)*7 both fall in [0, 1), so index 0 is hit
twice as often as any other index.
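To make that concrete, the 3-bit toy model can be enumerated exactly
(my addition, not part of the original report): index 0 absorbs two of
the eight possible random() outputs, every other index gets one.
>>> from math import floor
>>> from collections import Counter
>>> Counter(floor(k / 8 * 7) for k in range(8))
Counter({0: 2, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1})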
Similar biases:
>>> Counter(floor(randrange(16)/16*15) for _ in range(1_000_000))
Counter({0: 124549, 5: 62812, 14: 62807, 6: 62766, 10: 62740, 3: 62716, 12: 62658, 13: 62649, 9: 62473, 8: 62376, 2: 62373, 4: 62346, 11: 62293, 1: 62278, 7: 62164})
>>> Counter(floor(randrange(16)/16*14) for _ in range(1_000_000))
Counter({7: 125505, 0: 124962, 11: 62767, 9: 62728, 6: 62692, 10: 62684, 4: 62625, 3: 62465, 12: 62428, 13: 62397, 2: 62332, 8: 62212, 5: 62176, 1: 62027})
>>> def bias(bits, n):
...     c = Counter(floor(randrange(2**bits)/(2**bits) * n)
...                 for _ in range(1_000_000))
...     m = min(c.values())
...     preferred = [k for k, v in c.items() if v / m > 1.5]
...     preferred.sort()
...     return preferred
>>> bias(bits=4, n=15)
[0]
>>> bias(bits=4, n=14)
[0, 7]
>>> bias(bits=8, n=250)
[0, 41, 83, 125, 166, 208]
# Observation: the over-represented indices appear to follow this pattern
# (I have no proof that this holds in general)
>>> [250 * i // (2**8 - 250) for i in range(6)]
[0, 41, 83, 125, 166, 208]
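The sampled results above can be cross-checked by exact enumeration in the
same toy model (my sketch, continuing the interpreter session above): the
indices that receive two of the 2**bits grid points are exactly the ones
produced by the conjectured formula, at least for these small cases.
>>> def preferred_exact(bits, n):
...     # indices that receive more than one of the 2**bits grid points
...     c = Counter(floor(k / 2**bits * n) for k in range(2**bits))
...     return sorted(i for i, v in c.items() if v > 1)
...
>>> preferred_exact(4, 14)
[0, 7]
>>> preferred_exact(8, 250)
[0, 41, 83, 125, 166, 208]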
If we take the OverflowError approach, I think lowering the threshold to
2**BPF / 2 would only make some "bins" (indices) have 1.5x the probability
of others rather than 2x, and thus would not remove the problem either.
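That 1.5x figure can be checked exactly in the toy model as well (my sketch,
reading the threshold as n <= 2**(BPF-1)): with 4-bit floats and n = 7, every
index receives either 2 or 3 of the 16 grid points, so the worst-case ratio
between indices is 3/2 rather than 2.
>>> c = Counter(floor(k / 16 * 7) for k in range(16))
>>> c
Counter({0: 3, 3: 3, 1: 2, 2: 2, 4: 2, 5: 2, 6: 2})
>>> max(c.values()) / min(c.values())
1.5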
----------
components: Library (Lib)
messages: 393285
nosy: Dennis Sweeney, rhettinger
priority: normal
severity: normal
status: open
title: Bias in random.choices(long_sequence)
type: behavior
versions: Python 3.10, Python 3.11
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue44080>
_______________________________________