random.sample with large weighted sample-sets?
Peter Otten
__peter__ at web.de
Sun Feb 16 17:12:57 CET 2014
Tim Chase wrote:
> On 2014-02-16 04:12, Terry Reedy wrote:
>> On 2/15/2014 11:41 PM, Tim Chase wrote:
>> > data = (
>> > ("apple", 20),
>> > ("orange", 50),
>> > ("grape", 30),
>> > )
>
> To Ben, yes, this was just some sample data; the original gets built
> from an external (i.e., client-supplied, thus the need to gracefully
> support crazy-large numbers) data source and is indeed actually a list
> rather than a tuple. When entering static data like this, I often
> default to an outer tuple (rather than list) just as a hint/reminder
> to myself that I don't expect this to change at runtime (and have
> Python yell at me if I accidentally try).
>
>> If you actually start with date in this form, write the few lines
>> needed to produce the form below.
>>
>> import bisect
>> import random
>>
>> data = [
>> (0, 'apple'),
>> (20, 'orange'),
>> (70, 'grape'),
>> ]
>>
>> for i in range(10):
>> r = random.randrange(0, 100)
>> i = bisect.bisect(data, (r, 'zzzzz')) - 1
>> print(data[i][1])
>
> Trying to read what may be implicit assumptions in your code:
>
> 1) your code calculates "100" as sum(item[0] for item in data)
>
> 2) the data has to be sorted for bisect to work
>
> 3) you meant to write "(10, 'apple')" rather than 0. With my original
> example code, a 0-probability shouldn't ever show up in the sampling,
> where it looks like it might when using this sample code. In my
> particular use case, I can limit/ensure that 0-probability items never
> appear in the list, filtering them upon loading.
>
> 4) that "zzzzzz" is some arbitrary value that should come after any
> string that could appear in the data; perhaps using some custom
> "InfinityString" class where everything compared to it is always less
> than it.
>
> So it would be
>
> class InfinityString:
> def __gt__(self, other): True
> __ge__ = __gt__
> def __lt__(self, other): False
> __eq__ = __le__ = __ne__ = __lt__
> infinity_string = InfinityString()
> data = load_data() # list of (quantity, value) tuples
> data.sort()
> total = sum(qty for qty, value in data)
> for i in range(num_to_sample):
> r = random.randrange(0, total)
> i = bisect.bisect(data, (r, infinity_string)) - 1
> use(data[i][1])
>
> Some long-running testing on this code seems to show that if two
> items have the same probability, bisect only appears to find the last
> one. Tested with
>
> data = [
> (10, "apple"),
> (20, "banana"), # I never get any bananas, even after thousands of
> iterations (20, "grape"),
> (50, "orange"),
> ]
I think it becomes simpler if you make an intermediate list with the
cumulated probabilities. You can then avoid the InfinityString gymnastics:
import random, bisect
def cumulated(probs):
sigma = 0
cumprobs = []
for p in probs:
sigma += p
cumprobs.append(sigma)
return cumprobs
def pick(cumprobs):
return bisect.bisect(cumprobs, random.randrange(cumprobs[-1]))
data = [
(10, "apple"),
(20, "banana"),
(20, "grape"),
(50, "orange"),
]
cumprobs = cumulated(p for p, k in data)
# check for off-by-one bugs
bins = [0] * len(cumprobs)
for i in range(cumprobs[-1]):
bins[bisect.bisect(cumprobs, i)] += 1
assert bins == [p for p, k in data]
# use it
bins = [0] * len(cumprobs)
for i in range(10000):
bins[pick(cumprobs)] += 1
for item, bin in zip(data, bins):
print("{} --> {}".format(item, bin))
More information about the Python-list
mailing list