Hello,
>
I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:
>
[0,0,10,20,5]
>
could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.
>
I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data? Two
other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work. Also, it would be
good if I didn't have to decide beforehand what the possible range of
values is (which unfortunately I have to do with the approach I'm
currently working on).

My suggestion is to build a cumulative sum list, draw a random number
below the total, and find the matching index by binary search:

import bisect
import random

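data = [0, 0, 10, 20, 5]

# Running totals: cumsum[i] is the number of individuals with value <= i.
cumsum = []
for x in data:
    cumsum.append(cumsum[-1] + x if cumsum else x)

# Pick one individual uniformly, then locate the value it belongs to.
virtual_index = random.randrange(cumsum[-1])
actual_index = bisect.bisect_right(cumsum, virtual_index)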
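
If you don't want to fix the range of values in advance, the same idea
should carry over to a dictionary of counts instead of a list. Below is a
rough sketch, not a tuned implementation: build_sampler and the ages
sample are names and data I made up for illustration, and it assumes the
values can be sorted and that at least one count is nonzero.

```python
import bisect
import itertools
import random
from collections import Counter


def build_sampler(counts):
    """Return a function that draws values in proportion to their counts.

    `counts` maps each observed value to how often it occurred
    (hypothetical helper, not part of any library).
    """
    values = sorted(counts)  # distinct observed values
    cumsum = list(itertools.accumulate(counts[v] for v in values))
    total = cumsum[-1]

    def sample():
        # Pick one of the `total` individuals uniformly, then find the
        # value it belongs to with a binary search.
        return values[bisect.bisect_right(cumsum, random.randrange(total))]

    return sample


# Made-up observations. Counter accepts any iterable, so a generator
# reading records one at a time would work here as well.
ages = [2] * 10 + [3] * 20 + [4] * 5
counts = Counter(ages)  # stores only the values that actually occur
sample_age = build_sampler(counts)
draws = [sample_age() for _ in range(1000)]
```

Memory then grows with the number of distinct values rather than the
number of individuals. On Python 3.6 or later,
random.choices(values, cum_weights=cumsum, k=n) should give the same kind
of weighted draw without the hand-written bisect call.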

