choose value from custom distribution
Peter Otten
__peter__ at web.de
Tue Oct 19 05:22:20 EDT 2010
Chris Rebert wrote:
> On Mon, Oct 18, 2010 at 11:40 PM, Arnaud Delobelle <arnodel at gmail.com>
> wrote:
>> elsa <kerensaelise at hotmail.com> writes:
>>> Hello,
>>>
>>> I'm trying to find a way to collect a set of values from real data,
>>> and then sample values randomly from this data - so, the data I'm
>>> collecting becomes a kind of probability distribution. For instance, I
>>> might have age data for some children. It's very easy to collect this
>>> data using a list, where the index gives the value of the data, and
>>> the number in the list gives the number of times that values occurs:
>>>
>>> [0,0,10,20,5]
>>>
>>> could mean that there are no people aged 0, no people aged 1, 10
>>> people aged 2, 20 people aged 3, and 5 people aged 4 in my data
>>> collection.
>>>
>>> I then want to make a random sample that would be representative of
>>> these proportions - is there any easy and fast way to select an entry
>>> weighted by its value? Or are there any python packages that allow you
>>> to easily create your own distribution based on collected data?
> <snip>
>> If you want to keep it simple, you can do:
>>
>>>>> t = [0,0,10,20,5]
>>>>> expanded = sum([[x]*f for x, f in enumerate(t)], [])
>>>>> random.sample(expanded, 10)
>> [3, 2, 2, 3, 2, 3, 2, 2, 3, 3]
>>>>> random.sample(expanded, 10)
>> [3, 3, 4, 3, 2, 3, 3, 3, 2, 2]
>>>>> random.sample(expanded, 10)
>> [3, 3, 3, 3, 3, 2, 3, 2, 2, 3]
>>
>> Is that what you need?
>
> The OP explicitly ruled that out:
>
>>> Two
>>> other things to bear in mind are that in reality I'm collating data
>>> from up to around 5 million individuals, so just making one long list
>>> with a new entry for each individual won't work.
Python can cope with a list of 5 million integer entries just fine on
average hardware. Eventually you may have to switch to Ian's cumulative
sums approach -- but not necessarily at 5*10**6.
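For reference, here is a minimal sketch of what I understand by the
cumulative-sums idea (my own reconstruction, not Ian's actual code):
build the running totals once, then use bisect to map a random integer
back to an age.

import bisect
import random

t = [0, 0, 10, 20, 5]  # t[age] = number of people with that age

cumulative = []  # running totals: [0, 0, 10, 30, 35]
total = 0
for f in t:
    total += f
    cumulative.append(total)

def pick():
    # draw r in [0, total) and find the first running total
    # greater than r; its index is the sampled age
    return bisect.bisect_right(cumulative, random.randrange(total))

print([pick() for _ in range(10)])

This needs memory proportional to the number of distinct ages rather
than the number of individuals, and each draw costs O(log n).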
>>> Also, it would be
>>> good if I didn't have to decide before hand what the possible range of
>>> values is (which unfortunately I have to do with the approach I'm
>>> currently working on).
This second objection seems invalid to me, too, and I think what Arnaud
provides is a useful counterexample.
However, if you (elsa) are operating near the limits of the available
memory on your machine, using sum() on lists is not a good idea. It does
the equivalent of
expanded = []
for x, f in enumerate(t):
    expanded = expanded + [x]*f
which creates a lot of "large" temporary lists where you want the more
memory-friendly
expanded = []
for x, f in enumerate(t):
    expanded.extend([x]*f)
    # expanded += [x]*f
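If even the short-lived [x]*f temporaries are a concern, a small
variation with itertools.repeat avoids building them at all (same
result, just no intermediate lists):

import itertools
import random

t = [0, 0, 10, 20, 5]
expanded = []
for x, f in enumerate(t):
    # repeat(x, f) yields x exactly f times without creating
    # an intermediate list
    expanded.extend(itertools.repeat(x, f))

print(random.sample(expanded, 10))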
> The internet is wrecking people's attention spans and reading
> comprehension.
Maybe, but I can't google the control group that is always offline, and
I have a hunch that facebook wouldn't work either ;)
Peter