[Numpy-discussion] Question about numpy.random.choice with probabilties

josef.pktd at gmail.com josef.pktd at gmail.com
Tue Jan 17 19:51:25 EST 2017


On Tue, Jan 17, 2017 at 6:58 PM, alebarde at gmail.com <alebarde at gmail.com>
wrote:

>
>
> 2017-01-17 22:13 GMT+01:00 Nadav Har'El <nyh at scylladb.com>:
>
>>
>> On Tue, Jan 17, 2017 at 7:18 PM, alebarde at gmail.com <alebarde at gmail.com>
>> wrote:
>>
>>> Hi Nadav,
>>>
>>> I may be wrong, but I think that the result of the current
>>> implementation is actually the expected one.
>>> Using your example: the probabilities for items 1, 2 and 3 are 0.2, 0.4 and
>>> 0.4.
>>>
>>> P([1,2]) = P([2] | 1st=[1]) P([1]) + P([1] | 1st=[2]) P([2])
>>>
>>
>> Yes, this formula does fit well with the actual algorithm in the code.
>> But, my question is *why* we want this formula to be correct:
>>
> Just a note: this formula is correct, and it follows from fundamental laws of
> probability: https://en.wikipedia.org/wiki/Law_of_total_probability +
> https://en.wikipedia.org/wiki/Bayes%27_theorem
> Thus, the result we get from random.choice IMHO definitely makes sense. Of
> course, I think we could always discuss implementing other sampling
> methods if they are useful to some application.
>
>
>>
>>> Now, P([1]) = 0.2 and P([2]) = 0.4. However:
>>> P([2] | 1st=[1]) = 0.5     (2 and 3 have the same sampling probability)
>>> P([1] | 1st=[2]) = 1/3     (1 and 3 have probability 0.2 and 0.4 that,
>>> once normalised, translate into 1/3 and 2/3 respectively)
>>> Therefore P([1,2]) = 0.7/3 = 0.23333
>>> Similarly, P([1,3]) = 0.23333 and P([2,3]) = 1.6/3 = 0.533333
>>>
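Just to make the arithmetic above concrete, here is a small sketch (plain
Python, variable names are only illustrative) that enumerates the two draw
orders behind each unordered pair under this sequential draw-then-renormalize
scheme and reproduces the numbers above:

    p = {1: 0.2, 2: 0.4, 3: 0.4}

    pair_prob = {}
    for first, p_first in p.items():
        rest_total = 1.0 - p_first          # renormalization after removing `first`
        for second, p_second in p.items():
            if second == first:
                continue
            # P(draw `first`, then `second` from the renormalized remainder)
            prob = p_first * (p_second / rest_total)
            key = frozenset((first, second))
            pair_prob[key] = pair_prob.get(key, 0.0) + prob

    for pair, prob in sorted(pair_prob.items(), key=lambda kv: sorted(kv[0])):
        print(sorted(pair), round(prob, 5))
    # [1, 2] 0.23333, [1, 3] 0.23333, [2, 3] 0.53333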
>>
>> Right, these are the numbers that the algorithm in the current code, and
>> the formula above, produce:
>>
>> P([1,2]) = P([1,3]) = 0.23333
>> P([2,3]) = 0.53333
>>
>> What I'm puzzled about is that these probabilities do not really fulfill
>> the given probability vector 0.2, 0.4, 0.4...
>> Let me try to explain:
>>
>> Why did the user choose the probabilities 0.2, 0.4, 0.4 for the three
>> items in the first place?
>>
>> One reasonable interpretation is that the user wants item 1 to appear in his
>> random picks half as often as item 2 or 3.
>> For example, maybe item 1 costs twice as much as item 2 or 3, so picking
>> it half as often will result in an equal expenditure on each item.
>>
>> If the user randomly picks the items individually (a single item at a
>> time), he indeed gets exactly this distribution: 0.2 of the time item 1,
>> 0.4 of the time item 2, 0.4 of the time item 3.
>>
>> Now, what happens if he picks not individual items, but pairs of
>> different items using numpy.random.choice with two items, replace=False?
>> Suddenly, the distribution of the individual items in the results gets
>> skewed: If we look at the expected number of times we'll see each item in
>> one draw of a random pair, we will get:
>>
>> E(1) = P([1,2]) + P([1,3]) = 0.46666
>> E(2) = P([1,2]) + P([2,3]) = 0.76666
>> E(3) = P([1,3]) + P([2,3]) = 0.76666
>>
>> Or renormalizing by dividing by 2:
>>
>> P(1) = 0.233333
>> P(2) = 0.383333
>> P(3) = 0.383333
>>
>> As you can see, these are not quite the probabilities we wanted (which were
>> 0.2, 0.4, 0.4)! In the random pairs we picked, item 1 was used a bit more
>> often than we wanted, and items 2 and 3 were used a bit less often!
>>
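This skew is easy to check empirically with numpy itself; a quick,
unoptimized simulation along these lines (the trial count is arbitrary)
gives item frequencies close to 0.233/0.383/0.383 rather than 0.2/0.4/0.4:

    import numpy as np

    p = [0.2, 0.4, 0.4]
    n_trials = 100_000
    counts = np.zeros(3)
    for _ in range(n_trials):
        pair = np.random.choice(3, size=2, replace=False, p=p)
        counts[pair] += 1

    # fraction of item slots taken by each item (two slots per draw)
    print(counts / (2 * n_trials))
    # roughly [0.233, 0.383, 0.383], not the requested [0.2, 0.4, 0.4]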
>
> p is not the probability of the output but that of the source finite
> population. I think that if you want to preserve that distribution, as
> Josef pointed out, you have to make the draws independent, that is, either
> sample with replacement or approximate an infinite population (which is
> basically the same thing). But of course in that case you can also end up
> with events of the form [X, X].
>
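As a quick illustration of that point (a sketch, nothing more): with
replace=True the draws are independent and the per-item frequencies match p,
at the cost of allowing repeated items within a pair:

    import numpy as np

    p = [0.2, 0.4, 0.4]
    n_trials = 100_000
    # independent draws; pairs like [2, 2] are now possible
    draws = np.random.choice(3, size=(n_trials, 2), replace=True, p=p)
    print(np.bincount(draws.ravel(), minlength=3) / draws.size)
    # roughly [0.2, 0.4, 0.4], as requested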

With replacement and keeping duplicates, the results might also be similar in
the pattern of the marginal probabilities:
https://onlinecourses.science.psu.edu/stat506/node/17

Another approach in survey sampling is to drop duplicates after sampling with
replacement, but then the sample size itself is random. (Again, I didn't try
to understand the small print.)

(Another related aside: the problem of a discrete sample space in small
samples also shows up in calculating hypothesis tests, e.g. Fisher's exact
test or similar. Because we only get a few discrete possibilities in the
sample space, it is not possible to construct a test that has exactly the
desired type 1 error.)
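To make that aside concrete with a toy example (assuming scipy is available):
with two groups of 5 and fixed margins there are only six possible 2x2
tables, so only a handful of attainable p-values, and hence only a few
attainable type 1 error levels:

    from scipy.stats import fisher_exact

    # all 2x2 tables with row and column margins fixed at 5
    for k in range(6):
        table = [[k, 5 - k], [5 - k, k]]
        _, pval = fisher_exact(table)
        print(table, round(pval, 4))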


Josef



>
>
>> So that is what prompted my question of why we consider these numbers right.
>>
>> In this example, it's actually possible to get the right item
>> distribution, if we pick the pair outcomes with the following probabilities:
>>
>>    P([1,2]) = 0.2        (not 0.233333 as above)
>>    P([1,3]) = 0.2
>>    P([2,3]) = 0.6        (not 0.533333 as above)
>>
>> Then, we get exactly the right P(1), P(2), P(3): 0.2, 0.4, 0.4
>>
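A two-line check of that claim (plain Python; the division by 2 is the same
renormalization as above):

    # corrected pair probabilities and the item marginals they induce
    pairs = {(1, 2): 0.2, (1, 3): 0.2, (2, 3): 0.6}
    marginal = {1: 0.0, 2: 0.0, 3: 0.0}
    for (a, b), prob in pairs.items():
        marginal[a] += prob / 2   # each draw contributes two item slots,
        marginal[b] += prob / 2   # so divide by 2 for per-slot frequencies
    print({item: round(m, 3) for item, m in marginal.items()})
    # {1: 0.2, 2: 0.4, 3: 0.4}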
>> Interestingly, fixing things as I suggest is not always possible.
>> Consider a different probability vector for three items: 0.99,
>> 0.005, 0.005. Now, no matter which algorithm we use for randomly picking
>> pairs from these three items, *each* returned pair will inevitably contain
>> at least one of the two very-low-probability items, so each of those items
>> will appear in roughly half the pairs, instead of in the vanishingly small
>> percentage we hoped for.
>>
>> But for other choices of probabilities (like the one in my original
>> example), there is a solution. For 2-out-of-3 sampling we can actually write
>> down a system of three linear equations in three unknowns, so there is always
>> a unique candidate solution; but if that solution has components that are not
>> valid probabilities (not in [0,1]), there is no valid solution at all, as
>> happens in the 0.99, 0.005, 0.005 example.
>>
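The system described above can be written out directly; a sketch with numpy
(x12, x13, x23 are the probabilities of the three unordered pairs):

    import numpy as np

    def pair_probs(p1, p2, p3):
        # item i must appear in the drawn pair with total probability 2*pi
        A = np.array([[1.0, 1.0, 0.0],    # x12 + x13 = 2*p1
                      [1.0, 0.0, 1.0],    # x12 + x23 = 2*p2
                      [0.0, 1.0, 1.0]])   # x13 + x23 = 2*p3
        return np.linalg.solve(A, 2 * np.array([p1, p2, p3]))

    print(pair_probs(0.2, 0.4, 0.4))      # [0.2 0.2 0.6]    -> valid probabilities
    print(pair_probs(0.99, 0.005, 0.005)) # [0.99 0.99 -0.98] -> negative, infeasible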
>>
>>
>>> What am I missing?
>>>
>>> Alessandro
>>>
>>>
>>> 2017-01-17 13:00 GMT+01:00 <numpy-discussion-request at scipy.org>:
>>>
>>>> Hi, I'm looking for a way to find a random sample of C different items out
>>>> of N items, with some desired probability Pi for each item i.
>>>>
>>>> I saw that numpy has a function that supposedly does this,
>>>> numpy.random.choice (with replace=False and a probabilities array), but
>>>> looking at the algorithm actually implemented, I am wondering in what sense
>>>> the probabilities Pi are actually obeyed...
>>>>
>>>> To me, the code doesn't seem to be doing the right thing... Let me
>>>> explain:
>>>>
>>>> Consider a simple numerical example: We have 3 items, and need to pick 2
>>>> different ones randomly. Let's assume the desired probabilities for
>>>> items 1, 2 and 3 are 0.2, 0.4 and 0.4.
>>>>
>>>> Working out the equations, there is exactly one solution here: the random
>>>> outcome of numpy.random.choice in this case should be [1,2] at probability
>>>> 0.2, [1,3] at probability 0.2, and [2,3] at probability 0.6. That is indeed
>>>> a solution for the desired probabilities, because it yields item 1 in
>>>> [1,2]+[1,3] = 0.2 + 0.2 = 0.4 = 2*P1 of the trials, item 2 in [1,2]+[2,3] =
>>>> 0.2+0.6 = 0.8 = 2*P2, etc.
>>>>
>>>> However, the algorithm in numpy.random.choice's replace=False generates, if
>>>> I understand correctly, different probabilities for the outcomes: I believe
>>>> in this case it generates [1,2] at probability 0.23333, [1,3] also at
>>>> 0.23333, and [2,3] at probability 0.53333.
>>>>
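That belief is easy to check against the implementation by simply counting
pairs (a rough simulation, not a proof; the trial count is arbitrary):

    import numpy as np
    from collections import Counter

    p = [0.2, 0.4, 0.4]
    n_trials = 100_000
    counts = Counter()
    for _ in range(n_trials):
        pair = np.random.choice([1, 2, 3], size=2, replace=False, p=p)
        counts[frozenset(pair.tolist())] += 1

    for pair, c in sorted(counts.items(), key=lambda kv: sorted(kv[0])):
        print(sorted(pair), c / n_trials)
    # roughly: [1, 2] 0.233, [1, 3] 0.233, [2, 3] 0.533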
>>>> My question is how does this result fit the desired probabilities?
>>>>
>>>> If we get [1,2] at probability 0.23333 and [1,3] at probability 0.23333,
>>>> then the expected number of "1" results we'll get per drawing is 0.23333 +
>>>> 0.23333 = 0.46666, and similarly for "2" the expected number is 0.76666,
>>>> and for "3" 0.76666. As you can see, the proportions are off: item 2 is NOT
>>>> twice as common as item 1, as we originally desired (we asked for
>>>> probabilities 0.2, 0.4, 0.4 for the individual items!).
>>>>
>>>>
>>>> --
>>>> Nadav Har'El
>>>> nyh at scylladb.com