[SciPy-User] problem with discrete rvs?

Mon Aug 17 20:05:00 EDT 2009

On Mon, Aug 17, 2009 at 6:28 PM, <josef.pktd at gmail.com> wrote:
> On Mon, Aug 17, 2009 at 6:10 PM, <josef.pktd at gmail.com> wrote:
>> On Mon, Aug 17, 2009 at 5:47 PM, Robin<robince at gmail.com> wrote:
>>> Hi,
>>>
>>> If I understand the scipy.stats package I should be able to create my
>>> own arbitrary discrete distribution and generate samples from it... I
>>> am trying this but am having the following problem. I define my
>>> discrete distribution with the following:
>>>
>>> vals = (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13,
>>> 14, 15, 16,
>>>       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]),
>>>  array([ 0.31518555,  0.08191701,  0.0817977 ,  0.02212394,  0.08149974,
>>>        0.02204335,  0.02201125,  0.00595341,  0.0817977 ,  0.02212394,
>>>        0.02209172,  0.00597518,  0.02201125,  0.00595341,  0.00594474,
>>>        0.00160788,  0.08161887,  0.02207557,  0.02204342,  0.00596212,
>>>        0.02196313,  0.0059404 ,  0.00593175,  0.00160437,  0.02204342,
>>>        0.00596212,  0.00595343,  0.00161023,  0.00593175,  0.00160437,
>>>        0.00160203,  0.0004333 ]))
>>>
>>> Then I do
>>> rv = rv_discrete(name='test',values=vals)
>>>
>>> and then try to generate samples. When I plot the results of this for
>>> large number of samples, it seems to match pretty well but the last 4
>>> or 5 values are never generated and have zero probability, even with
>>> enough samples to really capture them.
>>>
>>> In [742]: (te.rvs(size=1000000)==30).sum()
>>> Out[742]: 0
>>> In [743]: (te.rvs(size=1000000)==29).sum()
>>> Out[743]: 0
>>> In [744]: (te.rvs(size=1000000)==27).sum()
>>> Out[744]: 0
>>> In [745]: (te.rvs(size=1000000)==20).sum()
>>> Out[745]: 21884
>>>
>>> Does anyone have any ideas? Is this a bug in rv_discrete.rvs?
>>>
>>> In [746]: scipy.__version__
>>> Out[746]: '0.8.0.dev5825'
>>> In [747]: numpy.__version__
>>> Out[747]: '1.4.0.dev7039'
>>
>> just a quick check
>>
>>>>> vals[1].cumsum()
>> array([ 0.31518555,  0.39710256,  0.47890026,  0.5010242 ,  0.58252394,
>>        0.60456729,  0.62657854,  0.63253195,  0.71432965,  0.73645359,
>>        0.75854531,  0.76452049,  0.78653174,  0.79248515,  0.79842989,
>>        0.80003777,  0.88165664,  0.90373221,  0.92577563,  0.93173775,
>>        0.95370088,  0.95964128,  0.96557303,  0.9671774 ,  0.98922082,
>>        0.99518294,  1.00113637,  1.0027466 ,  1.00867835,  1.01028272,
>>        1.01188475,  1.01231805])
>>
>>
>> your probabilities don't add up to one.
>>
>> I haven't checked if this is the source of the problem
>>
>> Josef
>
> I don't know what your te is. but the following looks ok
>
> Josef
>
>
> import numpy as np
> from scipy import stats
>
>
> vals = [np.array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13,
> 14, 15, 16,
>      17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]),
>  np.array([ 0.31518555,  0.08191701,  0.0817977 ,  0.02212394,  0.08149974,
>       0.02204335,  0.02201125,  0.00595341,  0.0817977 ,  0.02212394,
>       0.02209172,  0.00597518,  0.02201125,  0.00595341,  0.00594474,
>       0.00160788,  0.08161887,  0.02207557,  0.02204342,  0.00596212,
>       0.02196313,  0.0059404 ,  0.00593175,  0.00160437,  0.02204342,
>       0.00596212,  0.00595343,  0.00161023,  0.00593175,  0.00160437,
>       0.00160203,  0.0004333 ])]
>
> vals[1] /= vals[1].sum()
>
>
> te = stats.rv_discrete(name='test',values=vals)
>
> nsample = 10000
> xrvs =te.rvs(size=nsample)
> for i in range(25,31):
>    print i, (xrvs==i).sum(), nsample * vals[1][i]
>
>
> ########## result
>
> 25 43 58.8957195814
> 26 56 58.8098769947
> 27 19 15.9063646055
> 28 62 58.5957150522
> 29 17 15.8484776598
> 30 11 15.8253623948
>
>>
>>>
>>> Cheers
>>>
>>> Robin
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>
>

and here is the chisquare test, which has a pvalue of 0.6, in my
previous example, and so it looks pretty good

>>> count = np.bincount(xrvs)
>>> count
array([3119,  803,  826,  199,  815,  218,  201,   65,  815,  215,  213,
         66,  222,   78,   53,    9,  777,  209,  230,   60,  238,   53,
         60,   20,  223,   43,   56,   19,   62,   17,   11,    5])
>>> stats.chisquare(count, nsample * vals[1])
(28.074609178075399, 0.61731069879207001)

There is an example in the stats tutorial (which is only in the doc
editor and never has been cleaned up), that shows as an example a
discrete approximation to the truncated standard normal using
rv_discrete.

Thanks for reporting suspicious behavior. I'm always interested in
where the bugs are hiding.
 I will check how easy it is to verify that the probabilities add up
to one (up to floating point precision), and raise a value error in
case they don't.

Josef