[SciPy-User] problem with discrete rvs?
josef.pktd at gmail.com
josef.pktd at gmail.com
Mon Aug 17 20:05:00 EDT 2009
On Mon, Aug 17, 2009 at 6:28 PM, <josef.pktd at gmail.com> wrote:
> On Mon, Aug 17, 2009 at 6:10 PM, <josef.pktd at gmail.com> wrote:
>> On Mon, Aug 17, 2009 at 5:47 PM, Robin<robince at gmail.com> wrote:
>>> Hi,
>>>
>>> If I understand the scipy.stats package I should be able to create my
>>> own arbitrary discrete distribution and generate samples from it... I
>>> am trying this but am having the following problem. I define my
>>> discrete distribution with the following:
>>>
>>> vals = (array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
>>> 14, 15, 16,
>>> 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]),
>>> array([ 0.31518555, 0.08191701, 0.0817977 , 0.02212394, 0.08149974,
>>> 0.02204335, 0.02201125, 0.00595341, 0.0817977 , 0.02212394,
>>> 0.02209172, 0.00597518, 0.02201125, 0.00595341, 0.00594474,
>>> 0.00160788, 0.08161887, 0.02207557, 0.02204342, 0.00596212,
>>> 0.02196313, 0.0059404 , 0.00593175, 0.00160437, 0.02204342,
>>> 0.00596212, 0.00595343, 0.00161023, 0.00593175, 0.00160437,
>>> 0.00160203, 0.0004333 ]))
>>>
>>> Then I do
>>> rv = rv_discrete(name='test',values=vals)
>>>
>>> and then try to generate samples. When I plot the results of this for
>>> large number of samples, it seems to match pretty well but the last 4
>>> or 5 values are never generated and have zero probability, even with
>>> enough samples to really capture them.
>>>
>>> In [742]: (te.rvs(size=1000000)==30).sum()
>>> Out[742]: 0
>>> In [743]: (te.rvs(size=1000000)==29).sum()
>>> Out[743]: 0
>>> In [744]: (te.rvs(size=1000000)==27).sum()
>>> Out[744]: 0
>>> In [745]: (te.rvs(size=1000000)==20).sum()
>>> Out[745]: 21884
>>>
>>> Does anyone have any ideas? Is this a bug in rv_discrete.rvs?
>>>
>>> In [746]: scipy.__version__
>>> Out[746]: '0.8.0.dev5825'
>>> In [747]: numpy.__version__
>>> Out[747]: '1.4.0.dev7039'
>>
>> just a quick check
>>
>>>>> vals[1].cumsum()
>> array([ 0.31518555, 0.39710256, 0.47890026, 0.5010242 , 0.58252394,
>> 0.60456729, 0.62657854, 0.63253195, 0.71432965, 0.73645359,
>> 0.75854531, 0.76452049, 0.78653174, 0.79248515, 0.79842989,
>> 0.80003777, 0.88165664, 0.90373221, 0.92577563, 0.93173775,
>> 0.95370088, 0.95964128, 0.96557303, 0.9671774 , 0.98922082,
>> 0.99518294, 1.00113637, 1.0027466 , 1.00867835, 1.01028272,
>> 1.01188475, 1.01231805])
>>
>>
>> your probabilities don't add up to one.
>>
>> I haven't checked if this is the source of the problem
>>
>> Josef
>
> I don't know what your te is. but the following looks ok
>
> Josef
>
>
> import numpy as np
> from scipy import stats
>
>
> vals = [np.array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
> 14, 15, 16,
> 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]),
> np.array([ 0.31518555, 0.08191701, 0.0817977 , 0.02212394, 0.08149974,
> 0.02204335, 0.02201125, 0.00595341, 0.0817977 , 0.02212394,
> 0.02209172, 0.00597518, 0.02201125, 0.00595341, 0.00594474,
> 0.00160788, 0.08161887, 0.02207557, 0.02204342, 0.00596212,
> 0.02196313, 0.0059404 , 0.00593175, 0.00160437, 0.02204342,
> 0.00596212, 0.00595343, 0.00161023, 0.00593175, 0.00160437,
> 0.00160203, 0.0004333 ])]
>
> vals[1] /= vals[1].sum()
>
>
> te = stats.rv_discrete(name='test',values=vals)
>
> nsample = 10000
> xrvs =te.rvs(size=nsample)
> for i in range(25,31):
> print i, (xrvs==i).sum(), nsample * vals[1][i]
>
>
> ########## result
>
> 25 43 58.8957195814
> 26 56 58.8098769947
> 27 19 15.9063646055
> 28 62 58.5957150522
> 29 17 15.8484776598
> 30 11 15.8253623948
>
>>
>>>
>>> Cheers
>>>
>>> Robin
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>
>
and here is the chisquare test, which has a pvalue of 0.6, in my
previous example, and so it looks pretty good
>>> count = np.bincount(xrvs)
>>> count
array([3119, 803, 826, 199, 815, 218, 201, 65, 815, 215, 213,
66, 222, 78, 53, 9, 777, 209, 230, 60, 238, 53,
60, 20, 223, 43, 56, 19, 62, 17, 11, 5])
>>> stats.chisquare(count, nsample * vals[1])
(28.074609178075399, 0.61731069879207001)
There is an example in the stats tutorial (which is only in the doc
editor and never has been cleaned up), that shows as an example a
discrete approximation to the truncated standard normal using
rv_discrete.
Thanks for reporting suspicious behavior. I'm always interested in
where the bugs are hiding.
I will check how easy it is to verify that the probabilities add up
to one (up to floating point precision), and raise a value error in
case they don't.
Josef
More information about the SciPy-User
mailing list