[SciPy-user] scipy.stats rv objects from data
Michael
mnandris at btinternet.com
Mon Apr 28 05:02:31 EDT 2008
rv_discrete: most (if not all) of the scipy.stats functions +
numpy.random cannot handle zeros as _inputs_ (don't know whether this is
related to getting zero's _out_, but it might be). The zero problem is,
I am told, due to the underlying c code, not python. A quick workaround
is to substitute any zeros for a small number like 1e-16
On Sun, 2008-04-27 at 16:29 -0700, Erik Tollerud wrote:
> I'm finding the scipy.stats documentation somewhat difficult to
> follow, so maybe the answer to this question is in there... I can't
> really find it, though.
> What I have is a sequence of numbers X_i . Two things I'd like to be
> able to do with this:
> 1. Create a discrete probability distribution (class rv_discrete) from
> this data so as to use the utility functions that take rv_discrete
> objects.
> The rv_discrete documentation suggests should be easy. I did the following
> >>>ddist=rv_discrete(values=(x,[1/len(x) for i in x]),name='test')
> >>>ddist.pmf(50)
> array(0.0)
There are at least 2 ways of using rv_discrete
e.g. 2 ways to calculate the next element of a simple Markov Chain with
x(n+1)=Norm(0.5 x(n),1)
from scipy.stats import rv_discrete
from numpy.random import multinomial
x = 3
n1 = stats.rv_continuous.rvs( stats.norm, 0.5*x, 1.0 )[0]
print n1
n2 = stats.rv_discrete.rvs( stats.rv_discrete( name='sample',
values=([0,1,2],[3/10.,5/10.,2/10.])), 0.5*x, 1.0 )[0]
print n2
print
sample = stats.rv_discrete( name='sample',
values=([0,1,2],[3/10.,5/10.,2/10.]) ).rvs( size=10 )
print sample
The multinomial distribution from numpy.random is somewhat faster (40
times or so) but has a different idiom:
SIZE = 100000
VALUES = [0,1,2,3,4,5,6,7]
PROBS = [1/8.,1/8.,1/8.,1/8.,1/8.,1/8.,1/8.,1/8.]
The idiom for rv_discrete is
rv_discrete( name='sample', values=(VALUES,PROBS) )
The idiom for numpy.multinomial is different; if memory serves, you get
frequencies as output instead of the actual values
multinomial( SIZE, PROBS )
>>> from numpy.random import multinomial
>>> multinomial(100,[ 0.2, 0.4, 0.1, 0.3 ])
array([12, 44, 10, 34])
>>> multinomial( 100, [0.2, 0.0, 0.8, 0.0] ) <-- don't do this
...
>>> multinomial( 100, [0.2, 1e-16, 0.8, 1e-16] ) <-- or this
>>> multinomial( 100, [0.2-1e-16, 1e-16, 0.8-1e-16, 1e-16] ) <-- ok
array([21, 0, 79, 0])
the last one is ok since the probability adds up to 1... painful, but it
works
> Any value I try to get of the pmf seems to be 0. Do I have to
> explicitly subclass rv_discrete with my data and a _pmf method or
> something? This seems like a very natural thing to want to do, and
> hence it seems odd to not have some helper like
> make_dist(x,name='whatever') . I can take a shot at creating such a
> function, but I don't want to do so if one exists.
>
> 2. Create a continuous probability distribution from something like
> spline fitting or simple linear interpolation of a the data in X_i.
> Does this require explict subclassing, or is there a straightforward
> way to do it that's builtin? I'm not sure if this step is strictly
> necessary - what I really want to do is be able to draw from the
> discrete distribution in 1 just by sampling the cdf... maybe this is
> how it's supposed to work with the discrete distribution, but when I
> tried to sample it using ddist.rvs, I would always get the input
> values I specified rather random values sampled from the cdf.
Continuous v's discrete: i found this in ./stats/scstats.py
from scipy import stats, r_
from pylab import show, plot
import copy
# SOURCE: ./stats/scstats.py
SPREAD = 10
class cdf( object ):
""" Baseclass for the task of determining a sequence of numbers {vi}
which is distributed as a random variable X
"""
def integerDensityFunction( self ):
"""
Outputs an integer density function: xs (ints) and ys (probabilities)
which are the correspondence between the whole numbers on the x axis
to the probabilities on the y axis, according to a normal distribution.
"""
opt = []
for i in r_[-SPREAD:SPREAD:100j]: # 2-tailed test (?)
opt.append(( i, stats.norm.cdf(i) )) # ( int, P(int) )
return zip(*opt) # [ (int...), (P...) ]
def display( self ):
xs, ys = self.integerDensityFunction()
plot( xs, ys )
show()
if __name__=='__main__':
d = cdf()
d.display()
Continuous: i can only suggest using rv_continuous
stats.rv_continuous.rvs( stats.norm, 0.5*x, 1.0 ).whatever
.rvs( shape, loc, scale ) is the random variates
.pdf( x, shape, loc, scale ) is the probability density function which,
i think, is or should be genuinely continuous
> I'm on scipy 0.6.0 and numpy 1.0.4
> _______________________________________________
> SciPy-user mailing list
> SciPy-user at scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-user
More information about the SciPy-User
mailing list