[Numpy-discussion] Statistical distributions on samples

Mon Aug 15 15:40:16 EDT 2011

On Mon, Aug 15, 2011 at 8:53 AM, Andrea Gavana <andrea.gavana at gmail.com>wrote:

> Hi Chris and All,
>
> On 12 August 2011 16:53, Christopher Jordan-Squire wrote:
> > Hi Andrea--An easy way to get something like this would be
> >
> > import numpy as np
> > import scipy.stats as stats
> >
> > sigma = #some reasonable standard deviation for your application
> > x = stats.norm.rvs(size=1000, loc=125, scale=sigma)
> > x = x[x>50]
> > x = x[x<200]
> >
> > That will give a roughly normal distribution to your velocities, as long
> as,
> > say, sigma<25. (I'm using the rule of thumb for the normal distribution
> that
> > normal random samples lie 3 standard deviations away from the mean about
> 1
> > out of 350 times.) Though you won't be able to get exactly normal errors
> > about your mean since normal random samples can theoretically be of any
> > size.
> >
> > You can use this same process for any other distribution, as long as
> you've
> > chosen a scale variable so that the probability of samples being outside
> > your desired interval is really small. Of course, once again your random
> > errors won't be exactly from the distribution you get your original
> samples
> > from.
>
> Thank you for your suggestion. There are a couple of things I am not
> clear with, however. The first one (the easy one), is: let's suppose I
> need 200 values, and the accept/discard procedure removes 5 of them
> from the list. Is there any way to draw these 200 values from a bigger
> sample so that the accept/reject procedure will not interfere too
> much? And how do I get 200 values out of the bigger sample so that
> these values are still representative?
>

FWIW, I'm not really advocating a truncated normal so much as making the
standard deviation small enough so that there's no real difference between a
true normal distribution and a truncated normal.

If you're worried about getting exactly 200 samples, then you could sample N
with N>200 and such that after throwing out the ones that lie outside your
desired region you're left with M>200. Then just randomly pick 200 from
those M. That shouldn't bias anything as long as you randomly pick them. (Or
just pick the first 200, if you haven't done anything to impose any order on
the samples, such as sorting them by size.) But I'm not sure why you'd want
exactly 200 samples instead of some number of samples close to 200.

>
> Another thing, possibly completely unrelated. I am trying to design a
> toy Latin Hypercube script (just for my own understanding). I found
> this piece of code on the web (and I modified it slightly):
>
> def lhs(dist, size=100):
>    '''
>    Latin Hypercube sampling of any distrbution.
>    dist is is a scipy.stats random number generator
>    such as stats.norm, stats.beta, etc
>    parms is a tuple with the parameters needed for
>    the specified distribution.
>
>    :Parameters:
>        - `dist`: random number generator from scipy.stats module.
>        - `size` :size for the output sample
>    '''
>
>    n = size
>
>    perc = numpy.arange(0.0, 1.0, 1.0/n)
>    numpy.random.shuffle(perc)
>
>    smp = [stats.uniform(i,1.0/n).rvs() for i in perc]
>
>    v = dist.ppf(smp)
>
>    return v
>
>
> Now, I am not 100% clear of what the percent point function is (I have
> read around the web, but please keep in mind that my statistical
> skills are close to minus infinity). From this page:
>
> http://www.itl.nist.gov/div898/handbook/eda/section3/eda362.htm
>
>
The ppf is what's called the quantile function elsewhere. I do not know why
scipy calls it the ppf/percent point function.

The quantile function is the inverse of the cumulative density function
(cdf). So dist.ppf(z) is the x such that P(dist <= x) = z. Roughly. (Things
get slightly more finicky if you think about discrete distributions because
then you have to pick what happens at the jumps in the cdf.) So
dist.ppf(0.5) gives the median of dist, and dist.ppf(0.25) gives the
lower/first quartile of dist.

> I gather that, if you plot the results of the ppf, with the horizontal
> axis as probability, the vertical axis goes from the smallest to the
> largest value of the cumulative distribution function. If i do this:
>
> numpy.random.seed(123456)
>
> distribution = stats.norm(loc=125, scale=25)
>
> my_lhs = lhs(distribution, 50)
>
> Will my_lhs always contain valid values (i.e., included between 50 and
> 200)? I assume the answer is no... but even if this was the case, is
> this my_lhs array ready to be used to setup a LHS experiment when I
> have multi-dimensional problems (in which all the variables are
> completely independent from each other - no correlation)?
>
>
I'm not really sure if the above function is doing the lhs you want.   To
answer your question, it won't always generate values within [50,200]. If
size is large enough then you're dividing up the probability space evenly.
So even with the random perturbations (whose use I don't really understand),
you'll ensure that some of the samples you get when you apply the ppf will
correspond to the extremely low probability samples that are <50 or >200.

-Chris JS

My apologies for the idiocy of the questions.
>
> Andrea.
>
> "Imagination Is The Only Weapon In The War Against Reality."
> http://xoomer.alice.it/infinity77/
>
> >>> import PyQt4.QtGui
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in <module>
> ImportError: No module named PyQt4.QtGui
> >>>
> >>> import pygtk
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in <module>
> ImportError: No module named pygtk
> >>>
> >>> import wx
> >>>
> >>>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110815/aeb40e28/attachment.html>