[Numpy-discussion] Statistical distributions on samples

Mon Aug 15 09:53:17 EDT 2011

Hi Chris and All,

On 12 August 2011 16:53, Christopher Jordan-Squire wrote:
> Hi Andrea--An easy way to get something like this would be
>
> import numpy as np
> import scipy.stats as stats
>
> sigma = #some reasonable standard deviation for your application
> x = stats.norm.rvs(size=1000, loc=125, scale=sigma)
> x = x[x>50]
> x = x[x<200]
>
> That will give a roughly normal distribution to your velocities, as long as,
> say, sigma<25. (I'm using the rule of thumb for the normal distribution that
> normal random samples lie 3 standard deviations away from the mean about 1
> out of 350 times.) Though you won't be able to get exactly normal errors
> about your mean since normal random samples can theoretically be of any
> size.
>
> You can use this same process for any other distribution, as long as you've
> chosen a scale variable so that the probability of samples being outside
> your desired interval is really small. Of course, once again your random
> errors won't be exactly from the distribution you get your original samples
> from.

Thank you for your suggestion. There are a couple of things I am not
clear with, however. The first one (the easy one), is: let's suppose I
need 200 values, and the accept/discard procedure removes 5 of them
from the list. Is there any way to draw these 200 values from a bigger
sample so that the accept/reject procedure will not interfere too
much? And how do I get 200 values out of the bigger sample so that
these values are still representative?

Another thing, possibly completely unrelated. I am trying to design a
toy Latin Hypercube script (just for my own understanding). I found
this piece of code on the web (and I modified it slightly):

def lhs(dist, size=100):
    '''
    Latin Hypercube sampling of any distrbution.
    dist is is a scipy.stats random number generator
    such as stats.norm, stats.beta, etc
    parms is a tuple with the parameters needed for
    the specified distribution.

    :Parameters:
        - `dist`: random number generator from scipy.stats module.
        - `size` :size for the output sample
    '''

    n = size

    perc = numpy.arange(0.0, 1.0, 1.0/n)
    numpy.random.shuffle(perc)

    smp = [stats.uniform(i,1.0/n).rvs() for i in perc]

    v = dist.ppf(smp)

    return v

Now, I am not 100% clear of what the percent point function is (I have
read around the web, but please keep in mind that my statistical
skills are close to minus infinity). From this page:

http://www.itl.nist.gov/div898/handbook/eda/section3/eda362.htm

I gather that, if you plot the results of the ppf, with the horizontal
axis as probability, the vertical axis goes from the smallest to the
largest value of the cumulative distribution function. If i do this:

numpy.random.seed(123456)

distribution = stats.norm(loc=125, scale=25)

my_lhs = lhs(distribution, 50)

Will my_lhs always contain valid values (i.e., included between 50 and
200)? I assume the answer is no... but even if this was the case, is
this my_lhs array ready to be used to setup a LHS experiment when I
have multi-dimensional problems (in which all the variables are
completely independent from each other - no correlation)?

My apologies for the idiocy of the questions.

Andrea.

"Imagination Is The Only Weapon In The War Against Reality."
http://xoomer.alice.it/infinity77/

>>> import PyQt4.QtGui
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ImportError: No module named PyQt4.QtGui
>>>
>>> import pygtk
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ImportError: No module named pygtk
>>>
>>> import wx
>>>
>>>