Creating a Custom PDF with Arguments Using Scipy Stats
Hi all- I am trying to create a custom PDF with arguments for some statistical analysis I am doing. I want to have a power law with an exponential cutoff at low values: mathematically, this looks like P(m) = N * m**(-index) * (1 - exp((-m**2)/m0**2) with three parameters: a normalization (N), an index (index), and a scale value (m0). I have tried to produce some code that creates this pdf and while the pdf and cdf functions appear to be working as I hoped, when I try to generate random values they don't follow this distribution at all, and I am not sure why. It seems to be far more heavily drawing from the low end of the curve (which are explicitly suppressed in the pdf) Can you suggest the changes I need to make to try and figure out why my code isn't working as planned? My code snippet is below: from scipy import stats import numpy as np from matplotlib import pyplot as plt class my_pdf(stats.rv_continuous): def _pdf(self,x,norm,index,m0): return norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) ) my_cv= my_pdf(a=1e-6,b=1e-2,name="Test") rvs=my_cv.rvs(1.0,1.7,1e-3,size=1000) #Values should be peaked around 1e-3, but histogram peaks around 1e-5... I am really having a hard time figuring out how to create a robust and complex custom rv_continuous in scipy based on scipy, so any help you can provide would be most appreciated. Cheers, Steve
08.06.2016 13:50 пользователь "Steven Ehlert" <cosmicsteve@gmail.com> написал:
Hi all-
I am trying to create a custom PDF with arguments for some statistical
analysis I am doing. I want to have a power law with an exponential cutoff at low values: mathematically, this looks like
P(m) = N * m**(-index) * (1 - exp((-m**2)/m0**2) with three parameters:
a normalization (N), an index (index), and a scale value (m0).
I have tried to produce some code that creates this pdf and while the pdf
and cdf functions appear to be working as I hoped, when I try to generate random values they don't follow this distribution at all, and I am not sure why. It seems to be far more heavily drawing from the low end of the curve (which are explicitly suppressed in the pdf)
Can you suggest the changes I need to make to try and figure out why my
code isn't working as planned? My code snippet is below:
from scipy import stats import numpy as np from matplotlib import pyplot as plt
class my_pdf(stats.rv_continuous): def _pdf(self,x,norm,index,m0): return norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) )
my_cv= my_pdf(a=1e-6,b=1e-2,name="Test")
rvs=my_cv.rvs(1.0,1.7,1e-3,size=1000) #Values should be peaked around
1e-3, but histogram peaks around 1e-5...
I am really having a hard time figuring out how to create a robust and
complex custom rv_continuous in scipy based on scipy, so any help you can provide would be most appreciated.
Cheers,
Steve
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
First and foremost, you need to fix the normalization: the norm constant is not a free parameter, as the pdf must integrate to unity over the support of your RV. Next, the scale parameter is handled by the framework: pdf, cdf methods all accept the `scale` argument. Thus you do not need the m0 parameter. HTH, Evgeni
Okay- Thank you for the insights. So three questions about this, as I am not an expert in the implementation. I figured that pdf normalization may play a role First of all, what is the best way to numerically handle the normalization? Do I need to include the integral every time I initialize the pdf? Something like this: class my_pdf(stats.rv_continuous): def _pdf(self,x,index,m0): step=(self.b-self.a)/100. this_xrange=np.arange(self.a,self.b+step,step) this_function=this_xrange**(-1*index) * (1. -np.exp((-1*this_xrange**2)/m0**2) ) #for x,f in zip(this_xrange,this_function): print x,f this_norm=1./(integrate.simps(this_function,x=this_xrange)) #print this_norm return this_norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) ) I don't think there is an analytic form for this integral, so I have to do it numerically, but I also have to implement it with the pdf. And the second question I have is: how do I call the scale variable within the _pdf function? Would I simply put in self.scale in place of m0 in the code above? Or is it more sophisticated than that? Finally, while I realized these concerns when I wrote it, but I didn't expect the pdf to look okay but the rvs to be so off- are there reasons to expect this behavior based on my previously incorrect normalization? It is not obvious to me that it would. Thank you for your help thus far and I look forward to learning more about custom statistical distributions in SciPy. Steven Ehlert On Wed, Jun 8, 2016 at 5:59 AM, Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
08.06.2016 13:50 пользователь "Steven Ehlert" <cosmicsteve@gmail.com> написал:
Hi all-
I am trying to create a custom PDF with arguments for some statistical
analysis I am doing. I want to have a power law with an exponential cutoff at low values: mathematically, this looks like
P(m) = N * m**(-index) * (1 - exp((-m**2)/m0**2) with three parameters:
a normalization (N), an index (index), and a scale value (m0).
I have tried to produce some code that creates this pdf and while the
pdf and cdf functions appear to be working as I hoped, when I try to generate random values they don't follow this distribution at all, and I am not sure why. It seems to be far more heavily drawing from the low end of the curve (which are explicitly suppressed in the pdf)
Can you suggest the changes I need to make to try and figure out why my
code isn't working as planned? My code snippet is below:
from scipy import stats import numpy as np from matplotlib import pyplot as plt
class my_pdf(stats.rv_continuous): def _pdf(self,x,norm,index,m0): return norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) )
my_cv= my_pdf(a=1e-6,b=1e-2,name="Test")
rvs=my_cv.rvs(1.0,1.7,1e-3,size=1000) #Values should be peaked around
1e-3, but histogram peaks around 1e-5...
I am really having a hard time figuring out how to create a robust and
complex custom rv_continuous in scipy based on scipy, so any help you can provide would be most appreciated.
Cheers,
Steve
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
First and foremost, you need to fix the normalization: the norm constant is not a free parameter, as the pdf must integrate to unity over the support of your RV.
Next, the scale parameter is handled by the framework: pdf, cdf methods all accept the `scale` argument. Thus you do not need the m0 parameter.
HTH,
Evgeni
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
On Wed, Jun 8, 2016 at 1:31 PM, Steven Ehlert <cosmicsteve@gmail.com> wrote:
Okay-
Thank you for the insights.
So three questions about this, as I am not an expert in the implementation. I figured that pdf normalization may play a role
First of all, what is the best way to numerically handle the normalization? Do I need to include the integral every time I initialize the pdf? Something like this:
class my_pdf(stats.rv_continuous): def _pdf(self,x,index,m0): step=(self.b-self.a)/100.
this_xrange=np.arange(self.a,self.b+step,step)
this_function=this_xrange**(-1*index) * (1. -np.exp((-1*this_xrange**2)/m0**2) )
#for x,f in zip(this_xrange,this_function): print x,f this_norm=1./(integrate.simps(this_function,x=this_xrange)) #print this_norm
return this_norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) )
Yes, this would need to compute it at each call to pdf. The integral can be expressed in terms of gamma functions (do a substitution of y = x**2 /2 in the integral). Alternatively, you can do the integral numerically, likely `quad` is better than `simps`.
I don't think there is an analytic form for this integral, so I have to do it numerically, but I also have to implement it with the pdf.
And the second question I have is: how do I call the scale variable within the _pdf function? Would I simply put in self.scale in place of m0 in the code above? Or is it more sophisticated than that?
It's easy: you don't need to :-). distribution.pdf(x, loc=11, scale=42) calls distribution._pdf(y) with y = (x - loc) / scale. This way, you `_pdf` only needs to deal with a scaled variable. There is no self.scale --- it's just a parameter to any call to pdf, cdf, sf etc.
Finally, while I realized these concerns when I wrote it, but I didn't expect the pdf to look okay but the rvs to be so off- are there reasons to expect this behavior based on my previously incorrect normalization? It is not obvious to me that it would.
PDF needs to integrate to one. If the normalization is off, it still might look "okay" because the shape is the same, but it's still wrong. Drawing random variates assumes that the cdf is a real distribution function. From experience, having ok-looking PDF and nonsense for the random variates is a typical sign that the normalization might be off.
Thank you for your help thus far and I look forward to learning more about custom statistical distributions in SciPy.
Steven Ehlert
On Wed, Jun 8, 2016 at 5:59 AM, Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
08.06.2016 13:50 пользователь "Steven Ehlert" <cosmicsteve@gmail.com> написал:
Hi all-
I am trying to create a custom PDF with arguments for some statistical analysis I am doing. I want to have a power law with an exponential cutoff at low values: mathematically, this looks like
P(m) = N * m**(-index) * (1 - exp((-m**2)/m0**2) with three parameters: a normalization (N), an index (index), and a scale value (m0).
I have tried to produce some code that creates this pdf and while the pdf and cdf functions appear to be working as I hoped, when I try to generate random values they don't follow this distribution at all, and I am not sure why. It seems to be far more heavily drawing from the low end of the curve (which are explicitly suppressed in the pdf)
Can you suggest the changes I need to make to try and figure out why my code isn't working as planned? My code snippet is below:
from scipy import stats import numpy as np from matplotlib import pyplot as plt
class my_pdf(stats.rv_continuous): def _pdf(self,x,norm,index,m0): return norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) )
my_cv= my_pdf(a=1e-6,b=1e-2,name="Test")
rvs=my_cv.rvs(1.0,1.7,1e-3,size=1000) #Values should be peaked around 1e-3, but histogram peaks around 1e-5...
I am really having a hard time figuring out how to create a robust and complex custom rv_continuous in scipy based on scipy, so any help you can provide would be most appreciated.
Cheers,
Steve
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
First and foremost, you need to fix the normalization: the norm constant is not a free parameter, as the pdf must integrate to unity over the support of your RV.
Next, the scale parameter is handled by the framework: pdf, cdf methods all accept the `scale` argument. Thus you do not need the m0 parameter.
HTH,
Evgeni
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
Explicitly defining the cdf would probably help too. If it is analytically calculable, defining the inverse cdf would almost definitely help since that is how the random numbers are being generated. https://en.wikipedia.org/wiki/Inverse_transform_sampling What you have now only defines the pdf. The base class automagically integrates to get cdf, and might do some kind of numerical parameter optimization to get to the inverse cdf (not sure about that). If nothing else, defining the cdf/inverse cdf would make things faster. On Wed, Jun 8, 2016 at 8:46 AM Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
On Wed, Jun 8, 2016 at 1:31 PM, Steven Ehlert <cosmicsteve@gmail.com> wrote:
Okay-
Thank you for the insights.
So three questions about this, as I am not an expert in the implementation. I figured that pdf normalization may play a role
First of all, what is the best way to numerically handle the normalization? Do I need to include the integral every time I initialize the pdf? Something like this:
class my_pdf(stats.rv_continuous): def _pdf(self,x,index,m0): step=(self.b-self.a)/100.
this_xrange=np.arange(self.a,self.b+step,step)
this_function=this_xrange**(-1*index) * (1. -np.exp((-1*this_xrange**2)/m0**2) )
#for x,f in zip(this_xrange,this_function): print x,f this_norm=1./(integrate.simps(this_function,x=this_xrange)) #print this_norm
return this_norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) )
Yes, this would need to compute it at each call to pdf. The integral can be expressed in terms of gamma functions (do a substitution of y = x**2 /2 in the integral). Alternatively, you can do the integral numerically, likely `quad` is better than `simps`.
I don't think there is an analytic form for this integral, so I have to do it numerically, but I also have to implement it with the pdf.
And the second question I have is: how do I call the scale variable within the _pdf function? Would I simply put in self.scale in place of m0 in the code above? Or is it more sophisticated than that?
It's easy: you don't need to :-). distribution.pdf(x, loc=11, scale=42) calls distribution._pdf(y) with y = (x - loc) / scale.
This way, you `_pdf` only needs to deal with a scaled variable. There is no self.scale --- it's just a parameter to any call to pdf, cdf, sf etc.
Finally, while I realized these concerns when I wrote it, but I didn't expect the pdf to look okay but the rvs to be so off- are there reasons to expect this behavior based on my previously incorrect normalization? It is not obvious to me that it would.
PDF needs to integrate to one. If the normalization is off, it still might look "okay" because the shape is the same, but it's still wrong. Drawing random variates assumes that the cdf is a real distribution function. From experience, having ok-looking PDF and nonsense for the random variates is a typical sign that the normalization might be off.
Thank you for your help thus far and I look forward to learning more about custom statistical distributions in SciPy.
Steven Ehlert
On Wed, Jun 8, 2016 at 5:59 AM, Evgeni Burovski < evgeny.burovskiy@gmail.com> wrote:
08.06.2016 13:50 пользователь "Steven Ehlert" <cosmicsteve@gmail.com> написал:
Hi all-
I am trying to create a custom PDF with arguments for some statistical analysis I am doing. I want to have a power law with an exponential
at low values: mathematically, this looks like
P(m) = N * m**(-index) * (1 - exp((-m**2)/m0**2) with three
cutoff parameters:
a normalization (N), an index (index), and a scale value (m0).
I have tried to produce some code that creates this pdf and while the pdf and cdf functions appear to be working as I hoped, when I try to generate random values they don't follow this distribution at all, and I am not sure why. It seems to be far more heavily drawing from the low end of the curve (which are explicitly suppressed in the pdf)
Can you suggest the changes I need to make to try and figure out why my code isn't working as planned? My code snippet is below:
from scipy import stats import numpy as np from matplotlib import pyplot as plt
class my_pdf(stats.rv_continuous): def _pdf(self,x,norm,index,m0): return norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) )
my_cv= my_pdf(a=1e-6,b=1e-2,name="Test")
rvs=my_cv.rvs(1.0,1.7,1e-3,size=1000) #Values should be peaked around 1e-3, but histogram peaks around 1e-5...
I am really having a hard time figuring out how to create a robust and complex custom rv_continuous in scipy based on scipy, so any help you can provide would be most appreciated.
Cheers,
Steve
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
First and foremost, you need to fix the normalization: the norm constant is not a free parameter, as the pdf must integrate to unity over the support of your RV.
Next, the scale parameter is handled by the framework: pdf, cdf methods all accept the `scale` argument. Thus you do not need the m0 parameter.
HTH,
Evgeni
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
-- Kevin Gullikson Data Scientist Spark Cognition
Hi all- After doing some math, I was able to figure out how to normalize this pdf in a meaningful way and get the right math packages to interpret it for me. A new question: how do I properly call this within the context of a KS-test? What I want to start doing is comparing histograms of data to this pdf and see if they make sense. I am guessing it would be something like stats.kstest(stuffsample, stuff) where stuff=observed_stuff(name="Stuff",a=1e-6,b=100) but I am not sure how to make the call appropriate. The one I have above does not work, and I know I need to send off a few arguments. The new class with the correct math in place is below. Thanks for your help! class observed_stuff(stats.rv_continuous): def _pdf(self, x, massindex,complimit): dndm=(x)**(-1.*massindex) complete= 1. - np.exp(-1.*x**2/complimit**2) self.massindex=massindex self.complimit=complimit norm=self.integral(self.b)-self.integral(self.a) #print "Norm ", norm return 1./norm * dndm * complete def _cdf(self, xval,massindex,complimit): self.massindex=massindex self.complimit=complimit norm=self.integral(self.b)-self.integral(self.a) return 1./norm * (self.integral(xval)-self.integral(self.a)) def integral(self,xval): usq=xval**2/self.complimit**2 mi=self.massindex arg1= (1.-mi)/2. #We have to use mpmath for the incomplete gamma function that arises in this integral...otherwise the values are all screwy. This try-except snippet ensures the output is in the right format for scipy stats try: gammaval = [float(mpmath.gammainc(a,u)) for a,u in zip(arg1,usq)] gammaval=np.array(gammaval) except: gammaval=float(mpmath.gammainc(arg1,usq)) numerator= xval**(1-mi) * ( -2. * np.sqrt(usq) + (mi-1.)* (usq)**(mi/2.) * gammaval) denominator=2.*(mi-1.)*np.sqrt(usq) return numerator/denominator On Wed, Jun 8, 2016 at 8:54 AM, Kevin Gullikson <kevin.gullikson@gmail.com> wrote:
Explicitly defining the cdf would probably help too. If it is analytically calculable, defining the inverse cdf would almost definitely help since that is how the random numbers are being generated. https://en.wikipedia.org/wiki/Inverse_transform_sampling
What you have now only defines the pdf. The base class automagically integrates to get cdf, and might do some kind of numerical parameter optimization to get to the inverse cdf (not sure about that). If nothing else, defining the cdf/inverse cdf would make things faster.
On Wed, Jun 8, 2016 at 8:46 AM Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
On Wed, Jun 8, 2016 at 1:31 PM, Steven Ehlert <cosmicsteve@gmail.com> wrote:
Okay-
Thank you for the insights.
So three questions about this, as I am not an expert in the implementation. I figured that pdf normalization may play a role
First of all, what is the best way to numerically handle the normalization? Do I need to include the integral every time I initialize the pdf? Something like this:
class my_pdf(stats.rv_continuous): def _pdf(self,x,index,m0): step=(self.b-self.a)/100.
this_xrange=np.arange(self.a,self.b+step,step)
this_function=this_xrange**(-1*index) * (1. -np.exp((-1*this_xrange**2)/m0**2) )
#for x,f in zip(this_xrange,this_function): print x,f this_norm=1./(integrate.simps(this_function,x=this_xrange)) #print this_norm
return this_norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) )
Yes, this would need to compute it at each call to pdf. The integral can be expressed in terms of gamma functions (do a substitution of y = x**2 /2 in the integral). Alternatively, you can do the integral numerically, likely `quad` is better than `simps`.
I don't think there is an analytic form for this integral, so I have to do it numerically, but I also have to implement it with the pdf.
And the second question I have is: how do I call the scale variable within the _pdf function? Would I simply put in self.scale in place of m0 in the code above? Or is it more sophisticated than that?
It's easy: you don't need to :-). distribution.pdf(x, loc=11, scale=42) calls distribution._pdf(y) with y = (x - loc) / scale.
This way, you `_pdf` only needs to deal with a scaled variable. There is no self.scale --- it's just a parameter to any call to pdf, cdf, sf etc.
Finally, while I realized these concerns when I wrote it, but I didn't expect the pdf to look okay but the rvs to be so off- are there reasons to expect this behavior based on my previously incorrect normalization? It is not obvious to me that it would.
PDF needs to integrate to one. If the normalization is off, it still might look "okay" because the shape is the same, but it's still wrong. Drawing random variates assumes that the cdf is a real distribution function. From experience, having ok-looking PDF and nonsense for the random variates is a typical sign that the normalization might be off.
Thank you for your help thus far and I look forward to learning more about custom statistical distributions in SciPy.
Steven Ehlert
On Wed, Jun 8, 2016 at 5:59 AM, Evgeni Burovski < evgeny.burovskiy@gmail.com> wrote:
08.06.2016 13:50 пользователь "Steven Ehlert" <cosmicsteve@gmail.com> написал:
Hi all-
I am trying to create a custom PDF with arguments for some
analysis I am doing. I want to have a power law with an exponential cutoff at low values: mathematically, this looks like
P(m) = N * m**(-index) * (1 - exp((-m**2)/m0**2) with three
statistical parameters:
a normalization (N), an index (index), and a scale value (m0).
I have tried to produce some code that creates this pdf and while the pdf and cdf functions appear to be working as I hoped, when I try to generate random values they don't follow this distribution at all, and I am not sure why. It seems to be far more heavily drawing from the low end of the curve (which are explicitly suppressed in the pdf)
Can you suggest the changes I need to make to try and figure out why my code isn't working as planned? My code snippet is below:
from scipy import stats import numpy as np from matplotlib import pyplot as plt
class my_pdf(stats.rv_continuous): def _pdf(self,x,norm,index,m0): return norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) )
my_cv= my_pdf(a=1e-6,b=1e-2,name="Test")
rvs=my_cv.rvs(1.0,1.7,1e-3,size=1000) #Values should be peaked around 1e-3, but histogram peaks around 1e-5...
I am really having a hard time figuring out how to create a robust and complex custom rv_continuous in scipy based on scipy, so any help you can provide would be most appreciated.
Cheers,
Steve
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
First and foremost, you need to fix the normalization: the norm constant is not a free parameter, as the pdf must integrate to unity over the support of your RV.
Next, the scale parameter is handled by the framework: pdf, cdf methods all accept the `scale` argument. Thus you do not need the m0 parameter.
HTH,
Evgeni
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
-- Kevin Gullikson Data Scientist Spark Cognition
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-user
participants (3)
-
Evgeni Burovski -
Kevin Gullikson -
Steven Ehlert