stats.distributions.poisson loc parameter: is it wise?
All,
Consider the Poisson distribution in stats.distributions: it requires a mandatory argument, `mu`, as the mean/variance of the distribution. All is fine, but the `loc` parameter is still available, and that's my problem. When `loc` is not 0, the mean becomes `mu+loc` and `.cdf(range(loc))==0`, but the variance stays `mu`. That's a bit confusing.
I thought I could use `loc` as a way to control truncation, but that doesn't seem to work either: emulating zero-truncation by using `loc=1` gives a distribution with a mean of `mu+1` when it should be `mu/(1-exp(-mu))` (the exact expression for zero-truncation).
In short, I don't really see any advantage in having a location parameter for the Poisson distribution or, as a matter of fact, for any discrete distribution. I suggest we implement some mechanism to force `loc` to 0 while emitting a warning.
Any comments?
P.
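The behaviour described in this message can be checked with a short sketch. The values `mu = 3.0` and `loc = 1` and all variable names are illustrative assumptions, not taken from the thread; only `stats.poisson.stats` and `stats.poisson.cdf` are SciPy calls.

```python
import numpy as np
from scipy import stats

mu, loc = 3.0, 1  # illustrative values, not from the thread

# Mean and variance of the shifted distribution.
mean, var = stats.poisson.stats(mu, loc=loc, moments="mv")

# CDF evaluated at 0, ..., loc-1, i.e. below the shifted support.
cdf_below_loc = stats.poisson.cdf(np.arange(loc), mu, loc=loc)

# Closed-form mean of a zero-truncated Poisson, for comparison.
zero_truncated_mean = mu / (1.0 - np.exp(-mu))

print(float(mean), float(var), cdf_below_loc, zero_truncated_mean)
```

Under these assumptions the mean moves to `mu + loc` while the variance stays at `mu`, and the shifted mean does not coincide with the zero-truncated mean.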
Hi, I agree. Anything that makes the behavior of the distribution functions more intuitive is helpful, at least to me. BTW, I find the term loc already by itself very confusing---what does it actually mean? For instance,
Help on gamma_gen in module scipy.stats.distributions object ...
 |  cdf(self, x, *args, **kwds)
 |      Cumulative distribution function at x of the given RV.
 |
 |      Parameters
 |      ----------
 |      x : array-like
 |          quantiles
 |      arg1, arg2, arg3,... : array-like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array-like, optional
 |          location parameter (default=0)
 |      scale : array-like, optional
 |          scale parameter (default=1)
I am inclined to characterize the gamma distribution by means of n (the number of stages, if one is used to the Erlang distribution) and the rate parameter lambda, say, and I am clueless as to the meaning of scale and location here. Actually, I am not alone in this; see for instance: http://www.johndcook.com/blog/2009/07/20/probability-distributions-scipy/
Of course, this is not to say that I am not happy with the distribution package. It makes me a happier man every day :-)
Nicky
2009/8/6 Pierre GM <pgmdevlist@gmail.com>:
All,
Consider the Poisson distribution in stats.distributions: it requires a mandatory argument, `mu`, as the mean/variance of the distribution. All is fine, but the `loc` parameter is still available, and that's my problem. When `loc` is not 0, the mean becomes `mu+loc` and `.cdf(range(loc))==0`, but the variance stays `mu`. That's a bit confusing.
I thought I could use `loc` as a way to control truncation, but that doesn't seem to work either: emulating zero-truncation by using `loc=1` gives a distribution with a mean of `mu+1` when it should be `mu/(1-exp(-mu))` (the exact expression for zero-truncation).
In short, I don't really see any advantage in having a location parameter for the Poisson distribution or, as a matter of fact, for any discrete distribution. I suggest we implement some mechanism to force `loc` to 0 while emitting a warning.
Any comments?
P.
_______________________________________________ Scipy-dev mailing list Scipy-dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
On Thu, Aug 6, 2009 at 16:21, nicky van foreest<vanforeest@gmail.com> wrote:
I am inclined to characterize the gamma distribution by means of n (the number of stages, if one is used to the Erlang distribution) and the rate parameter lambda, say, and I am clueless as to the meaning of scale and location here.
Every probability distribution can be generalized to accept a location and scale parameter even if their standard treatments do not.
pdf(x; loc,scale) -> pdf((x-loc)/scale)/scale
The other related functions transform in easily derivable ways. This is covered at the top of the document scipy/stats/continuous.lyx in the source distribution. The reason we do this is partly generality and mostly convenience of implementation; all of the distributions can share the shifting and scaling code instead of implementing it separately.
I once floated the idea of removing this for the distributions whose standard definitions do not include such parameters, specifically gamma. However, there was an objection from someone who apparently has used a "shifted gamma" distribution to model sunspot radii where loc>0, if I remember correctly, so I dropped my proposal.
If you don't need to use them, don't. If you want to prevent confusion, help port the LyX documentation into the main documentation.
-- Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
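A minimal sketch of the transformation quoted above, applied to the gamma distribution. The numbers and variable names are illustrative assumptions. The second check relates `scale` to the rate parameter mentioned earlier in the thread, using the standard identity that a gamma with integer shape n and scale 1/lambda is an Erlang(n, lambda).

```python
from math import factorial

import numpy as np
from scipy import stats

# Illustrative values: shape a, shift loc, and scale (rate = 1/scale).
a, loc, scale = 3.0, 2.0, 0.5
x = np.linspace(2.1, 8.0, 25)  # points inside the shifted support x > loc

# pdf(x; loc, scale) versus pdf((x - loc)/scale)/scale, computed by hand.
with_keywords = stats.gamma.pdf(x, a, loc=loc, scale=scale)
by_hand = stats.gamma.pdf((x - loc) / scale, a) / scale
transform_holds = np.allclose(with_keywords, by_hand)

# Erlang(n, lam) density written out explicitly, compared with
# gamma(shape=n, scale=1/lam) and no shift.
n, lam = 3, 2.0
y = np.linspace(0.1, 5.0, 20)
erlang_pdf = lam**n * y**(n - 1) * np.exp(-lam * y) / factorial(n - 1)
erlang_matches = np.allclose(stats.gamma.pdf(y, n, scale=1.0 / lam), erlang_pdf)

print(transform_holds, erlang_matches)
```

Read this way, `scale` plays the role of 1/lambda and `loc` only slides the whole density along the axis.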
On Aug 6, 2009, at 5:34 PM, Robert Kern wrote:
Every probability distribution can be generalized to accept a location and scale parameter even if their standard treatments do not.
pdf(x; loc,scale) -> pdf((x-loc)/scale)/scale
Agreed, as long as we are talking about *continuous* distributions. The behavior is quite different for *discrete* distributions. Even if the scale is simply discarded already, using a location will probably NOT give the expected result.
On Thu, Aug 6, 2009 at 16:43, Pierre GM<pgmdevlist@gmail.com> wrote:
Agreed, as long as we are talking about *continuous* distributions. The behavior is quite different for *discrete* distributions. Even if the scale is simply discarded already, using a location will probably NOT give the expected result.
It depends on what your expectations are. For the discrete distributions, all the loc parameter means is this, as documented:
pmf(x; loc) -> pmf(x-loc)
That's it. I don't know why you would expect anything else.
-- Robert Kern
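The rule pmf(x; loc) -> pmf(x-loc) can be verified directly. This is a small sketch with made-up values for `mu` and `loc`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

mu, loc = 5.0, 2  # illustrative values
k = np.arange(0, 15)

# pmf evaluated with the loc keyword ...
shifted_pmf = stats.poisson.pmf(k, mu, loc=loc)

# ... versus the unshifted pmf evaluated at k - loc.
moved_by_hand = stats.poisson.pmf(k - loc, mu)

same = np.allclose(shifted_pmf, moved_by_hand)
print(same, shifted_pmf[:loc])
```

The probabilities themselves are unchanged; each one is simply attached to an integer `loc` steps further along, so the points below `loc` carry no mass.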
On Aug 6, 2009, at 5:49 PM, Robert Kern wrote:
It depends on what your expectations are. For the discrete distributions, all the loc parameter means is this, as documented:
pmf(x; loc) -> pmf(x-loc)
That's it. I don't know why you would expect anything else.
Because using a location parameter, you change the support domain. Back to the example of a Poisson distribution with loc=1, the support domain is now x>=1, which amounts to truncating the zeroes. The mean of a zero-truncated Poisson with parameter pr should be pr/(1-exp(-pr)), but we end up with pr+1. Not the expected result.
I think it's a source of confusion to keep a location parameter for discrete distributions. It would be worth implementing a method to allow truncation, but a loc parameter alone doesn't do it.
On Thu, Aug 6, 2009 at 17:02, Pierre GM<pgmdevlist@gmail.com> wrote:
Because using a location parameter, you change the support domain. Back to the example of a Poisson distribution with loc=1, the support domain is now x>=1, which amounts to truncating the zeroes.
I don't understand why you go through all of these contortions. It does not amount to truncation at all. It just shifts the distribution.
The mean of a zero-truncated Poisson with parameter pr should be pr/(1-exp(-pr)), but we end up with pr+1. Not the expected result.
Because you are expecting that the operation is equivalent to something that it is not.
pmf(x; loc) -> pmf(x-loc)
Nothing more. It is definitely *not* the same thing as setting all x<loc to 0 and renormalizing.
-- Robert Kern
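To make the distinction between shifting and truncating concrete, here is a sketch that builds both distributions on a finite grid. The grid size, `mu = 3.0`, and the variable names are illustrative assumptions; the truncated pmf is constructed by hand, since the thread does not point to a ready-made truncation method.

```python
import numpy as np
from scipy import stats

mu = 3.0
k = np.arange(0, 200)  # the tail beyond 200 is negligible for mu = 3

# Shift: pmf(k; loc=1) = pmf(k - 1). All mass moves one step to the right.
shift_pmf = stats.poisson.pmf(k, mu, loc=1)
shift_mean = np.sum(k * shift_pmf)

# Zero-truncation: drop the mass at k = 0, then renormalize what remains.
base_pmf = stats.poisson.pmf(k, mu)
trunc_pmf = np.where(k >= 1, base_pmf, 0.0) / (1.0 - base_pmf[0])
trunc_mean = np.sum(k * trunc_pmf)

print(shift_mean, trunc_mean, mu / (1.0 - np.exp(-mu)))
```

Both distributions live on x >= 1, but only the renormalized one reproduces the pr/(1-exp(-pr)) mean; the shifted one gives pr+1.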
On Thu, Aug 6, 2009 at 6:02 PM, Pierre GM<pgmdevlist@gmail.com> wrote:
Because using a location parameter, you change the support domain. Back to the example of a Poisson distribution with loc=1, the support domain is now x>=1, which amounts to truncating the zeroes. The mean of a zero-truncated Poisson with parameter pr should be pr/(1-exp(-pr)), but we end up with pr+1. Not the expected result.
I think it's a source of confusion to keep a location parameter for discrete distributions. It would be worth implementing a method to allow truncation, but a loc parameter alone doesn't do it.
loc just shifts the distribution on the real/integer line. Except for the fit method (which doesn't exist for discrete distributions), I don't see any real disadvantage to having loc in there as an option, but I guess in many cases it won't be very useful either. I think there are also discrete distributions with unbounded support (+/- inf) for which a loc shift would make sense.
The big advantage of the current setup, as Robert said, is consistency, both in the implementation and in code that goes over all (or a large set of) distribution(s).
But for a long time, I have been all in favor of "fixing" the fit method, and possibly introducing a semi-frozen distribution class, but for this I don't see why we should special-case location. Fixing loc is the main use case, but, for example, estimation with the scale parameter fixed is also a common use case.
Josef
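As a hedged illustration of estimation with some parameters held fixed: SciPy releases later than this thread accept `floc` and `fscale` keywords in the `fit` method of continuous distributions. The sketch below relies on those keywords; the sample data, seed, and variable names are made up for illustration and are not part of the thread.

```python
import numpy as np
from scipy import stats

# Synthetic sample from a gamma with shape 3, loc 0, scale 2 (illustrative).
data = stats.gamma.rvs(3.0, loc=0.0, scale=2.0, size=500, random_state=12345)

# Free fit: shape, loc and scale are all estimated, so the support may shift.
a_free, loc_free, scale_free = stats.gamma.fit(data)

# Location held at 0: only shape and scale are estimated.
a_fixed_loc, loc_fixed, scale_est = stats.gamma.fit(data, floc=0)

# Scale held at 2 instead: shape and loc are estimated.
a_fixed_scale, loc_est, scale_fixed = stats.gamma.fit(data, fscale=2.0)

print(loc_free, loc_fixed, scale_fixed)
```

Holding `loc` fixed keeps the fitted support where the model says it should be, which is the main use case named above; holding `scale` fixed covers the other one.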
On Thu, Aug 6, 2009 at 17:16, <josef.pktd@gmail.com> wrote:
But for a long time, I have been all in favor of "fixing" the fit method,
I don't think anyone's *against* fixing the fit method. No one's found the time or motivation to actually do it, though.
-- Robert Kern
On Thu, Aug 6, 2009 at 17:02, Pierre GM<pgmdevlist@gmail.com> wrote:
Because using a location parameter, you change the support domain.
It should be noted that the location parameter changes the support domain *as a consequence* of the above transformation. Changing the support domain (and holding everything else fixed) is not the defining characteristic of the location parameter.
-- Robert Kern
On Aug 6, 2009, at 6:21 PM, Robert Kern wrote:
It should be noted that the location parameter changes the support domain *as a consequence* of the above transformation. Changing the support domain (and holding everything else fixed) is not the defining characteristic of the location parameter.
Got the point. I'll make a mental note to mention that in the docs.
I'm switching to "meh" mode: I still think that allowing the shift can cause trouble for the user, and I'd be in favor of modifying _fix_loc_scale or something like that to force loc=0 on discrete distributions with support on positive integers, but I'll certainly not lose any sleep over that...
In any case, thx a lot to y'all for your comments.
On Thu, Aug 6, 2009 at 6:37 PM, Pierre GM<pgmdevlist@gmail.com> wrote:
Got the point. I'll make a mental note to mention that in the docs.
I'm switching to "meh" mode: I still think that allowing the shift can cause trouble for the user, and I'd be in favor of modifying _fix_loc_scale or something like that to force loc=0 on discrete distributions with support on positive integers, but I'll certainly not lose any sleep over that... In any case, thx a lot to y'all for your comments.
I agree that loc for distributions with a finite upper or lower support bound is confusing, at least at the beginning. It took me a while to figure out why I got some strange results with some distributions when I ran a fit over all of them, until I realized that the support is shifted when loc is estimated. But I think this is mostly a documentation problem.
(I still have an unresolved problem with vonmises, which doesn't define its support points, but I don't know anything at all about circular distributions.)
Below is a prototype for a semi-frozen class, essentially an adapted version of the frozen class, that fixes only the location loc. (Copy and paste errors are still possible.)
However, this doesn't do anything different from the current implementation if you ignore the loc keyword. It also has the same uninformative signature, which could be improved. The only real advantage I see is when the fit method is adjusted to take some of the parameters as fixed.
Josef

import numpy as np
from scipy import stats

class rv_frozenloc(object):
    # Wraps a distribution so that every call uses the stored loc.
    # A loc passed by the caller is silently overwritten.

    def __init__(self, dist, loc=0):
        self.loc = loc
        self.dist = dist

    def pdf(self, x, *args, **kwds):
        kwds.update({'loc': self.loc})
        return self.dist.pdf(x, *args, **kwds)

    def cdf(self, x, *args, **kwds):
        kwds.update({'loc': self.loc})
        return self.dist.cdf(x, *args, **kwds)

    def ppf(self, q, *args, **kwds):
        kwds.update({'loc': self.loc})
        return self.dist.ppf(q, *args, **kwds)

    def isf(self, q, *args, **kwds):
        kwds.update({'loc': self.loc})
        return self.dist.isf(q, *args, **kwds)

    def rvs(self, *args, **kwds):
        # size is passed through as a keyword; shape parameters go in args
        kwds.update({'loc': self.loc})
        return self.dist.rvs(*args, **kwds)

    def sf(self, x, *args, **kwds):
        kwds.update({'loc': self.loc})
        return self.dist.sf(x, *args, **kwds)

    def stats(self, *args, **kwds):
        # moments is passed through as a keyword (the default is 'mv')
        kwds.update({'loc': self.loc})
        return self.dist.stats(*args, **kwds)

    def moment(self, n, *args, **kwds):
        kwds.update({'loc': self.loc})
        return self.dist.moment(n, *args, **kwds)

    def entropy(self, *args, **kwds):
        kwds.update({'loc': self.loc})
        return self.dist.entropy(*args, **kwds)

    def pmf(self, k, *args, **kwds):
        kwds.update({'loc': self.loc})
        return self.dist.pmf(k, *args, **kwds)

def freezeloc(dist, loc=0):
    return rv_frozenloc(dist, loc=loc)

poiss = freezeloc(stats.poisson)
print(poiss.pmf(np.arange(10), 5))
print(poiss.cdf(np.arange(10), 5))
print(poiss.cdf(np.arange(10), 5, loc=5))  # this ignores loc but doesn't raise a warning (yet)
print(stats.poisson.cdf(np.arange(10), 5, loc=5))
poiss5 = freezeloc(stats.poisson, loc=5)
print(poiss5.cdf(np.arange(10), 5))
norm0 = freezeloc(stats.norm, loc=1)
print(norm0.stats())
norm0.stats(loc=0)  # loc is ignored but doesn't raise a warning (yet)
print(norm0.stats(scale=np.sqrt(2)))
participants (4): josef.pktd@gmail.com, nicky van foreest, Pierre GM, Robert Kern