scipy.stats: algorithm for ticket 1493
Hi,

I saw Ralf's suggestions on issues of scipy.stats to work on. In another project I had to solve a similar problem to the one mentioned in http://projects.scipy.org/scipy/ticket/1493. I'd like to propose the algorithm below to tackle this. I am looking forward to your feedback. Once the algorithm seems to work impeccably, I'll try to convert it so that it can be incorporated in scipy.stats, and add test cases.

Let's first test the example of the ticket:

In [21]: from scipy.stats import invnorm
In [22]: from scipy import optimize
In [23]: invnorm.xb = 10
In [24]: sol = invnorm.ppf(0.8455, 7.24000019602, scale=2.51913630166)
In [25]: print sol
25.1878345282
In [26]: print invnorm.cdf(sol, 7.24000019602, scale=2.51913630166)
0.8455

Weird that this leads to a solution, as the solution is not in the search interval. Trying xb = 5 leads to the reported error. Now a solution:

# Proposed solution.
# A quintessential property used in the algorithm is that the cdf is a
# non-decreasing function.
def findppf(q):
    # search until xb is large enough
    left, right = invnorm.xa, invnorm.xb
    while invnorm.cdf(right, 7.24000019602, scale=2.51913630166) < q:
        left = right
        right *= 2
    return optimize.brentq(lambda x: \
        invnorm.cdf(x, 7.24000019602, scale=2.51913630166) - q, \
        left, right)

# Perhaps increasing with a larger factor than 2 is faster, as brentq
# converges very fast. Taking `right` too large is probably not a real
# problem. On the other hand, when `right` increases too fast, cdf(right)
# may become numerically equal to 1. So, what would be a good factor?

sol = findppf(0.8455)
print sol
print invnorm.cdf(sol, 7.24000019602, scale=2.51913630166)

This code (I ran it in a file, not in IPython) gives the right results, at least for these values.

Nicky
I just realized, xa may be too large... hence we should search such that cdf(left) < q < cdf(right). *Assuming* that xa < 0 and xb > 0, the following should be better:

def findppf(q):
    # search until cdf(left) < q < cdf(right)
    left, right = invnorm.xa, invnorm.xb
    while invnorm.cdf(left, 7.24000019602, scale=2.51913630166) > q:
        right = left
        left *= 2
    while invnorm.cdf(right, 7.24000019602, scale=2.51913630166) < q:
        left = right
        right *= 2
    return optimize.brentq(lambda x: \
        invnorm.cdf(x, 7.24000019602, scale=2.51913630166) - q, \
        left, right)

Should a test on xa < 0 and xb > 0 be added?
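[Editor's note] Nicky's two-sided expansion can be run against current scipy with only name changes: invnorm has since been renamed invgauss, with shape parameter mu. The xa/xb defaults below are hypothetical stand-ins for the class attributes, which are not part of the public API; this is a sketch, not the eventual scipy.stats implementation.

```python
from scipy import stats, optimize

def findppf(dist, q, *args, xa=-10.0, xb=10.0):
    # Expand the bracket until cdf(left) < q < cdf(right); this relies
    # on the cdf being non-decreasing.  xa/xb mimic the class attributes.
    left, right = xa, xb
    while dist.cdf(left, *args) > q:
        right = left
        left *= 2
    while dist.cdf(right, *args) < q:
        left = right
        right *= 2
    return optimize.brentq(lambda x: dist.cdf(x, *args) - q, left, right)

# The ticket's example, on the standard distribution (loc=0, scale=1):
mu = 7.24000019602
sol = findppf(stats.invgauss, 0.8455, mu)
print(sol, stats.invgauss.cdf(sol, mu))
```

The roundtrip cdf(findppf(q)) should reproduce q to brentq's tolerance.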
On Sun, Apr 22, 2012 at 2:42 PM, nicky van foreest <vanforeest@gmail.com> wrote:
I just realized, xa may be too large... hence we should search such that cdf(left) < q < cdf(right).
*Assuming* that xa < 0 and xb > 0 the following should be better
def findppf(q):
    # search until cdf(left) < q < cdf(right)
    left, right = invnorm.xa, invnorm.xb
    while invnorm.cdf(left, 7.24000019602, scale=2.51913630166) > q:
        right = left
        left *= 2
    while invnorm.cdf(right, 7.24000019602, scale=2.51913630166) < q:
        left = right
        right *= 2
    return optimize.brentq(lambda x: \
        invnorm.cdf(x, 7.24000019602, scale=2.51913630166) - q, \
        left, right)
Should a test on xa < 0 and xb>0 be added?
for xa, xb it doesn't matter whether they are larger or smaller than zero, so I don't think we need a special check

it looks good in a few more example cases.

The difficult cases will be where cdf also doesn't exist and we need to get it through integrate.quad, but I don't remember which distribution is a good case. There is a testcase in the test suite, where I tried to roundtrip close to the 0, 1 boundary before running into failures with some distributions https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous...

to try out how well your solution works, the roundtrip could be done with, for example, q = [1e-8, 1-1e-8] and see at which distribution it breaks and why (if any)

Note: I removed the scale in your example, because the internal _ppf works on the standard distribution, loc=0, scale=1. loc and scale are added generically in .ppf

Thanks,

Josef

from scipy import stats, optimize

def findppf(dist, q, *args):
    # search until cdf(left) < q < cdf(right)
    left, right = dist.xa, dist.xb
    counter = 0
    while dist.cdf(left, *args) > q:
        right = left
        left *= 2
        counter += 1
        print counter, left, right
    while dist.cdf(right, *args) < q:
        left = right
        right *= 2
        counter += 1
        print counter, left, right
    return optimize.brentq(lambda x: dist.cdf(x, *args) - q, left, right)

print
print 'invgauss'
s = 7.24000019602
sol = findppf(stats.invgauss, 0.8455, s)
print sol
sol = findppf(stats.invgauss, 1-1e-8, s)
print 'roundtrip', 1-1e-8, sol, stats.invgauss.cdf(sol, s)
print 1e-30, stats.invgauss.cdf(findppf(stats.invgauss, 1e-30, s), s)

print '\nt'
print findppf(stats.t, 1-1e-8, s), stats.t.ppf(1-1e-8, s)
print findppf(stats.t, 1e-8, s), stats.t.ppf(1e-8, s)

print '\ncauchy'
print findppf(stats.cauchy, 1e-8), stats.cauchy.ppf(1e-8)

print '\nf'
print findppf(stats.f, 1-1e-8, 2, 10), stats.f.ppf(1-1e-8, 2, 10)
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
On Sun, Apr 22, 2012 at 8:27 PM, <josef.pktd@gmail.com> wrote:
On Sun, Apr 22, 2012 at 2:42 PM, nicky van foreest <vanforeest@gmail.com> wrote:
I just realized, xa may be too large... hence we should search such that cdf(left) < q < cdf(right).
*Assuming* that xa < 0 and xb > 0 the following should be better
def findppf(q):
    # search until cdf(left) < q < cdf(right)
    left, right = invnorm.xa, invnorm.xb
    while invnorm.cdf(left, 7.24000019602, scale=2.51913630166) > q:
        right = left
        left *= 2
    while invnorm.cdf(right, 7.24000019602, scale=2.51913630166) < q:
        left = right
        right *= 2
    return optimize.brentq(lambda x: \
        invnorm.cdf(x, 7.24000019602, scale=2.51913630166) - q, \
        left, right)
Should a test on xa < 0 and xb>0 be added?
for xa, xb it doesn't matter whether they are larger or smaller than zero, so I don't think we need a special check
it looks good in a few more example cases.
The difficult cases will be where cdf also doesn't exist and we need to get it through integrate.quad, but I don't remember which distribution is a good case. There is a testcase in the test suite, where I tried to roundtrip close to the 0, 1 boundary before running into failures with some distributions https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous...
to try out how well your solution works, the roundtrip could be done with, for example, q = [1e-8, 1-1e-8] and see at which distribution it breaks and why (if any)
I might not have been clear. I think the patch is good, and a good improvement over the current situation with fixed xa, xb. I just think that we are not able to reach the q=0, q=1 boundaries, since for some distributions we will run into other numerical problems. And I'm curious how far we can get with this.

I don't have much of an opinion about the factor, times 2 or larger. Similarly, I don't know whether the default xa and xb are good. I changed them for a few distributions, but only where I saw obvious improvements. (There is a similar expansion of the trial space in the discrete distributions where I also was just guessing how fast to go and when to stop.)

Josef
Note: I removed the scale in your example, because internal _ppf works on the standard distribution, loc=0, scale=1. loc and scale are added generically in .ppf
Thanks,
Josef
from scipy import stats, optimize
def findppf(dist, q, *args):
    # search until cdf(left) < q < cdf(right)
    left, right = dist.xa, dist.xb
    counter = 0
    while dist.cdf(left, *args) > q:
        right = left
        left *= 2
        counter += 1
        print counter, left, right
    while dist.cdf(right, *args) < q:
        left = right
        right *= 2
        counter += 1
        print counter, left, right
    return optimize.brentq(lambda x: dist.cdf(x, *args) - q, left, right)

print
print 'invgauss'
s = 7.24000019602
sol = findppf(stats.invgauss, 0.8455, s)
print sol
sol = findppf(stats.invgauss, 1-1e-8, s)
print 'roundtrip', 1-1e-8, sol, stats.invgauss.cdf(sol, s)
print 1e-30, stats.invgauss.cdf(findppf(stats.invgauss, 1e-30, s), s)

print '\nt'
print findppf(stats.t, 1-1e-8, s), stats.t.ppf(1-1e-8, s)
print findppf(stats.t, 1e-8, s), stats.t.ppf(1e-8, s)

print '\ncauchy'
print findppf(stats.cauchy, 1e-8), stats.cauchy.ppf(1e-8)

print '\nf'
print findppf(stats.f, 1-1e-8, 2, 10), stats.f.ppf(1-1e-8, 2, 10)
On Sun, Apr 22, 2012 at 8:44 PM, <josef.pktd@gmail.com> wrote:
On Sun, Apr 22, 2012 at 8:27 PM, <josef.pktd@gmail.com> wrote:
On Sun, Apr 22, 2012 at 2:42 PM, nicky van foreest <vanforeest@gmail.com> wrote:
I just realized, xa may be too large... hence we should search such that cdf(left) < q < cdf(right).
*Assuming* that xa < 0 and xb > 0 the following should be better
def findppf(q):
    # search until cdf(left) < q < cdf(right)
    left, right = invnorm.xa, invnorm.xb
    while invnorm.cdf(left, 7.24000019602, scale=2.51913630166) > q:
        right = left
        left *= 2
    while invnorm.cdf(right, 7.24000019602, scale=2.51913630166) < q:
        left = right
        right *= 2
    return optimize.brentq(lambda x: \
        invnorm.cdf(x, 7.24000019602, scale=2.51913630166) - q, \
        left, right)
Should a test on xa < 0 and xb>0 be added?
for xa, xb it doesn't matter whether they are larger or smaller than zero, so I don't think we need a special check
it looks good in a few more example cases.
The difficult cases will be where cdf also doesn't exist and we need to get it through integrate.quad, but I don't remember which distribution is a good case. There is a testcase in the test suite, where I tried to roundtrip close to the 0, 1 boundary before running into failures with some distributions https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous...
to try out how well your solution works, the roundtrip could be done with, for example, q = [1e-8, 1-1e-8] and see at which distribution it breaks and why (if any)
I might not have been clear. I think the patch is good, and a good improvement over the current situation with fixed xa, xb. I just think that we are not able to reach the q=0, q=1 boundaries, since for some distributions we will run into other numerical problems. And I'm curious how far we can get with this.
I don't have much of an opinion about the factor, times 2 or larger. Similarly, I don't know whether the default xa and xb are good. I changed them for a few distributions, but only where I saw obvious improvements. (There is a similar expansion of the trial space in the discrete distributions where I also was just guessing how fast to go and when to stop.)
I thought that using some adaptive heuristic to expand the trial interval might end up doing something similar to fsolve. However, your solution is faster than using fsolve. Just a quick timing in the adjusted script: brentq is much faster than fsolve for invgauss, and only a little bit slower for cauchy (fat tails).

Josef
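[Editor's note] A minimal version of that timing comparison can be sketched as follows; the bracket endpoints and the starting guess are illustrative choices, not values from Josef's adjusted script.

```python
import timeit
from scipy import stats, optimize

mu = 7.24000019602
f = lambda x: stats.invgauss.cdf(x, mu) - 0.8455

# brentq needs a sign-changing bracket; fsolve only needs a starting guess.
t_brentq = timeit.timeit(lambda: optimize.brentq(f, 1e-8, 100.0), number=200)
t_fsolve = timeit.timeit(lambda: optimize.fsolve(f, 5.0), number=200)
print('brentq: %.4fs  fsolve: %.4fs (200 calls each)' % (t_brentq, t_fsolve))
```

Both solvers find the same root; only the timing differs, and the gap grows with the cost of each cdf evaluation.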
Josef
Note: I removed the scale in your example, because internal _ppf works on the standard distribution, loc=0, scale=1. loc and scale are added generically in .ppf
Thanks,
Josef
from scipy import stats, optimize
def findppf(dist, q, *args):
    # search until cdf(left) < q < cdf(right)
    left, right = dist.xa, dist.xb
    counter = 0
    while dist.cdf(left, *args) > q:
        right = left
        left *= 2
        counter += 1
        print counter, left, right
    while dist.cdf(right, *args) < q:
        left = right
        right *= 2
        counter += 1
        print counter, left, right
    return optimize.brentq(lambda x: dist.cdf(x, *args) - q, left, right)

print
print 'invgauss'
s = 7.24000019602
sol = findppf(stats.invgauss, 0.8455, s)
print sol
sol = findppf(stats.invgauss, 1-1e-8, s)
print 'roundtrip', 1-1e-8, sol, stats.invgauss.cdf(sol, s)
print 1e-30, stats.invgauss.cdf(findppf(stats.invgauss, 1e-30, s), s)

print '\nt'
print findppf(stats.t, 1-1e-8, s), stats.t.ppf(1-1e-8, s)
print findppf(stats.t, 1e-8, s), stats.t.ppf(1e-8, s)

print '\ncauchy'
print findppf(stats.cauchy, 1e-8), stats.cauchy.ppf(1e-8)

print '\nf'
print findppf(stats.f, 1-1e-8, 2, 10), stats.f.ppf(1-1e-8, 2, 10)
for xa, xb it doesn't matter whether they are larger or smaller than
zero, so I don't think we need a special check
I think it does. Suppose that in the algo left = xa = 0.5 (because the user has been fiddling with xa) and cdf(xa) > q. Then setting left = 2*left will only worsen the problem. Or am I missing something?
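[Editor's note] Nicky's failure mode can be checked in two lines of pure Python: with a positive left bound, doubling moves the bracket the wrong way. The sign-aware variant below is a hypothetical fix for illustration, not code from the thread.

```python
def naive_expand(left):
    # the downward update from the posted algorithm
    return 2 * left

def sign_aware_expand(left):
    # hypothetical fix: jump across zero for a positive bound,
    # double the magnitude otherwise
    return -left if left > 0 else (2 * left if left < 0 else -1.0)

left = 0.5                             # a user-supplied positive xa
assert naive_expand(left) > left       # 1.0: moves right, worsens the bracket
assert sign_aware_expand(left) < left  # -0.5: moves left, as intended
print(naive_expand(left), sign_aware_expand(left))
```

As Josef notes below, no predefined xa in scipy is strictly positive, so the naive update happens to be safe for the shipped distributions.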
it looks good in a few more example cases.
I found another small bug, please see the included code.
The difficult cases will be where cdf also doesn't exist and we need to get it through integrate.quad, but I don't remember which distribution is a good case.
This case is harder indeed. (I assume you mean by 'not exist' that there is no closed form expression for the cdf, like the normal distribution). Computing the ppf would involve calling quad a lot of times. This is wasteful especially since the computation of cdf(b) includes the computation of cdf(a) for a < b, supposing that quad runs from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) + quad(f, a, b), assuming that cdf(a) has been computed already. (perhaps I am not clear enough here. If so, let me know.)
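[Editor's note] The incremental idea Nicky describes can be sketched with a small caching wrapper; the helper name and memo scheme are illustrative, not scipy code. It pays off when the cdf is queried at an increasing sequence of points, as in the bracket search.

```python
import math
from scipy import integrate

def make_cdf(pdf, a=0.0):
    """Return a cdf that remembers the last (x, F(x)) pair and, for an
    increasing sequence of queries, integrates only over the new piece."""
    memo = {'x': a, 'F': 0.0}
    def cdf(x):
        if x >= memo['x']:
            inc, _ = integrate.quad(pdf, memo['x'], x)
            memo['x'], memo['F'] = x, memo['F'] + inc
            return memo['F']
        F, _ = integrate.quad(pdf, a, x)  # non-monotone query: start over
        return F
    return cdf

cdf = make_cdf(lambda x: math.exp(-x), a=0.0)  # exponential on [0, inf)
print(cdf(1.0))   # quad over [0, 1]
print(cdf(2.0))   # reuses cdf(1.0); quad only over [1, 2]
```

The accumulated-rounding question (many small quad pieces vs. one long integral) would need checking before anything like this went into distributions.py.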
I just think that we are not able to reach the q=0, q=1 boundaries, since for some distributions we will run into other numerical problems. And I'm curious how far we can get with this.
I completely forgot to include a test on the obvious cases q >= 1. - np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the attached file.
Similarly, I don't know whether the default xa and xb are good. I changed them for a few distributions, but only where I saw obvious improvements.
I also have no clue what would be good values in general. The choices seem reasonable from a practical point of view...
Note: I removed the scale in your example, because internal _ppf works on the standard distribution, loc=0, scale=1. loc and scale are added generically in .ppf
Thanks. I also included **kwds so that I can pass scale = 10 or something like this. Once all works as it should, I'll try to convert the code so that it fits nicely in distributions.py.

The simultaneous updating of left and right in the previous algo is wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both left and right would `move to the left'. This is clearly wrong. The included code should be better.

With regard to the values of xb and xa: can an `ordinary' user change these? If so, then the ppf finder should include some protection in my opinion. If not, the user will get an error that brentq does not have the right limits, but this error might be somewhat unexpected. (What has brentq to do with finding the ppf?) Of course, looking at the code this is clear, but I expect most users will not do so.

The code contains two choices about how to handle xa and xb. Do you have any preference?

Thanks for your feedback. Very helpful.
On Mon, Apr 23, 2012 at 2:58 PM, nicky van foreest <vanforeest@gmail.com> wrote:
for xa, xb it doesn't matter whether they are larger or smaller than
zero, so I don't think we need a special check
I think it does, for suppose that in the algo left = xa = 0.5 (because the user has been fiddling with xa) and cdf(xa) > q. Then setting left = 2*left will only worsen the problem. Or do I miss something?
True, however I don't think we have any predefined xa and xb that are both strictly positive or both strictly negative. pareto is the only distribution bounded away from zero that I know, and it has xa = -10
it looks good in a few more example cases.
I found another small bug, please see the included code.
later today
The difficult cases will be where cdf also doesn't exist and we need to get it through integrate.quad, but I don't remember which distribution is a good case.
This case is harder indeed. (I assume you mean by 'not exist' that there is no closed form expression for the cdf, like the normal distribution). Computing the ppf would involve calling quad a lot of times. This is wasteful especially since the computation of cdf(b) includes the computation of cdf(a) for a < b, supposing that quad runs from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) + quad(f, a, b), assuming that cdf(a) has been computed already. (perhaps I am not clear enough here. If so, let me know.)
not exists = not defined as _cdf method; could also be scipy.special if there are no closed form expressions

quad should run from dist.a to x, I guess; dist.a might be -inf
I just think that we are not able to reach the q=0, q=1 boundaries, since for some distributions we will run into other numerical problems. And I'm curious how far we can get with this.
I completely missed to include a test on the obvious cases q >= 1. - np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the attached file.
findppf(stats.expon, 1e-30)
-6.3593574850511882e-13
lower bound q can be small and won't run into problems with being 0, until 1e-300?

The right answer should be dist.b for q numerically 1; the lower support point is dist.a, but I don't see when we would need it.
Similarly, I don't know whether the default xa and xb are good. I changed them for a few distributions, but only where I saw obvious improvements.
I also have no clue what would be good values in general. The choices seems reasonable from a practical point of view...
Note: I removed the scale in your example, because internal _ppf works on the standard distribution, loc=0, scale=1. loc and scale are added generically in .ppf
Thanks. I included also **kwds so that I can pass scale = 10 or something like this. Once all works as it should, I'll try to convert the code such that it fits nicely in distributions.py.
with self instead of dist, it should already have the signature about right, no **kwds I assume
The simultaneous updating of left and right in the previous algo is wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both left and right would `move to the left'. This is clearly wrong. The included code should be better.
would move to the *right*?

I thought the original was a nice trick: we can shift both left and right, since we know it has to be in that direction; the cut-off range cannot contain the answer.

Or do I miss the point?
With regard to the values of xb and xa. Can a `ordinary' user change these? If so, then the ppf finder should include some protection in my opinion. If not, the user will get an error that brentq has not the right limits, but this error might be somewhat unexpected. (What has brentq to do with finding the ppf?) Of course, looking at the code this is clear, but I expect most users will no do so.
I don't think `ordinary' users should touch xa, xb, but they could. Except for getting around the limitation in this ticket there is no reason to change xa, xb, so we could make them private _xa, _xb instead.
The code contains two choices about how to handle xa and xb. Do you have any preference?
I don't really like choice 1, because it removes the use of the predefined xa, xb. On the other hand, with this extension, xa and xb wouldn't really be necessary anymore.

another possibility would be to try/except brentq with xa, xb first and get most cases, and switch to your version if needed. I'm not sure xa, xb are defined well enough that it's worth to go this route, though.

Thanks for working on this,

Josef
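[Editor's note] The try/except route Josef suggests could look like the sketch below; the fixed bracket and the doubling factor are the same assumptions as earlier, and the ValueError caught is brentq's complaint about a bracket without a sign change.

```python
from scipy import stats, optimize

def ppf_with_fallback(dist, q, *args, xa=-10.0, xb=10.0):
    f = lambda x: dist.cdf(x, *args) - q
    try:
        # fast path: the predefined bracket covers most cases
        return optimize.brentq(f, xa, xb)
    except ValueError:
        # brentq rejects a bracket where f(a), f(b) have the same sign;
        # expand it until cdf(left) < q < cdf(right)
        left, right = xa, xb
        while dist.cdf(left, *args) > q:
            right, left = left, 2 * left
        while dist.cdf(right, *args) < q:
            left, right = right, 2 * right
        return optimize.brentq(f, left, right)

# cauchy's 1e-8 quantile (about -3.2e7) lies far outside [-10, 10]:
print(ppf_with_fallback(stats.cauchy, 1e-8))
```

The fast path costs nothing extra for well-behaved cases; only fat-tailed quantile requests pay for the expansion loop.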
Thanks for your feedback. Very helpful.
I'll get back to this tomorrow evening. I promised to finish something today.

With respect to computing the sum of two random variables: this turns out to be quite a challenge. To resolve this I decided to formulate it as a possible assignment for students... Thus, this takes somewhat longer than I expected. On the other hand, hopefully it leads to some more scipy converts...

On 23 April 2012 22:04, <josef.pktd@gmail.com> wrote:
On Mon, Apr 23, 2012 at 2:58 PM, nicky van foreest <vanforeest@gmail.com> wrote:
for xa, xb it doesn't matter whether they are larger or smaller than
zero, so I don't think we need a special check
I think it does, for suppose that in the algo left = xa = 0.5 (because the user has been fiddling with xa) and cdf(xa) > q. Then setting left = 2*left will only worsen the problem. Or do I miss something?
True, however I don't think we have any predefined xa and xb that both are strictly positive or negative values. pareto is the only distribution bounded away from zero that I know and it has xa = -10
it looks good in a few more example cases.
I found another small bug, please see the included code.
later today
The difficult cases will be where cdf also doesn't exist and we need to get it through integrate.quad, but I don't remember which distribution is a good case.
This case is harder indeed. (I assume you mean by 'not exist' that there is no closed form expression for the cdf, like the normal distribution). Computing the ppf would involve calling quad a lot of times. This is wasteful especially since the computation of cdf(b) includes the computation of cdf(a) for a < b, supposing that quad runs from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) + quad(f, a, b), assuming that cdf(a) has been computed already. (perhaps I am not clear enough here. If so, let me know.)
not exists = not defined as _cdf method could also be scipy.special if there are no closed form expressions
quad should run from dist.a to x, I guess, dist.a might be -inf
I just think that we are not able to reach the q=0, q=1 boundaries, since for some distributions we will run into other numerical problems. And I'm curious how far we can get with this.
I completely missed to include a test on the obvious cases q >= 1. - np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the attached file.
findppf(stats.expon, 1e-30)
-6.3593574850511882e-13
lower bound q can be small and won't run into problems with being 0, until 1e-300?
The right answer should be dist.b for q=numerically 1, lower support point is dist.a but I don't see when we would need it.
Similarly, I don't know whether the default xa and xb are good. I changed them for a few distributions, but only where I saw obvious improvements.
I also have no clue what would be good values in general. The choices seems reasonable from a practical point of view...
Note: I removed the scale in your example, because internal _ppf works on the standard distribution, loc=0, scale=1. loc and scale are added generically in .ppf
Thanks. I included also **kwds so that I can pass scale = 10 or something like this. Once all works as it should, I'll try to convert the code such that it fits nicely in distributions.py.
with self instead of dist, it should already have the signature about right, no **kwds I assume
The simultaneous updating of left and right in the previous algo is wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both left and right would `move to the left'. This is clearly wrong. The included code should be better.
would move to the *right* ?
I thought the original was a nice trick, we can shift both left and right since we know it has to be in that direction, the cut of range cannot contain the answer.
Or do I miss the point?
With regard to the values of xb and xa. Can a `ordinary' user change these? If so, then the ppf finder should include some protection in my opinion. If not, the user will get an error that brentq has not the right limits, but this error might be somewhat unexpected. (What has brentq to do with finding the ppf?) Of course, looking at the code this is clear, but I expect most users will no do so.
I don't think `ordinary' users should touch xa, xb, but they could. Except for getting around the limitation in this ticket there is no reason to change xa, xb, so we could make them private _xa, _xb instead.
The code contains two choices about how to handle xa and xb. Do you have any preference?
I don't really like choice 1, because it removes the use of the predefined xa, xb. On the other hand, with this extension, xa and xb wouldn't be really necessary anymore.
another possibility would be to try except brentq with xa, xb first and get most cases, and switch to your version if needed. I'm not sure xa, xb are defined well enough that it's worth to go this route, though.
Thanks for working on this,
Josef
Thanks for your feedback. Very helpful.
The difficult cases will be where cdf also doesn't exist and we need to get it through integrate.quad, but I don't remember which distribution is a good case.
This case is harder indeed. (I assume you mean by 'not exist' that there is no closed form expression for the cdf, like the normal distribution). Computing the ppf would involve calling quad a lot of times. This is wasteful especially since the computation of cdf(b) includes the computation of cdf(a) for a < b, supposing that quad runs from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) + quad(f, a, b), assuming that cdf(a) has been computed already. (perhaps I am not clear enough here. If so, let me know.)
not exists = not defined as _cdf method could also be scipy.special if there are no closed form expressions
I see, sure.
I just think that we are not able to reach the q=0, q=1 boundaries, since for some distributions we will run into other numerical problems. And I'm curious how far we can get with this.
I completely missed to include a test on the obvious cases q >= 1. - np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the attached file.
findppf(stats.expon, 1e-30) -6.3593574850511882e-13
This result shows actually that xa and xb are necessary to include in the specification of the distribution. The exponential distribution is (usually) defined only on [0, \infty) not on the negative numbers. The result above is negative though. This is of course a simple consequence of calling brentq. From a user's perspective, though, I would become very suspicious about this negative result.
The right answer should be dist.b for q=numerically 1, lower support point is dist.a but I don't see when we would need it.
I agree, provided xa and xb are always properly defined. But then, (just to be nitpicking), the definition of expon does not set xa and xb explicitly. Hence xa = -10, and this is somewhat undesirable, given the negative value above.
The simultaneous updating of left and right in the previous algo is wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both left and right would `move to the left'. This is clearly wrong. The included code should be better.
would move to the *right* ?
Sure.
I thought the original was a nice trick, we can shift both left and right since we know it has to be in that direction, the cut of range cannot contain the answer.
Or do I miss the point?
No, you are right. When I wrote this at first, I also thought about the point you bring up here. Then, I was somewhat dissatisfied with calling the while loop twice (suppose the left bound requires updating, then certainly the second while loop (to update the right bound) is unnecessary, and calling cdf(right) is useless). While trying to fix this, I forgot about my initial ideas...
With regard to the values of xb and xa. Can a `ordinary' user change these? If so, then the ppf finder should include some protection in my opinion. If not, the user will get an error that brentq has not the right limits, but this error might be somewhat unexpected. (What has brentq to do with finding the ppf?) Of course, looking at the code this is clear, but I expect most users will no do so.
I don't think `ordinary' users should touch xa, xb, but they could. Except for getting around the limitation in this ticket there is no reason to change xa, xb, so we could make them private _xa, _xb instead.
I think that would be better. Thus, the developer that subclasses rv_continuous should set _xa and _xb properly.
The code contains two choices about how to handle xa and xb. Do you have any preference?
I don't really like choice 1, because it removes the use of the predefined xa, xb. On the other hand, with this extension, xa and xb wouldn't be really necessary anymore.
In view of your example with findppf(expon(1e-30)) I prefer to use _xa and _xb.
another possibility would be to try except brentq with xa, xb first and get most cases, and switch to your version if needed. I'm not sure xa, xb are defined well enough that it's worth to go this route, though.
I think that this makes the most sense. The definition of the class should include sensible values of xa and xb.

All in all, I would like to make the following proposal to resolve the ticket in a generic way.

1) xa and xb should become private class members _xa and _xb

2) _xa and _xb should be given proper values in the class definition, e.g. expon._xa = 0 and expon._xb = 30., since exp(-30) = 9.35e-14.

3) given a quantile q in the ppf function, include a test on _cdf(_xa) <= q <= _cdf(_xb). If this fails, return an exception with some text that tells that either _cdf(_xa) > q or _cdf(_xb) < q.

Given your comments I actually favor all this searching for left and right not that much anymore. It is generic, but it places the responsibility of good code at the wrong place.

Nicky
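[Editor's note] Point 3 of the proposal could be sketched as below. The bounds passed in and the exception text are illustrative; in distributions.py this check would live inside rv_continuous.ppf with self._xa and self._xb.

```python
from scipy import stats, optimize

def ppf_checked(dist, q, *args, _xa=-10.0, _xb=10.0):
    # fail loudly when q falls outside cdf([_xa, _xb]) instead of letting
    # brentq raise a confusing bracket error
    Fa, Fb = dist.cdf(_xa, *args), dist.cdf(_xb, *args)
    if not (Fa <= q <= Fb):
        raise ValueError('q=%g is outside [_cdf(_xa)=%g, _cdf(_xb)=%g]'
                         % (q, Fa, Fb))
    return optimize.brentq(lambda x: dist.cdf(x, *args) - q, _xa, _xb)

# With expon._xa = 0 (point 2 of the proposal), a tiny q can no longer
# yield the negative answer seen above:
print(ppf_checked(stats.expon, 1e-30, _xa=0.0, _xb=30.0))
```

The error message then talks about quantiles and cdf bounds, which a user of ppf can act on, rather than about brentq internals.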
On Wed, Apr 25, 2012 at 3:21 PM, nicky van foreest <vanforeest@gmail.com> wrote:
The difficult cases will be where cdf also doesn't exist and we need to get it through integrate.quad, but I don't remember which distribution is a good case.
This case is harder indeed. (I assume you mean by 'not exist' that there is no closed form expression for the cdf, like the normal distribution). Computing the ppf would involve calling quad a lot of times. This is wasteful especially since the computation of cdf(b) includes the computation of cdf(a) for a < b, supposing that quad runs from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) + quad(f, a, b), assuming that cdf(a) has been computed already. (perhaps I am not clear enough here. If so, let me know.)
not exists = not defined as _cdf method could also be scipy.special if there are no closed form expressions
I see, sure.
I just think that we are not able to reach the q=0, q=1 boundaries, since for some distributions we will run into other numerical problems. And I'm curious how far we can get with this.
I completely missed to include a test on the obvious cases q >= 1. - np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the attached file.
findppf(stats.expon, 1e-30) -6.3593574850511882e-13
This result shows actually that xa and xb are necessary to include in the specification of the distribution. The exponential distribution is (usually) defined only on [0, \infty) not on the negative numbers. The result above is negative though. This is of course a simple consequence of calling brentq. From a user's perspective, though, I would become very suspicious about this negative result.
good argument to clean up xa, xb
The right answer should be dist.b for q=numerically 1, lower support point is dist.a but I don't see when we would need it.
I agree, provided xa and xb are always properly defined. But then, (just to be nitpicking), the definition of expon does not set xa and xb explicitly. Hence xa = -10, and this is somewhat undesirable, given the negative value above.
The simultaneous updating of left and right in the previous algo is wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both left and right would `move to the left'. This is clearly wrong. The included code should be better.
would move to the *right* ?
Sure.
I thought the original was a nice trick: we can shift both left and right, since we know the answer has to be in that direction; the part of the range we cut off cannot contain the answer.
Or do I miss the point?
No, you are right. When I wrote this at first, I also thought about the point you bring up here. Then, I was somewhat dissatisfied with calling the while loop twice (suppose the left bound requires updating, then certainly the second while loop (to update the right bound) is unnecessary, and calling cdf(right) is useless). While trying to fix this, I forgot about my initial ideas...
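For reference, the direction-aware expanding search could look like the following sketch (my own reconstruction, not the attached file; taking the distribution as an argument, the growth factor of 10, and the eps guards near q = 0 and q = 1 are choices made elsewhere in this thread):

```python
import numpy as np
from scipy import optimize, stats


def findppf(dist, q, *args):
    # Generic ppf: expand the bracket [left, right] until
    # cdf(left) <= q <= cdf(right), then let brentq do the fast solving.
    eps = np.finfo(float).eps
    if q <= eps:
        return dist.a                     # lower support point
    if q >= 1.0 - eps:
        return dist.b                     # upper support point
    left, right = -1.0, 1.0
    while dist.cdf(left, *args) > q:      # shift both bounds to the left
        right = left
        left *= 10
    while dist.cdf(right, *args) < q:     # shift both bounds to the right
        left = right
        right *= 10
    return optimize.brentq(lambda x: dist.cdf(x, *args) - q, left, right)


print(findppf(stats.expon, 0.5))          # ln 2 ~ 0.6931
print(findppf(stats.invgauss, 0.999, 1))  # ~8.3549
```

Note that at most one of the two while loops can run, so the concern about a useless second loop only costs one extra cdf evaluation.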
With regard to the values of xb and xa: can an `ordinary' user change these? If so, then the ppf finder should include some protection, in my opinion. If not, the user will get an error that brentq does not have the right limits, but this error might be somewhat unexpected. (What does brentq have to do with finding the ppf?) Of course, looking at the code this is clear, but I expect most users will not do so.
I don't think `ordinary' users should touch xa, xb, but they could. Except for getting around the limitation in this ticket there is no reason to change xa, xb, so we could make them private _xa, _xb instead.
I think that would be better. Thus, the developer that subclasses rv_continuous should set _xa and _xb properly.
The code contains two choices about how to handle xa and xb. Do you have any preference?
I don't really like choice 1, because it removes the use of the predefined xa, xb. On the other hand, with this extension, xa and xb wouldn't be really necessary anymore.
In view of your example with findppf(stats.expon, 1e-30) I prefer to use _xa and _xb.
another possibility would be to try except brentq with xa, xb first and get most cases, and switch to your version if needed. I'm not sure xa, xb are defined well enough that it's worth to go this route, though.
I think that this makes the most sense. The definition of the class should include sensible values of xa and xb.
All in all, I would like to make the following proposal to resolve the ticket in a generic way.
1) xa and xb should become private class members _xa and _xb.
2) _xa and _xb should be given proper values in the class definition, e.g. expon._xa = 0 and expon._xb = 30., since exp(-30) = 9.35e-14.
3) Given a quantile q in the ppf function, include a test on _cdf(_xa) <= q <= _cdf(_xb). If this fails, raise an exception with some text that tells the user that either _cdf(_xa) > q or _cdf(_xb) < q.
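A rough sketch of the check in 3) (illustrative only: rv_continuous has no _xa/_xb members yet, so here the bracket is passed in explicitly, and the helper name and error text are my own):

```python
from scipy import optimize, stats


def ppf_checked(dist, q, _xa, _xb, *args):
    # Proposal 3): verify that q lies in [cdf(_xa), cdf(_xb)] before
    # handing the bracket to brentq, and raise an informative error
    # otherwise, instead of a cryptic brentq message.
    if not dist.cdf(_xa, *args) <= q <= dist.cdf(_xb, *args):
        raise ValueError(
            "q=%g outside [cdf(_xa), cdf(_xb)]; adjust _xa or _xb" % q)
    return optimize.brentq(lambda x: dist.cdf(x, *args) - q, _xa, _xb)


# e.g. expon with _xa = 0 and _xb = 30 as suggested above
print(ppf_checked(stats.expon, 0.5, 0.0, 30.0))   # ln 2 ~ 0.6931
```

For q very close to 1 this raises, since cdf(30) for expon is numerically below 1 - 1e-14; that is exactly the case where the exception text points the user at _xa and _xb.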
Given your comments, I actually don't favor all this searching for left and right that much anymore. It is generic, but it places the responsibility for good code in the wrong place.
3) I prefer your expanding of the search to raising an exception to the user. Note also that your 3) is inconsistent with 1): if a user-visible exception is raised, then the user needs to change xa or xb, so they shouldn't be private. That's the current situation (except for a more cryptic message).

2) I'm all in favor, especially for distributions bounded on one side, where it should be easy to go through those. There might be a few where the bound moves with the shape, but the only one I remember is genextreme, and that has an explicit _ppf.

So I would prefer 1), 2) and your new enhanced generic _ppf.

Josef
Nicky

_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-dev
On Wed, Apr 25, 2012 at 3:49 PM, <josef.pktd@gmail.com> wrote:
On Wed, Apr 25, 2012 at 3:21 PM, nicky van foreest <vanforeest@gmail.com> wrote:
> The difficult cases will be where cdf also doesn't exist and we need
> to get it through integrate.quad, but I don't remember which
> distribution is a good case.
Forgot to mention: the main reason that I like your expanding search space is that the shape of the distribution can change a lot. Even if we set xa, xb to reasonable values for likely shape parameters, they won't be good enough for others, as in the original ticket:
stats.invgauss.stats(2)
(array(2.0), array(8.0))
stats.invgauss.stats(7)
(array(7.0), array(343.0))
stats.invgauss.stats(20)
(array(20.0), array(8000.0))
stats.invgauss.stats(100)
(array(100.0), array(1000000.0))
stats.invgauss.cdf(1000, 100)
0.98335562794321207
findppf(stats.invgauss, 0.99, 100)
1926.520850319389
findppf(stats.invgauss, 0.999, 100)
13928.012903371644
findppf(stats.invgauss, 0.999, 1)
8.3548649291400938
To get a rough idea for xa and xb: of the distributions with a finite bound on either the left or the right, essentially all have the generic xa = -10 or xb = 10.
dist_cont = [getattr(stats.distributions, dname)
             for dname in dir(stats.distributions)
             if isinstance(getattr(stats.distributions, dname),
                           stats.distributions.rv_continuous)]
left = [(d.name, d.a, d.xa) for d in dist_cont if not np.isneginf(d.a)]
pprint(left)
[('alpha', 0.0, -10.0), ('anglit', -0.78539816339744828, -10.0), ('arcsine', 0.0, -10.0), ('beta', 0.0, -10.0), ('betaprime', 0.0, -10.0), ('bradford', 0.0, -10.0), ('burr', 0.0, -10.0), ('chi', 0.0, -10.0), ('chi2', 0.0, -10.0), ('cosine', -3.1415926535897931, -10.0), ('erlang', 0.0, -10.0), ('expon', 0.0, 0), ('exponpow', 0.0, -10.0), ('exponweib', 0.0, -10.0), ('f', 0.0, -10.0), ('fatiguelife', 0.0, -10.0), ('fisk', 0.0, -10.0), ('foldcauchy', 0.0, -10.0), ('foldnorm', 0.0, -10.0), ('frechet_r', 0.0, -10.0), ('gamma', 0.0, -10.0), ('gausshyper', 0.0, -10.0), ('genexpon', 0.0, -10.0), ('gengamma', 0.0, -10.0), ('genhalflogistic', 0.0, -10.0), ('genpareto', 0.0, -10.0), ('gilbrat', 0.0, -10.0), ('gompertz', 0.0, -10.0), ('halfcauchy', 0.0, -10.0), ('halflogistic', 0.0, -10.0), ('halfnorm', 0.0, -10.0), ('invgamma', 0.0, -10.0), ('invgauss', 0.0, -10.0), ('invnorm', 0.0, -10.0), ('invweibull', 0, -10.0), ('johnsonb', 0.0, -10.0), ('ksone', 0.0, -10.0), ('kstwobign', 0.0, -10.0), ('levy', 0.0, -10.0), ('loglaplace', 0.0, -10.0), ('lognorm', 0.0, -10.0), ('lomax', 0.0, -10.0), ('maxwell', 0.0, -10.0), ('mielke', 0.0, -10.0), ('nakagami', 0.0, -10.0), ('ncf', 0.0, -10.0), ('ncx2', 0.0, -10.0), ('pareto', 1.0, -10.0), ('powerlaw', 0.0, -10.0), ('powerlognorm', 0.0, -10.0), ('rayleigh', 0.0, -10.0), ('rdist', -1.0, -10.0), ('recipinvgauss', 0.0, -10.0), ('rice', 0.0, -10.0), ('semicircular', -1.0, -10.0), ('triang', 0.0, -10.0), ('truncexpon', 0.0, -10.0), ('uniform', 0.0, -10.0), ('wald', 0.0, -10.0), ('weibull_min', 0.0, -10.0), ('wrapcauchy', 0.0, -10.0)]
right = [(d.name, d.b, d.xb) for d in dist_cont if not np.isposinf(d.b)]
pprint(right)
[('anglit', 0.78539816339744828, 10.0), ('arcsine', 1.0, 10.0), ('beta', 1.0, 10.0), ('betaprime', 500.0, 10.0), ('bradford', 1.0, 10.0), ('cosine', 3.1415926535897931, 10.0), ('frechet_l', 0.0, 10.0), ('gausshyper', 1.0, 10.0), ('johnsonb', 1.0, 10.0), ('levy_l', 0.0, 10.0), ('powerlaw', 1.0, 10.0), ('rdist', 1.0, 10.0), ('semicircular', 1.0, 10.0), ('triang', 1.0, 10.0), ('uniform', 1.0, 10.0), ('weibull_max', 0.0, 10.0), ('wrapcauchy', 6.2831853071795862, 10.0)]
only pareto has both limits on the same side of zero
pprint([(d.name, d.a, d.b) for d in dist_cont if d.a*d.b > 0])
[('pareto', 1.0, inf)]
genextreme, and maybe one or two others, are missing because finite a, b are set in _argcheck. vonmises is for circular data and doesn't behave properly.

Only two distributions define non-generic xa or xb:
pprint([(d.name, d.a, d.b, d.xa, d.xb) for d in dist_cont if not d.xa*d.xb == -100])
[('foldcauchy', 0.0, inf, -10.0, 1000),
 ('recipinvgauss', 0.0, inf, -10.0, 50)]
a pull request setting correct xa, xb would be very welcome

Josef
Josef
Hi Josef,

Some time ago we discussed an algorithm to find x such that cdf(x) = q for given q, that is, a generic algorithm to solve the ppf. In the mail below you propose to set xa and xb to good initial values, but this is not a simple task. Besides this, xa and xb are solely used in the ppf computation. This made me think about an algorithm that avoids the use of xa and xb altogether. I came up with the algorithm below. The included test cases show that it works really well. What is your opinion?

In case motivation is required: I shift left and right such that eventually cdf(left) < q < cdf(right). I update by a factor of 10. I would rather not spend a lot of time in the while loop, but prefer to leave the actual solving to brentq. brentq homes in very fast, so it is better to rely on it to do the fast searching. Hence, I prefer to take 10 rather than 2 or 3 as the growth factor. Just for fun I tried 1000 as a factor; this also works well.

Do you perhaps have more challenging cases? Once we have a set of tests, I'll try to set up a performance test. Once we are satisfied I'll make a pull request.

I attached the code. BTW, I suppose that as long as we are experimenting with algorithms/code there is no reason to make a pull request.

Nicky

On 26 April 2012 05:15, <josef.pktd@gmail.com> wrote:
On Sun, May 13, 2012 at 4:34 PM, nicky van foreest <vanforeest@gmail.com> wrote:
brentq seems to work very well. Here is one case I was worried about, but it works well:
p, r = optimize.brentq(lambda x: np.minimum(1, np.maximum(x, 0)) - 1e-30, -1000, 1000, full_output=1)
p
-1.2079226507921669e-12
r.iterations
53
r.function_calls
54

p, r = optimize.brentq(lambda x: np.minimum(1, np.maximum(x, 0)) - 1e-30, -100, 100, full_output=1)
p
-9.0949470177290643e-13
r.iterations
50
r.function_calls
51

p, r = optimize.brentq(lambda x: np.minimum(1, np.maximum(x, 0)) - 1e-30, -1000, 10, full_output=1)
r.iterations
71
r.function_calls
72
aside: if q = 0 or q = 1, then brentq will find something outside the support, but this should be handled by the generic ppf code

class Fake(object):  # actually a shifted uniform
    def cdf(self, x):
        return np.minimum(1, np.maximum(1000 + x, 0))

q = 1e-6
d = Fake()
sol = findppf(d, q)
print sol, q, d.cdf(sol)
-999.999999 1e-06 9.99999997475e-07

Right now I cannot think of another case that would be difficult. I don't see anything yet to criticize in your latest version :(

Josef
Once we are satisfied I'll make a pull request.
I attached the code. BTW. I suppose that as long as we are experimenting with algorithms/code there is no reason to make a pull request.
Nicky
On 26 April 2012 05:15, <josef.pktd@gmail.com> wrote:
On Wed, Apr 25, 2012 at 3:49 PM, <josef.pktd@gmail.com> wrote:
On Wed, Apr 25, 2012 at 3:21 PM, nicky van foreest <vanforeest@gmail.com> wrote:
>>> The difficult cases will be where cdf also doesn't exist and we need >>> to get it through integrate.quad, but I don't remember which >>> distribution is a good case.
This case is harder indeed. (I assume you mean by 'not exist' that there is no closed form expression for the cdf, like the normal distribution). Computing the ppf would involve calling quad a lot of times. This is wasteful especially since the computation of cdf(b) includes the computation of cdf(a) for a < b, supposing that quad runs from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) + quad(f, a, b), assuming that cdf(a) has been computed already. (perhaps I am not clear enough here. If so, let me know.)
not exists = not defined as _cdf method could also be scipy.special if there are no closed form expressions
I see, sure.
>> I just think that we are not able to reach the q=0, q=1 boundaries, >> since for some distributions we will run into other numerical >> problems. And I'm curious how far we can get with this.
I completely missed to include a test on the obvious cases q >= 1. - np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the attached file.
>> findppf(stats.expon, 1e-30) -6.3593574850511882e-13
This result shows actually that xa and xb are necessary to include in the specification of the distribution. The exponential distribution is (usually) defined only on [0, \infty) not on the negative numbers. The result above is negative though. This is of course a simple consequence of calling brentq. From a user's perspective, though, I would become very suspicious about this negative result.
good argument to clean up xa, xb
The right answer should be dist.b for q=numerically 1, lower support point is dist.a but I don't see when we would need it.
I agree, provided xa and xb are always properly defined. But then, (just to be nitpicking), the definition of expon does not set xa and xb explicitly. Hence xa = -10, and this is somewhat undesirable, given the negative value above.
The simultaneous updating of left and right in the previous algo is wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both left and right would `move to the left'. This is clearly wrong. The included code should be better.
would move to the *right* ?
Sure.
I thought the original was a nice trick, we can shift both left and right since we know it has to be in that direction, the cut of range cannot contain the answer.
Or do I miss the point?
No, you are right. When I wrote this at first, I also thought about the point you bring up here. Then, I was somewhat dissatisfied with calling the while loop twice (suppose the left bound requires updating, then certainly the second while loop (to update the right bound) is unnecessary, and calling cdf(right) is useless). While trying to fix this, I forgot about my initial ideas...
With regard to the values of xb and xa. Can a `ordinary' user change these? If so, then the ppf finder should include some protection in my opinion. If not, the user will get an error that brentq has not the right limits, but this error might be somewhat unexpected. (What has brentq to do with finding the ppf?) Of course, looking at the code this is clear, but I expect most users will no do so.
I don't think `ordinary' users should touch xa, xb, but they could. Except for getting around the limitation in this ticket there is no reason to change xa, xb, so we could make them private _xa, _xb instead.
I think that would be better. Thus, the developer that subclasses rv_continuous should set _xa and _xb properly.
The code contains two choices about how to handle xa and xb. Do you have any preference?
I don't really like choice 1, because it removes the use of the predefined xa, xb. On the other hand, with this extension, xa and xb wouldn't be really necessary anymore.
In view of your example with findppf(expon(1e-30)) I prefer to use _xa and _xb.
another possibility would be to try except brentq with xa, xb first and get most cases, and switch to your version if needed. I'm not sure xa, xb are defined well enough that it's worth to go this route, though.
I think that this makes the most sense. The definition of the class should include sensible values of xa and xb.
All in all, I would like to make the following proposal to resolve the ticket in a generic way.
1) xa and xb should become private class members _xa and _xb 2) _xa and _xb should be given proper values in the class definition, e.g. expon._xa = 0 and expon._xb = 30., since exp(-30) = 9.35e-14. 3) given a quantile q in the ppf function, include a test on _cdf(_xa) <= q <= _cdf(_xb). If this fails, return an exception with some text that tells that either _cdf(_xa) > q or _cdf(_xb) < q.
Given your comments I actually favor all this searching for left and right not that much anymore. It is generic, but it places the responsibility of good code at the wrong place.
3) I prefer your expanding the search to raising an exception to the user. Note also that your 3) is inconsistent with 1). If a user visible exception is raised, then the user needs to change xa or xb, so it shouldn't be private. That's the current situation (except for a more cryptic message).
2) I'm all in favor, especially for one-side bound distributions, where it should be easy to go through those. There might be a few where the bound moves with the shape, but the only one I remember is genextreme and that has an explicit _ppf
So I would prefer 1), 2) and your new enhanced generic _ppf
forgot to mention
the main reason that I like your expanding search space is that the shape of the distribution can change a lot. Even if we set xa, xb to reasonable values for likely shape parameters they won't be good enough for others, as in the original ticket
>>> stats.invgauss.stats(2)
(array(2.0), array(8.0))
>>> stats.invgauss.stats(7)
(array(7.0), array(343.0))
>>> stats.invgauss.stats(20)
(array(20.0), array(8000.0))
>>> stats.invgauss.stats(100)
(array(100.0), array(1000000.0))
>>> stats.invgauss.cdf(1000, 100)
0.98335562794321207
>>> findppf(stats.invgauss, 0.99, 100)
1926.520850319389
>>> findppf(stats.invgauss, 0.999, 100)
13928.012903371644
>>> findppf(stats.invgauss, 0.999, 1)
8.3548649291400938
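The exact findppf used for this call signature isn't shown in the thread; a sketch consistent with the earlier expanding-search version (assuming generic bounds xa = -10, xb = 10 with xa < 0 and xb > 0) would be:

```python
from scipy import optimize, stats

def findppf(dist, q, *args):
    # sketch: expand [left, right] until cdf(left) <= q <= cdf(right),
    # doubling outward from the generic bounds, then root-find with brentq
    left, right = -10.0, 10.0
    while dist.cdf(left, *args) > q:
        left, right = 2.0 * left, left
    while dist.cdf(right, *args) < q:
        left, right = right, 2.0 * right
    return optimize.brentq(lambda x: dist.cdf(x, *args) - q, left, right)

print(findppf(stats.invgauss, 0.99, 100))   # roughly 1926.5, cf. the session above
```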
To get a rough idea of xa and xb: among distributions with a finite bound on either the left or the right, almost all have the generic xa = -10 or xb = 10.
dist_cont = [getattr(stats.distributions, dname)
             for dname in dir(stats.distributions)
             if isinstance(getattr(stats.distributions, dname),
                           stats.distributions.rv_continuous)]
left = [(d.name, d.a, d.xa) for d in dist_cont if not np.isneginf(d.a)]
pprint(left)
[('alpha', 0.0, -10.0), ('anglit', -0.78539816339744828, -10.0),
 ('arcsine', 0.0, -10.0), ('beta', 0.0, -10.0), ('betaprime', 0.0, -10.0),
 ('bradford', 0.0, -10.0), ('burr', 0.0, -10.0), ('chi', 0.0, -10.0),
 ('chi2', 0.0, -10.0), ('cosine', -3.1415926535897931, -10.0),
 ('erlang', 0.0, -10.0), ('expon', 0.0, 0), ('exponpow', 0.0, -10.0),
 ('exponweib', 0.0, -10.0), ('f', 0.0, -10.0), ('fatiguelife', 0.0, -10.0),
 ('fisk', 0.0, -10.0), ('foldcauchy', 0.0, -10.0), ('foldnorm', 0.0, -10.0),
 ('frechet_r', 0.0, -10.0), ('gamma', 0.0, -10.0), ('gausshyper', 0.0, -10.0),
 ('genexpon', 0.0, -10.0), ('gengamma', 0.0, -10.0),
 ('genhalflogistic', 0.0, -10.0), ('genpareto', 0.0, -10.0),
 ('gilbrat', 0.0, -10.0), ('gompertz', 0.0, -10.0),
 ('halfcauchy', 0.0, -10.0), ('halflogistic', 0.0, -10.0),
 ('halfnorm', 0.0, -10.0), ('invgamma', 0.0, -10.0), ('invgauss', 0.0, -10.0),
 ('invnorm', 0.0, -10.0), ('invweibull', 0, -10.0), ('johnsonb', 0.0, -10.0),
 ('ksone', 0.0, -10.0), ('kstwobign', 0.0, -10.0), ('levy', 0.0, -10.0),
 ('loglaplace', 0.0, -10.0), ('lognorm', 0.0, -10.0), ('lomax', 0.0, -10.0),
 ('maxwell', 0.0, -10.0), ('mielke', 0.0, -10.0), ('nakagami', 0.0, -10.0),
 ('ncf', 0.0, -10.0), ('ncx2', 0.0, -10.0), ('pareto', 1.0, -10.0),
 ('powerlaw', 0.0, -10.0), ('powerlognorm', 0.0, -10.0),
 ('rayleigh', 0.0, -10.0), ('rdist', -1.0, -10.0),
 ('recipinvgauss', 0.0, -10.0), ('rice', 0.0, -10.0),
 ('semicircular', -1.0, -10.0), ('triang', 0.0, -10.0),
 ('truncexpon', 0.0, -10.0), ('uniform', 0.0, -10.0), ('wald', 0.0, -10.0),
 ('weibull_min', 0.0, -10.0), ('wrapcauchy', 0.0, -10.0)]
right = [(d.name, d.b, d.xb) for d in dist_cont if not np.isposinf(d.b)]
pprint(right)
[('anglit', 0.78539816339744828, 10.0), ('arcsine', 1.0, 10.0),
 ('beta', 1.0, 10.0), ('betaprime', 500.0, 10.0), ('bradford', 1.0, 10.0),
 ('cosine', 3.1415926535897931, 10.0), ('frechet_l', 0.0, 10.0),
 ('gausshyper', 1.0, 10.0), ('johnsonb', 1.0, 10.0), ('levy_l', 0.0, 10.0),
 ('powerlaw', 1.0, 10.0), ('rdist', 1.0, 10.0), ('semicircular', 1.0, 10.0),
 ('triang', 1.0, 10.0), ('uniform', 1.0, 10.0), ('weibull_max', 0.0, 10.0),
 ('wrapcauchy', 6.2831853071795862, 10.0)]
only pareto has both limits on the same side of zero
pprint([(d.name, d.a, d.b) for d in dist_cont if d.a * d.b > 0])
[('pareto', 1.0, inf)]
genextreme, and maybe one or two others, are missing because finite a, b are set in _argcheck; vonmises is for circular data and doesn't behave properly.
only two distributions define non-generic xa or xb
pprint([(d.name, d.a, d.b, d.xa, d.xb) for d in dist_cont if not d.xa * d.xb == -100])
[('foldcauchy', 0.0, inf, -10.0, 1000),
 ('recipinvgauss', 0.0, inf, -10.0, 50)]
a pull request setting correct xa, xb would be very welcome
Josef
Nicky

_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-dev
one case I was worried about, but works well:
>>> p, r = optimize.brentq(lambda x: np.minimum(1, np.maximum(x, 0)) - 1e-30,
...                        -1000, 1000, full_output=1)
>>> p
-1.2079226507921669e-12
Nice example. The answer is negative, while it should be positive, but the answer is within numerical accuracy I would say.
I don't see anything yet to criticize in your latest version :(
Ok. I just checked the tests in scipy/stats/tests. It seems that these need not be changed. Thus I propose to do the following:
- make a new branch
- repair the cases q = 0 and q = 1 by means of an explicit test
- implement findppf in a suitable way in distributions.py
- remove xa and xb
- send a pull request
In case this list is not complete, please let me know. Otherwise you'll see the pull request.
Nicky
On Mon, May 14, 2012 at 1:56 PM, nicky van foreest <vanforeest@gmail.com> wrote:
oops, didn't we have a case with negative sign already? maybe a check self.a <= p <= self.b?
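The suggested safeguard could be as simple as clamping the root returned by brentq back into the support, so tiny numerical overshoot (like the -1.2e-12 above) cannot land outside [a, b]. A sketch (the helper name is made up here):

```python
import math

def clip_to_support(p, a, b):
    # clamp the brentq result into the support [a, b]
    return min(max(p, a), b)

# the slightly negative root from the example above, for a distribution
# with support [0, inf)
print(clip_to_support(-1.2079226507921669e-12, 0.0, math.inf))   # 0.0
```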
I don't see anything yet to criticize in your latest version :(
Ok. I just checked the tests in scipy/stats/tests.
If you are curious, you could temporarily go closer to q=0 and q=1 in the tests for ppf, and see whether it breaks for any distribution.
It seems that these need not be changed. Thus I propose to do the following
- make a new branch - repair for the cases q = 0 and q = 1 by means of an explicit test.
isn't ppf (generic part) taking care of this, if not then it should, I think
ppf(0) = self.a
ppf(1) = self.b
- implement findppf in a suitable way in distributions.py - remove xa and xb - send a pull request
In case this list is not complete, please let me know. Otherwise you'll see the pull request
sounds good. Josef <maintenance worker for statsmodels>
Nicky
Nice example. The answer is negative, while it should be positive, but the answer is within numerical accuracy I would say.
oops, didn't we have a case with negative sign already ? maybe a check self.a <= p <= self.b ?
I included this. I also think that a check on whether left and right stay within self.a and self.b should be included, perhaps just for safety reasons.
I don't see anything yet to criticize in your latest version :(
Ok. I just checked the tests in scipy/stats/tests.
If you are curious, you could temporarily go closer to q=0 and q=1 in the tests for ppf, and see whether it breaks for any distribution.
Good idea. Just to see what would happen I changed the following code in test_continuous_basic.py:

@_silence_fp_errors
def check_cdf_ppf(distfn, arg, msg):
    values = [-1.e-5, 0., 0.001, 0.5, 0.999, 1.]
    npt.assert_almost_equal(distfn.cdf(distfn.ppf(values, *arg), *arg),
                            values, decimal=DECIMAL,
                            err_msg=msg + ' - cdf-ppf roundtrip')

Thus, I changed the values into an array. It should fail on the first value, as it is negative, but I get a pass. Specifically, I ran:

nicky@chuck:~/prog/scipy/scipy/stats/tests$ python test_continuous_basic.py
..............................................................................................................................
----------------------------------------------------------------------
Ran 126 tests in 93.990s

OK
Weird result. If I add a q = 1.0000001 I get a fail on the fourth test, as expected.
- repair for the cases q = 0 and q = 1 by means of an explicit test.
isn't ppf (generic part) taking care of this, if not then it should, I think
Actually, from the code in lines: https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L152... I am inclined to believe you. However, in view of the above test ... Might it be that the conditions on L1529 have been added quite recently, and did not yet make it to my machine? I'll check this right now....As a matter of fact, my distributions.py contains the same check, i.e., cond1 = (q > 0) & (q < 1) . Hmmm. Now I admit that I do not understand in all nitty-gritty detail the entire implementation of ppf(), but I suspect that this is a bug.
ppf(0) = self.a
ppf(1) = self.b
Good idea. I'll implement the code in my branch, and do a pull request.
On Mon, May 14, 2012 at 2:45 PM, nicky van foreest <vanforeest@gmail.com> wrote:
roundtrip: looks like ppf should be ok, but cdf is not
>>> stats.norm.ppf(-1e-5)
nan
>>> stats.norm.cdf(np.nan)
0.0
>>> stats.norm.cdf(stats.norm.ppf(-1e-5))
0.0
I'm using scipy 0.9, but I don't think this has changed, not that I know of.
I'm trying to track down when this got changed. (github doesn't show changes in a file that has too many changes, need to dig out git)
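The failed roundtrip above hinges on nan propagation: ppf(-1e-5) is nan, and cdf(nan) has to stay nan rather than being treated like -inf. A stand-in normal cdf (not scipy's code) illustrating the desired behavior:

```python
import math

def norm_cdf(x):
    # propagate invalid input instead of returning 0
    if math.isnan(x):
        return float('nan')
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(norm_cdf(float('nan')))   # nan
print(norm_cdf(0.0))            # 0.5
```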
The ppf(0) = self.a, ppf(1) = self.b behavior already looks correct in the generic ppf code:
>>> stats.beta.ppf(0, 0.5)
0.0
>>> stats.beta.a
0.0
Josef
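A rough sketch of how the generic ppf can clamp these boundaries with masks, following the quoted cond1 = (q > 0) & (q < 1) pattern (the function and the exponential quantile stand-in are illustrative, not scipy's actual code):

```python
import numpy as np

def generic_ppf(q, a, b, _ppf):
    # invalid q (q < 0 or q > 1) stays nan
    q = np.asarray(q, dtype=float)
    out = np.full(q.shape, np.nan)
    out[q == 0] = a                  # ppf(0) = self.a
    out[q == 1] = b                  # ppf(1) = self.b
    inside = (q > 0) & (q < 1)       # the cond1-style mask
    out[inside] = _ppf(q[inside])
    return out

# exponential quantile function as the stand-in _ppf
res = generic_ppf([0.0, 0.5, 1.0, -1e-5], 0.0, np.inf,
                  lambda q: -np.log1p(-q))
print(res)   # [0.0, ~0.693, inf, nan]
```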
On Mon, May 14, 2012 at 3:51 PM, <josef.pktd@gmail.com> wrote:
It would be better to run the same version as the code you are looking at. It's difficult to find the bug or understand the behavior if it's not there anymore.

Switching to scipy 0.10:
>>> stats.norm.cdf(np.nan)
nan
>>> scipy.__version__
'0.10.0b2'
nan propagation is not available in 0.9.0:
https://github.com/scipy/scipy/commit/96e39ecc6a2b671ed7f99a9c0375adc9238c60...

Josef
Yes, you're right. I am trying to use the right version of scipy and stats, but I first have to figure out how to do that.

Nicky
On Mon, May 14, 2012 at 4:46 PM, nicky van foreest <vanforeest@gmail.com> wrote:
Yes, you're right. I am trying to use the right version of scipy and stats, but I first have to figure out how to that.
It was also directed at myself, I spent half an hour staring at the online code looking for the problem that wasn't there. :)
I switch python versions when I want to switch scipy versions. (python 2.5 with scipy 0.7, ...)

Josef
As you might have seen in another set of mails, I am up and running again:

nicky@chuck:~$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy
>>> scipy.__version__
'0.11.0.dev-85c0992'
I switch python versions when I want to switch scipy versions. (python 2.5 with scipy 0.7, ...)
I don't know how to do this. Let's see how I fare with the approach of pulling scipy, compiling, and implementing changes. Tomorrow I'll try to add the ppf code. I plan to change these lines:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L118...
In case this is wrong, please let me know.
Ok. I am convinced now. I'll try to write this in a good and generic way.
participants (2)
- josef.pktd@gmail.com
- nicky van foreest