
All, I'm puzzled by this result:
>>> from scipy.stats.distributions import nbinom
>>> nbinom(.3,.15).ppf(nbinom(.3,.15).cdf(np.arange(20)))
array([  1.,   2.,   2.,   3.,   5.,   5.,   6.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.])
I would have naturally expected np.arange(20). Using np.round instead of np.ceil in nbinom_gen._ppf seems to solve the issue. Is there any reason not to do it?

Thanks in advance,
P.

On Mon, Jul 27, 2009 at 14:47, Pierre GM<pgmdevlist@gmail.com> wrote:
All, I'm puzzled by this result:

>>> from scipy.stats.distributions import nbinom
>>> nbinom(.3,.15).ppf(nbinom(.3,.15).cdf(np.arange(20)))
array([  1.,   2.,   2.,   3.,   5.,   5.,   6.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.])
I would have naturally expected np.arange(20).
Floating point shenanigans. The CDF and PPF of discrete distributions are step functions. Round-tripping involves evaluating the PPF around the step. Naturally, floating point errors are going to nudge you to the left or right of that step.
Using np.round instead of np.ceil in nbinom_gen._ppf seems to solve the issue. Is there any reason not to do it?
ceil() is correct; round() is not. round() would be okay if the only inputs were expected to be outputs of the CDF, but one frequently needs the PPF to accept any value in [0,1], for example when doing random number generation by inversion.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
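[Editor's note: to illustrate why the PPF must accept any value in [0,1], here is a minimal sketch of sampling by inversion; the distribution and parameters are illustrative, not from the thread.]

    import numpy as np
    from scipy.stats import nbinom

    # Inversion sampling: u is uniform on [0, 1) and almost never an
    # exact CDF value, so the PPF must return the smallest integer k
    # with cdf(k) >= u.  ceil() implements exactly that step function;
    # round() would shift every step boundary by half a unit and give
    # the wrong integer for many draws.
    u = np.random.uniform(size=5)
    samples = nbinom(0.3, 0.15).ppf(u)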

On Mon, Jul 27, 2009 at 3:55 PM, Robert Kern<robert.kern@gmail.com> wrote:
ceil() is correct; round() is not. round() would be okay if the only inputs were expected to be outputs of the CDF, but one frequently needs the PPF to accept any value in [0,1], for example when doing random number generation by inversion.
Do you think it would be useful to round if we are within some epsilon of the next integer? It is more likely that users have a case like Pierre's, where the answer is exactly the closest integer, than one where it is truly an epsilon below. It would move the floating point error, but to a location that is less likely to matter in actual usage.

I think it was in scipy.special that I saw some code that treats anything within 1e-8 (?) of an integer as that integer.

Josef
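[Editor's note: a sketch of the snap-to-nearest-integer idea Josef describes; the function name and eps default are hypothetical, not an actual scipy API.]

    import numpy as np
    from scipy import special

    def ppf_with_snap(q, n, pr, eps=1e-8):
        # Invert the CDF as a continuous function of k, then snap to
        # the nearest integer when within eps of it; otherwise take
        # the ceiling as the current implementation does.
        raw = special.nbdtrik(q, n, pr)
        nearest = np.round(raw)
        return np.where(np.abs(raw - nearest) < eps, nearest, np.ceil(raw))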

On Mon, Jul 27, 2009 at 15:29, <josef.pktd@gmail.com> wrote:
Do you think it would be useful to round if we are within some epsilon of the next integer? It is more likely that users have a case like Pierre's, where the answer is exactly the closest integer, than one where it is truly an epsilon below. It would move the floating point error, but to a location that is less likely to matter in actual usage.

I think it was in scipy.special that I saw some code that treats anything within 1e-8 (?) of an integer as that integer.
I think the code after the first line is an attempt to remedy this situation:

    def _ppf(self, q, n, pr):
        vals = ceil(special.nbdtrik(q,n,pr))
        vals1 = vals-1
        temp = special.nbdtr(vals1,n,pr)
        return where(temp >= q, vals1, vals)

It is possible that "temp >= q" should be replaced with "temp >= q-eps".
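[Editor's note: an annotated, self-contained restatement of the pattern above; the comments are the editor's reading of the code, not from the thread.]

    import numpy as np
    from scipy import special

    def _ppf_sketch(q, n, pr):
        # nbdtrik inverts the CDF as a continuous function of k, so
        # ceil() can land one integer too high when floating point
        # error pushes the continuous inverse just past an integer.
        vals = np.ceil(special.nbdtrik(q, n, pr))
        vals1 = vals - 1
        # CDF one integer down; if it already reaches q, the ceiling
        # overshot by one step and we back off.
        temp = special.nbdtr(vals1, n, pr)
        return np.where(temp >= q, vals1, vals)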

On Jul 27, 2009, at 4:58 PM, Robert Kern wrote:
ceil() is correct; round() is not. round() would be okay if the only inputs were expected to be outputs of the CDF, but one frequently needs the PPF to accept any value in [0,1], for example when doing random number generation by inversion.
Fair enough, but a bit frustrating nevertheless in this particular case.
I think the code after the first line is an attempt to remedy this situation:
    def _ppf(self, q, n, pr):
        vals = ceil(special.nbdtrik(q,n,pr))
        vals1 = vals-1
        temp = special.nbdtr(vals1,n,pr)
        return where(temp >= q, vals1, vals)
It is possible that "temp >= q" should be replaced with "temp >= q-eps".
Doesn't work in the case I was presenting, as temp is here an array of NaNs. Using

    vals = ceil(special.nbdtrik(q,n,pr)-eps)

seems to work well enough, provided that eps is around 1e-8 as Josef suggested.
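[Editor's note: Pierre's one-line change placed back into a full method for clarity; a sketch assuming the same _ppf structure quoted above.]

    import numpy as np
    from scipy import special

    def _ppf_eps(q, n, pr, eps=1e-8):
        # Shift the continuous inverse down by eps before ceil(), so a
        # value sitting a hair above an integer (pure floating point
        # noise) is pulled back down onto that integer.
        return np.ceil(special.nbdtrik(q, n, pr) - eps)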

On Mon, Jul 27, 2009 at 6:20 PM, Pierre GM<pgmdevlist@gmail.com> wrote:
Doesn't work in the case I was presenting, as temp is here an array of NaNs. Using

    vals = ceil(special.nbdtrik(q,n,pr)-eps)

seems to work well enough, provided that eps is around 1e-8 as Josef suggested.
I finally remembered that in the tests for the discrete distributions, and for histograms with integers, I always use a 1e-8 correction, or 1e-14, depending on which numerical error is more likely in the calculation. I'm usually not sure what the appropriate correction is.

special.nbdtrik doesn't have any documentation, so I don't even know what it is supposed to do.

The original failure case should make a nice roundtrip test for the discrete distributions, after the correction. Until now, I think, I have only used floating point values that are usually non-integers in the tests of the discrete distributions, not exact roundtrip tests as for the continuous distributions. I will test and change this as soon as possible.

Josef
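[Editor's note: a roundtrip test in the spirit Josef describes might look like the following sketch; the helper name and parameter values are illustrative.]

    import numpy as np
    from scipy import stats

    def check_discrete_roundtrip(dist, args, k):
        # ppf(cdf(k)) should reproduce k exactly at the integer
        # support points; nbinom(.3, .15) failed this before the fix.
        roundtrip = dist.ppf(dist.cdf(k, *args), *args)
        assert np.all(roundtrip == k), roundtrip

    check_discrete_roundtrip(stats.nbinom, (0.3, 0.15), np.arange(20))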

On Mon, Jul 27, 2009 at 18:10, <josef.pktd@gmail.com> wrote:
special.nbdtrik doesn't have any documentation, so I don't even know what it is supposed to do.
Just grep for it in the scipy/special sources. It's the inverse of nbdtr, hence the "i", inverting for the "k" parameter given "n" and "p".

The problem here is that we are using nbdtr() inside this function to compute temp, but nbdtr doesn't handle n<1. The _cdf() method uses betainc() instead. If one replaces these lines:

    vals1 = vals-1
    temp = special.nbdtr(vals1,n,pr)

with these:

    vals1 = (vals-1).clip(0.0, np.inf)
    temp = special.betainc(n,vals,pr)

then you get the desired answer.
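[Editor's note: Robert's replacement spliced into the full method, as a sketch. For the negative binomial, the CDF at k equals betainc(n, k+1, pr), so betainc(n, vals, pr) is the CDF evaluated at vals-1.]

    import numpy as np
    from scipy import special

    def _ppf_fixed(q, n, pr):
        vals = np.ceil(special.nbdtrik(q, n, pr))
        # clip so the step-down candidate never goes below 0
        vals1 = (vals - 1).clip(0.0, np.inf)
        # betainc handles n < 1, which nbdtr does not; this is the
        # CDF at vals - 1
        temp = special.betainc(n, vals, pr)
        return np.where(temp >= q, vals1, vals)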

On Mon, Jul 27, 2009 at 7:21 PM, Robert Kern<robert.kern@gmail.com> wrote:
The problem here is that we are using nbdtr() inside this function to compute temp, but nbdtr doesn't handle n<1. The _cdf() method uses betainc() instead. If one replaces these lines:
    vals1 = vals-1
    temp = special.nbdtr(vals1,n,pr)
with these:
    vals1 = (vals-1).clip(0.0, np.inf)
    temp = special.betainc(n,vals,pr)
then you get the desired answer.
That's better. It took me a while to understand the logic behind the way the ceiling error is corrected. The same pattern is also followed by the other discrete distributions that define a _ppf method. It is cleaner than the epsilon correction, but it takes longer to figure out what it does.

To make the logic easier to follow, and to stay DRY, it would be better to replace the duplicated _cdf code with a direct call to self._cdf. For example, in changeset 4673, Robert, you changed the _cdf method to use betainc instead of nbdtr, but not the _ppf method. Without the code duplication, partial corrections like that would be easier to avoid. Is there a reason not to call self._cdf instead?

A temporary workaround for this case is to add at least 1e-10 to the cdf in the roundtrip, to avoid the floating point ceiling error:
>>> nbinom(.3,.15).ppf(nbinom(.3,.15).cdf(np.arange(20))+1e-8)
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.])
>>> nbinom(.3,.15).ppf(nbinom(.3,.15).cdf(np.arange(20))+1e-10)
array([  1.,   2.,   3.,   4.,   5.,   5.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.])
>>> nbinom(.3,.15).ppf(nbinom(.3,.15).cdf(np.arange(20))+1e-11)
array([  1.,   2.,   2.,   4.,   5.,   5.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.])
Josef
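[Editor's note: a sketch of the self._cdf version Josef proposes, assuming the rv_discrete internals of the time; any later fix to _cdf would then propagate to _ppf automatically.]

    import numpy as np
    from scipy import special

    def _ppf(self, q, n, pr):
        vals = np.ceil(special.nbdtrik(q, n, pr))
        vals1 = np.maximum(vals - 1, 0)
        # reuse the distribution's own CDF instead of duplicating it
        temp = self._cdf(vals1, n, pr)
        return np.where(temp >= q, vals1, vals)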

On Mon, Jul 27, 2009 at 20:11, <josef.pktd@gmail.com> wrote:
Is there a reason not to call self._cdf instead?
Nope. Go ahead.

On Tue, Jul 28, 2009 at 12:56 PM, Robert Kern<robert.kern@gmail.com> wrote:
Nope. Go ahead.
for the record: dlaplace and boltzmann also fail the roundtrip test

    boltzmann (1.3999999999999999, 19)
    [ 1.  2.  3.  3.]
    [ 0.  1.  2.  3.]
    False True False [ 0.  3.]

    dlaplace (0.80000000000000004,)
    [-5. -4. -3. -2. -1.  1.  1.  2.  3.  4.  5.]
    [-5. -4. -3. -2. -1.  0.  1.  2.  3.  4.  5.]

Josef
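[Editor's note: assuming the arrays above are ppf(cdf(k)) followed by the expected k, and the tuples are the shape parameters, the failures can be reproduced with something like:]

    import numpy as np
    from scipy import stats

    # Round-trip k -> cdf -> ppf; before the fixes these did not
    # reproduce k (compare the two arrays printed above for each
    # distribution).
    k = np.arange(4)
    print(stats.boltzmann.ppf(stats.boltzmann.cdf(k, 1.4, 19), 1.4, 19))

    k = np.arange(-5, 6)
    print(stats.dlaplace.ppf(stats.dlaplace.cdf(k, 0.8), 0.8))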

On Sat, Aug 1, 2009 at 11:29 AM, <josef.pktd@gmail.com> wrote:
for the record: dlaplace and boltzmann also fail the roundtrip test
and planck:

>>> stats.planck.ppf(stats.planck.cdf(np.arange(10),.51),.51)
array([  1.,   1.,   2.,   3.,   4.,   5.,   7.,   7.,   8.,  10.])

Fixed in changeset 5889.

Josef
Participants (3):
- josef.pktd@gmail.com
- Pierre GM
- Robert Kern