scipy.stats: some questions/points about distributions.py + reply on ticket 1493
Hi Josef, Sorry for not responding earlier... too many obligations. Before I get back to your earlier mail, I have some naive questions about distributions.py. I hope you don't mind that I fire them at you. 1: https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L436 Is this code "dead"? Within distributions.py it is not called. Nearly the same code is written here: https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L118... 2: I have a similar point about: https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L358 What is the use of this code? It is not called anywhere. Besides this, from our discussion about ticket 1493, this function returns the centralized moments, while the "real" moment E(X^n) should be returned. Hence, the code is also not correct, i.e., not in line with the documentation. 3: Suppose we would turn xa and xb into private atrributes _xa and _xb, then i suppose that https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L883 requires updating. 4: I have a hard time understanding the working (and goal) of https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L883 Where is the right place to ask for some clarification? Or should I just think harder? 5: The definition of arr in https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L60 does not add much (although it saves some characters at some points of the code), but makes it harder to read the code for novices like me. (I spent some time searching for a numpy function called arr, only to find out later that it was just a shorthand only used in the distribution.py module). Would it be a problem to replace such code by the proper numpy function? 6: https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L538 contains a typo. It should be Weisstein. 7: https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L625 This code gives me even a harder time than _argsreduce. I have to admit that I simply don't know what this code is trying to prevent/check/repair. Would you mind giving a hint? Nicky
On Wed, Apr 25, 2012 at 2:12 PM, nicky van foreest <vanforeest@gmail.com> wrote:
Hi Josef,
Sorry for not responding earlier... too many obligations.
Before I get back to your earlier mail, I have some naive questions about distributions.py. I hope you don't mind that I fire them at you.
1:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L436
I never looked at this. It's not used anywhere.
Is this code "dead"? Within distributions.py it is not called. Nearly the same code is written here:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L118...
This is what is used for the generic ppf.
2:
I have a similar point about:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L358
What is the use of this code? It is not called anywhere. Besides this, from our discussion about ticket 1493, this function returns the centralized moments, while the "real" moment E(X^n) should be returned. Hence, the code is also not correct, i.e., not in line with the documentation.
I think this and skew, kurtosis are internal functions for fit_start, getting starting values for fit from the data, even if it's not used. in general: For the calculations it might sometimes be nicer to calculate central moments, and then convert them to non-central or the other way around. I have some helper functions for this in statsmodels and it is similarly used https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L174... (That's new code that I'm not so familiar with.)
3:
Suppose we would turn xa and xb into private atrributes _xa and _xb, then i suppose that
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L883
requires updating.
Yes, but no big loss I think, given that it won't be needed anymore
4:
I have a hard time understanding the working (and goal) of
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L883
This ? xb : float, optional Upper bound for fixed point calculation for generic ppf.
Where is the right place to ask for some clarification? Or should I just think harder?
5:
The definition of arr in
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L60
does not add much (although it saves some characters at some points of the code), but makes it harder to read the code for novices like me. (I spent some time searching for a numpy function called arr, only to find out later that it was just a shorthand only used in the distribution.py module). Would it be a problem to replace such code by the proper numpy function?
But then these novices would just read some piece code instead of going through all 7000 lines looking for imports and redefinitions. And I suffered the same way. :) I don't have any problem with cleaning this up. I never checked if in some cases with lot's of generic loops the namespace lookup would significantly increase the runtime.
6:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L538
contains a typo. It should be Weisstein.
should be fixed then
7:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L625
This code gives me even a harder time than _argsreduce. I have to admit that I simply don't know what this code is trying to prevent/check/repair. Would you mind giving a hint?
whats _argsreduce? https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L625 This has been rewritten by Per Brodtkorb. It is used in most methods to get the goodargs with which the distribution specific method is called. example ppf https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L152... first we are building the conditions for valid, good arguments. boundaries are filled, invalid arguments get nans. What's left over are the goodargs, the values of the method arguments for which we need to calculate the actual results. So we need to broadcast and select those arguments. -> argsreduce The distribution specific or generic ._ppf is then called with 1d arrays (of the same shape IIRC) of goodargs. then we can "place" the calculated values into the results arrays, next to the nans and boundaries. I hope that helps Thanks, Josef
Nicky _______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
1:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L436
I never looked at this. It's not used anywhere.
Is this code "dead"? Within distributions.py it is not called. Nearly the same code is written here:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L118...
This is what is used for the generic ppf.
Yes, sure. Sorry for confusing you. L1180 makes good sense. But since L1180 is there, there appears to be no good reason to include the code at L436.
2:
I have a similar point about:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L358
What is the use of this code? It is not called anywhere. Besides this, from our discussion about ticket 1493, this function returns the centralized moments, while the "real" moment E(X^n) should be returned. Hence, the code is also not correct, i.e., not in line with the documentation.
I think this and skew, kurtosis are internal functions for fit_start, getting starting values for fit from the data, even if it's not used. in general: For the calculations it might sometimes be nicer to calculate central moments, and then convert them to non-central or the other way around. I have some helper functions for this in statsmodels and it is similarly used
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L174...
(That's new code that I'm not so familiar with.)
I actually saw this code, and have my doubts about whether this is the best way to compute the non-central moments. Suppose that the computation of the central moment involves quad(). Then indeed the computations at these lines don't require a new call to quad(). However, there is a (slow) python for loop involved, the power function ** is called multiple times, and { n \choose k} is computed. (BTW, can I safely assume you use Latex?). Calling quad() on x**k to compute E(X^k) might be just a fast, although I did not test this hunch. Anyway quad( lamdba x: x**k *_pdf(x)) reads much easier.
3:
Suppose we would turn xa and xb into private atrributes _xa and _xb, then i suppose that
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L883
requires updating.
Yes, but no big loss I think, given that it won't be needed anymore
Oops. Your other mail convinced to do use _xa and _xb.... See my other mail.
5:
The definition of arr in
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L60
does not add much (although it saves some characters at some points of the code), but makes it harder to read the code for novices like me. (I spent some time searching for a numpy function called arr, only to find out later that it was just a shorthand only used in the distribution.py module). Would it be a problem to replace such code by the proper numpy function?
But then these novices would just read some piece code instead of going through all 7000 lines looking for imports and redefinitions. And I suffered the same way. :)
I suppose you did :-)
I don't have any problem with cleaning this up. I never checked if in some cases with lot's of generic loops the namespace lookup would significantly increase the runtime.
Should it? I am not an expert on this, but I read in Langtangen's book that importing functions like so: from numpy import array, and so on, does not add much to the calling time of functions. However, if I am mistaken, please forget this point.
6:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L538
contains a typo. It should be Weisstein.
should be fixed then
Should this become a ticket, or is it too minor?
7:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L625
This code gives me even a harder time than _argsreduce. I have to admit that I simply don't know what this code is trying to prevent/check/repair. Would you mind giving a hint?
whats _argsreduce?
Sorry, I meant __argcheck(). The code at https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L119... is not very simple to understand, at least not for me.
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L625
This has been rewritten by Per Brodtkorb. It is used in most methods to get the goodargs with which the distribution specific method is called.
example ppf https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L152...
first we are building the conditions for valid, good arguments. boundaries are filled, invalid arguments get nans. What's left over are the goodargs, the values of the method arguments for which we need to calculate the actual results. So we need to broadcast and select those arguments. -> argsreduce The distribution specific or generic ._ppf is then called with 1d arrays (of the same shape IIRC) of goodargs.
then we can "place" the calculated values into the results arrays, next to the nans and boundaries.
I hope that helps
I'll try to understand it again. Thanks for your hints.
Thanks,
Josef
Nicky _______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
On Wed, Apr 25, 2012 at 4:03 PM, nicky van foreest <vanforeest@gmail.com> wrote:
1:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L436
I never looked at this. It's not used anywhere.
Is this code "dead"? Within distributions.py it is not called. Nearly the same code is written here:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L118...
This is what is used for the generic ppf.
Yes, sure. Sorry for confusing you. L1180 makes good sense. But since L1180 is there, there appears to be no good reason to include the code at L436.
I guess not, but because I never looked at it carefully, I don't know if it might be useful for anything.
2:
I have a similar point about:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L358
What is the use of this code? It is not called anywhere. Besides this, from our discussion about ticket 1493, this function returns the centralized moments, while the "real" moment E(X^n) should be returned. Hence, the code is also not correct, i.e., not in line with the documentation.
I think this and skew, kurtosis are internal functions for fit_start, getting starting values for fit from the data, even if it's not used. in general: For the calculations it might sometimes be nicer to calculate central moments, and then convert them to non-central or the other way around. I have some helper functions for this in statsmodels and it is similarly used
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L174...
(That's new code that I'm not so familiar with.)
I actually saw this code, and have my doubts about whether this is the best way to compute the non-central moments. Suppose that the computation of the central moment involves quad(). Then indeed the computations at these lines don't require a new call to quad(). However, there is a (slow) python for loop involved, the power function ** is called multiple times, and { n \choose k} is computed. (BTW, can I safely assume you use Latex?). Calling quad() on x**k to compute E(X^k) might be just a fast, although I did not test this hunch. Anyway quad( lamdba x: x**k *_pdf(x)) reads much easier.
L1745 are necessary now because moments are still non-central, but allow for non default loc and scale. _munp still does the raw quad( lamdba x: x**k *_pdf(x)) or similar if it is not specifically defined, i.e. can be calculated from explicit _stats. as aside: whenever I try to go through generic _stats and moments I have a hard time to follow in what is going on in all the different cases I don't see where we could save much. I didn't go through the math.
3:
Suppose we would turn xa and xb into private atrributes _xa and _xb, then i suppose that
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L883
requires updating.
Yes, but no big loss I think, given that it won't be needed anymore
Oops. Your other mail convinced to do use _xa and _xb.... See my other mail.
not needed as public method, See the other thread which would also need adjustments to the docstring
5:
The definition of arr in
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L60
does not add much (although it saves some characters at some points of the code), but makes it harder to read the code for novices like me. (I spent some time searching for a numpy function called arr, only to find out later that it was just a shorthand only used in the distribution.py module). Would it be a problem to replace such code by the proper numpy function?
But then these novices would just read some piece code instead of going through all 7000 lines looking for imports and redefinitions. And I suffered the same way. :)
I suppose you did :-)
I don't have any problem with cleaning this up. I never checked if in some cases with lot's of generic loops the namespace lookup would significantly increase the runtime.
Should it? I am not an expert on this, but I read in Langtangen's book that importing functions like so: from numpy import array, and so on, does not add much to the calling time of functions. However, if I am mistaken, please forget this point.
from numpy import asarray or from numpy import asarray as arr doesn't make any difference, but I usually like full namespaces np.asarray
6:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L538
contains a typo. It should be Weisstein.
should be fixed then
Should this become a ticket, or is it too minor?
too minor for a ticket, but sneaking it into a pull request, or making a separate pull request (to increase your Karma) would be useful
7:
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L625
This code gives me even a harder time than _argsreduce. I have to admit that I simply don't know what this code is trying to prevent/check/repair. Would you mind giving a hint?
whats _argsreduce?
Sorry, I meant __argcheck(). The code at
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L119...
is not very simple to understand, at least not for me.
AFAICS, It's just a joint condition that all args are strictly positive ( the default condition) args[0] > 0 & args[1] > 0 & .... my guess is the arr, asarray, is not necessary, and should be handled already in the main methods. Josef
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L625
This has been rewritten by Per Brodtkorb. It is used in most methods to get the goodargs with which the distribution specific method is called.
example ppf https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L152...
first we are building the conditions for valid, good arguments. boundaries are filled, invalid arguments get nans. What's left over are the goodargs, the values of the method arguments for which we need to calculate the actual results. So we need to broadcast and select those arguments. -> argsreduce The distribution specific or generic ._ppf is then called with 1d arrays (of the same shape IIRC) of goodargs.
then we can "place" the calculated values into the results arrays, next to the nans and boundaries.
I hope that helps
I'll try to understand it again.
Thanks for your hints.
Thanks,
Josef
Nicky _______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
L1745 are necessary now because moments are still non-central, but allow for non default loc and scale. _munp still does the raw quad( lamdba x: x**k *_pdf(x)) or similar if it is not specifically defined, i.e. can be calculated from explicit _stats.
as aside: whenever I try to go through generic _stats and moments I have a hard time to follow in what is going on in all the different cases
I don't see where we could save much. I didn't go through the math.
Perhaps we should postpone the problems with the moments computations for the moment. I'll come across this point in due time and bug you again about it.
from numpy import asarray or from numpy import asarray as arr doesn't make any difference, but I usually like full namespaces np.asarray
About replacing arr by asarray: do I get it right that you change this? Or should I try making a pull request for this?
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L538
contains a typo. It should be Weisstein.
should be fixed then
Should this become a ticket, or is it too minor?
too minor for a ticket, but sneaking it into a pull request, or making a separate pull request (to increase your Karma) would be useful
I just got a local version of scipy and wrote a patch for this. For today I'll try to turn it into a pull request. As this is my first attempt, I have no idea how long it will take to set everything up. The spin off will be, hopefully, that I can add to the code at other places too. I suppose you will see the patch appearing. Nicky
participants (2)
-
josef.pktd@gmail.com -
nicky van foreest