A summary function to get a quick glimpse of the contents of a numpy array
Dear numpy devs and interested readers,

as a day-to-day user, it occurred to me that taking a quick look at the contents and extents of arrays is quite doable with numpy, which offers a rich set of methods for this. However, I very often observe, in myself and in others, that one just wants to see whether the values of an array have a certain min/max or mean, or how wide the range of values is.

I hence sat down to write a summary function that returns a string of hand-picked summary statistics for a quick inspection. I propose to include it in numpy and would love to have your feedback on this idea before I submit a PR. Here is the core functionality:

Examples

    >>> a = np.random.normal(size=20)
    >>> print(summary(a))
         min    25perc      mean     stdev    median    75perc       max
    2.289870  2.265757  0.083213  1.115033  0.162885  2.217532  1.639802
    >>> a = np.reshape(a, newshape=(4,5))
    >>> print(summary(a,axis=1))
            min    25perc      mean     stdev    median    75perc       max
    0  0.976279  0.974090  0.293003  1.009383  0.466814  0.969712  1.519695
    1  0.468854  0.467739  0.184139  0.649378  0.036762  0.465510  1.303144
    2  2.289870  2.276455  0.324450  1.230031  0.289008  2.249625  1.111107
    3  1.782239  1.777304  0.485546  1.259598  1.236190  1.767434  1.639802

So you see, it is merely a tiny helper function that can help practitioners and data scientists get a quick insight into what an array contains.

First off, here is the code:
https://github.com/psteinb/numpy/blob/summaryfunction/numpy/lib/utils.py#L1...

I put it there as I am not sure at this point whether the community would appreciate such a function. Judging from the tests, lib/utils.py appears to be a place for undocumented functions. So to resolve this and prepare a proper PR, please let me know where this summary function could reside!

Second, please give me your thoughts on the summary function's output. Should the number of digits be configurable? Should the columns be configurable? Is it OK to honor the axis parameter, which is found in so many numpy functions?

Last but not least, let me stress that this is my first contribution to numpy. I love the library and would like to contribute something back. So bear with me if my code violates best practices in your community for now. I'll sink my teeth into the formalities of a GitHub PR once I get support from the community and the core devs.

I think that a summary function would be a valuable addition to numpy!

Best,
Peter
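For readers who cannot follow the link, the behaviour described in the message can be approximated in a few lines. This is only a minimal sketch under stated assumptions, not the code from the linked branch: the `digits` parameter, the `COLUMNS` tuple and the exact text layout are hypothetical choices made to mirror the example output, and the percentiles are computed with `np.percentile` using `q=25` and `q=75`.

```python
import numpy as np

# Hypothetical column set, mirroring the headers in the example output.
COLUMNS = ("min", "25perc", "mean", "stdev", "median", "75perc", "max")


def summary(a, axis=None, digits=6):
    """Return a plain-text table of summary statistics (illustrative sketch)."""
    a = np.asarray(a, dtype=float)
    # Compute every statistic along the requested axis and put the
    # statistics themselves on the last axis of the stacked result.
    stats = np.stack(
        [
            np.min(a, axis=axis),
            np.percentile(a, 25, axis=axis),  # q is given in percent (0..100)
            np.mean(a, axis=axis),
            np.std(a, axis=axis),
            np.median(a, axis=axis),
            np.percentile(a, 75, axis=axis),
            np.max(a, axis=axis),
        ],
        axis=-1,
    )
    # One row per reduced slice; for inputs with more than two dimensions
    # the row label is simply the flattened index of the slice.
    rows = stats.reshape(-1, len(COLUMNS))

    width = digits + 6
    show_index = axis is not None
    header = (" " * 4 if show_index else "") + "".join(
        f"{name:>{width}}" for name in COLUMNS
    )
    lines = [header]
    for i, row in enumerate(rows):
        prefix = f"{i:<4}" if show_index else ""
        lines.append(prefix + "".join(f"{v:>{width}.{digits}f}" for v in row))
    return "\n".join(lines)


rng = np.random.default_rng(0)
a = rng.normal(size=20)
print(summary(a))
print(summary(a.reshape(4, 5), axis=1, digits=3))
```

Passing `axis` straight through to the underlying numpy reductions keeps the helper consistent with the rest of the library, and a single `digits` argument is one possible answer to the question about configurable precision.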
On Fri, Jul 31, 2020 at 1:40 PM Peter Steinbach <p.steinbach@hzdr.de> wrote:
Dear numpy devs and interested readers,
as a day-to-day user, it occurred to me that taking a quick look at the contents and extents of arrays is quite doable with numpy, which offers a rich set of methods for this. However, I very often observe, in myself and in others, that one just wants to see whether the values of an array have a certain min/max or mean, or how wide the range of values is.
I hence sat down to write a summary function that returns a string of hand-picked summary statistics for a quick inspection. I propose to include it in numpy and would love to have your feedback on this idea before I submit a PR. Here is the core functionality:
Examples

    >>> a = np.random.normal(size=20)
    >>> print(summary(a))
         min    25perc      mean     stdev    median    75perc       max
    2.289870  2.265757  0.083213  1.115033  0.162885  2.217532  1.639802
    >>> a = np.reshape(a, newshape=(4,5))
    >>> print(summary(a,axis=1))
            min    25perc      mean     stdev    median    75perc       max
    0  0.976279  0.974090  0.293003  1.009383  0.466814  0.969712  1.519695
    1  0.468854  0.467739  0.184139  0.649378  0.036762  0.465510  1.303144
    2  2.289870  2.276455  0.324450  1.230031  0.289008  2.249625  1.111107
    3  1.782239  1.777304  0.485546  1.259598  1.236190  1.767434  1.639802

So you see, it is merely a tiny helper function that can help practitioners and data scientists get a quick insight into what an array contains.
First off, here is the code:
https://github.com/psteinb/numpy/blob/summaryfunction/numpy/lib/utils.py#L1...
I put it there as I am not sure at this point whether the community would appreciate such a function. Judging from the tests, lib/utils.py appears to be a place for undocumented functions. So to resolve this and prepare a proper PR, please let me know where this summary function could reside!
This seems to be more the domain of scipy.stats and statsmodels. Statsmodels already does a good job with this; in SciPy there's stats.describe (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.describe.ht...), which is quite similar to what you're proposing. Could you think about whether scipy.stats.describe does what you want, and whether there's room to improve it (perhaps add a `__repr__` and/or a `_repr_html_` for pretty-printing)?

Cheers,
Ralf
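For comparison with the proposal, here is a short sketch of what `scipy.stats.describe` returns today. It reports the number of observations, min/max, mean, variance (with `ddof=1` by default), skewness and kurtosis, but no median or percentiles. The `format_describe` helper below is hypothetical and not part of SciPy; it only illustrates the kind of pretty-printing suggested above, and it assumes a 1-D input.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 5))

# describe() returns a named tuple with the fields
# nobs, minmax, mean, variance, skewness and kurtosis.
result = stats.describe(a, axis=1)
print(result.nobs)                # number of observations along the axis
print(result.minmax)              # tuple of (min array, max array)
print(result.mean)
print(np.sqrt(result.variance))   # sample standard deviation (ddof=1)


def format_describe(res, digits=6):
    """Hypothetical helper: render a 1-D describe() result as a small table."""
    names = ("nobs", "min", "max", "mean", "variance", "skewness", "kurtosis")
    values = (
        res.nobs,
        res.minmax[0],
        res.minmax[1],
        res.mean,
        res.variance,
        res.skewness,
        res.kurtosis,
    )
    width = digits + 6
    header = "".join(f"{name:>{width}}" for name in names)
    body = "".join(f"{float(v):>{width}.{digits}f}" for v in values)
    return header + "\n" + body


print(format_describe(stats.describe(a.ravel())))
```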
Second, please give me your thoughts on the summary function's output. Should the number of digits be configurable? Should the columns be configurable? Is it OK to honor the axis parameter, which is found in so many numpy functions?
Last but not least, let me stress that this is my first contribution to numpy. I love the library and would like to contribute something back. So bear with me if my code violates best practices in your community for now. I'll sink my teeth into the formalities of a GitHub PR once I get support from the community and the core devs.
I think that a summary function would be a valuable addition to numpy! Best, Peter
participants (2)

Peter Steinbach

Ralf Gommers