mean_std function returning both mean and std
I created a solution for ENH: Computing std/var and mean at the same time, issue #23741. The solution can be found here: https://github.com/soundappraisal/numpy/tree/stdmeandev001
I still need to add tests and the solution does touch the implementation of var. But before starting a pull request I like to check whether mean_std is a welcome addition.
Also I was struggling with the internally needed format of the arrays containing mean and std and the format produced as a result. Which makes me uncertain whether the chosen solution is correct for all cases.
On Wed, 31 May 2023 19:15:12 0000 "Ronald van Elburg" r.a.j.van.elburg@hetnet.nl wrote:
I created a solution for ENH: Computing std/var and mean at the same time, issue #23741. The solution can be found here: https://github.com/soundappraisal/numpy/tree/stdmeandev001
I like the idea ... so often one calculates both of them and the second needs the first.
Steps to make this complete:  move resize of the mean array out of _mean_var and into the calling mean_std function (to reduce the impact of the code changes on existing functions)  establish whether numpy/core/_add_newdocs.py needs to be updated (What is the function of this file?)  add tests at numpy/core/tests/test_numeric.py  add tests that establish whether the specified out matrix returns as output (It is easy to make mistakes and introduce changes which are not in place.)
Should we add mean_var aswell?
Mean_var, mean_std and tests are now ready. (https://github.com/soundappraisal/numpy/tree/stdmeandev001)
Some decisions made during implementation:  the output shape of mean follows the output shape of the variance or the standard deviation. So it responds in the same way to the keepdims flag as the variance and the standard deviation.  the resizing of the mean is placed in _mean_var the overhead on the old functions std and var is minimal as they set mean_out to None.  the intermediate mean used can not be replaced with the mean produced by _mean as the output of the latter can not be broadcast to the incoming data.
On 2/6/23 13:09, Ronald van Elburg wrote:
Mean_var, mean_std and tests are now ready. (https://github.com/soundappraisal/numpy/tree/stdmeandev001)
Some decisions made during implementation:
 the output shape of mean follows the output shape of the variance or the standard deviation. So it responds in the same way to the keepdims flag as the variance and the standard deviation.
 the resizing of the mean is placed in _mean_var the overhead on the old functions std and var is minimal as they set mean_out to None.
 the intermediate mean used can not be replaced with the mean produced by _mean as the output of the latter can not be broadcast to the incoming data.
NumPyDiscussion mailing list  numpydiscussion@python.org To unsubscribe send an email to numpydiscussionleave@python.org https://mail.python.org/mailman3/lists/numpydiscussion.python.org/ Member address: matti.picus@gmail.com
For a previous discussion of a performant solution, see https://github.com/numpy/numpy/issues/13199. That issue is more about var but also touches on a paper that has a twopass solution for calculating mean and var together
Matti
On 2/6/23 13:41, Matti Picus wrote:
On 2/6/23 13:09, Ronald van Elburg wrote:
Mean_var, mean_std and tests are now ready. (https://github.com/soundappraisal/numpy/tree/stdmeandev001)
Some decisions made during implementation:  the output shape of mean follows the output shape of the variance or the standard deviation. So it responds in the same way to the keepdims flag as the variance and the standard deviation.  the resizing of the mean is placed in _mean_var the overhead on the old functions std and var is minimal as they set mean_out to None.  the intermediate mean used can not be replaced with the mean produced by _mean as the output of the latter can not be broadcast to the incoming data. _______________________________________________ NumPyDiscussion mailing list  numpydiscussion@python.org To unsubscribe send an email to numpydiscussionleave@python.org https://mail.python.org/mailman3/lists/numpydiscussion.python.org/ Member address: matti.picus@gmail.com
For a previous discussion of a performant solution, see https://github.com/numpy/numpy/issues/13199. That issue is more about var but also touches on a paper that has a twopass solution for calculating mean and var together
Matti
Ahh, I see that issue is mentioned in your issue https://github.com/numpy/numpy/issues/23741
Matti
I think I left those aspects of the implementation untouched. But having someone more experienced look at it is probably a good idea.
Aha, the unnecessary copy mentioned in the https://dbs.ifi.uniheidelberg.de/files/Team/eschubert/publications/SSDBM18.... paper is a copy of the input. Here it is about discarding a valuable output (the mean) and then calculating that result separately.
Not throwing the mean away saves about 20% computation time. Or phrased differently the calculation of the variance spends about a 25% of the computation time on calculating the mean.
On Fri, 02 Jun 2023 11:47:14 0000 "Ronald van Elburg" r.a.j.van.elburg@hetnet.nl wrote:
Aha, the unnecessary copy mentioned in the https://dbs.ifi.uniheidelberg.de/files/Team/eschubert/publications/SSDBM18.... paper is a copy of the input. Here it is about discarding a valuable output (the mean) and then calculating that result separately.
I have been working a lot around this publication and I found it very interesting. Nevertheless, I believe there is a bug when dealing with weighted averages (eq22) ... but we can discuss offline about it. None of the author did answer to my comments.
Since the PR is about unweighted means/std, the math exposed there are (very likely) correct.
Cheers,
I had a closer look at the paper. When I have more brain and time I may check the mathematics. The focus is however more on streaming data, which is an application with completely different demands. I think that here we can not afford to sample the data, which is an option in streaming database systems.
On Fri, 02 Jun 2023 21:42:51 0000 "Ronald van Elburg" r.a.j.van.elburg@hetnet.nl wrote:
I had a closer look at the paper. When I have more brain and time I may check the mathematics. The focus is however more on streaming data, which is an application with completely different demands. I think that here we can not afford to sample the data, which is an option in streaming database systems.
To be more precise, the "bug" I spotted is not in the math, per se, but in some the definitions which prevent some simplifaction later on ...
In table 1, I believe the weights should be squared when calculating (co)variance VWp and one should normalize with the sum of omega squared (Eq 13) ... later on the simplication leading to Eq 21 and 22 no not take take place anylonger :(
The demonstration for the provided formula to be wrong is simple: multiply all weights by any (very large or very small) number modifies much the variance and it should not.
Cheers,
Jerome
On Fri, Jun 2, 2023 at 1:51 PM Ronald van Elburg r.a.j.van.elburg@hetnet.nl wrote:
Aha, the unnecessary copy mentioned in the https://dbs.ifi.uniheidelberg.de/files/Team/eschubert/publications/SSDBM18.... paper is a copy of the input. Here it is about discarding a valuable output (the mean) and then calculating that result separately.
Not throwing the mean away saves about 20% computation time. Or phrased differently the calculation of the variance spends about a 25% of the computation time on calculating the mean.
I'm not sure I find that a compelling benefit for introducing dedicated functions. The work on moving the implementation to C showed >2x speedups ( https://github.com/numpy/numpy/issues/13199#issuecomment478305730). Is there a reason for not trying to get that much larger gain merged first or instead?
Cheers, Ralf
I am agnostic to the order of those changes. Also this is my first attempt to contribute to numpy, so I am not aware of all the ongoing discussions. I'll try to read the issue you just mentioned.
But in the code I rewrote replacing _mean_var with a faster version would benefit var, std, mean_var and mean_std because they all call _mean_var.
The mean function is untouched.
I had a look at Csolution, it delegates the summation over one axis from the axis tuple to the Chelper. And then the remaining axes are summed from _methods.py. Worst case: if the axis delegated to helper is very short compared to the other axes I would expect hardly any speedup, and savings on memory usage would also be limited.
Sticking with this solution it would be a better from the point of view of speed and memory use to delegate the longest axis from the axis tuple to Ccode.
In my view a solution with which many would be happier (https://github.com/numpy/numpy/pull/13263#issuecomment1048122467) would probably delegate all the axes to the helper function.
I had a look at what it would take to improve the Csolution. However I find that it is beyond my Cprogramming skils.
The gufunc defintion seems to be at odds with current working of the axis keyword for mean, std and var. The latter support computation over multiple axes, whereas the gufunc only seems to support calculation over a single axis. As the behaviour of std, mean and var is largely inherited from ufuncs those might offer a better starting point.
If the operator used in the ufunc could take a parameter from the outer_loop accessing in this case the mean, then it would be possible to calculate the required intermediate quantities. This should be a possibility as somewhere the out array is also accessed in the correct manner and we should step through both arrays in the same way.
Instead of:
'''Pseudocode result = np.full(result_shape, op.identity) # op = ufunc
loop_outer_axes_result_array: loop_over_inner_axes_input_array: result[outer_axes] = op(result[outer_axes], InArray[outer_axes + inner_axes]) '''
we would then get:
'''Pseudocode result = np.full(result_shape, op.identity) # op = ufunc
loop_outer_axes_result_array: loop_over_inner_axes_input_array: result[outer_axes] = op(result[outer_axes], InArray[outer_axes + inner_axes], ParameterArray[outer_axes]) '''
Using for op:
'''Pseudocode op(a,b,c) = a+bc '''
and for b the original data and for c the mean (M_1) you would obtain the Neely correction for the mean.
Similarly using:
'''Pseudocode op(a,b,c) = a+(bc)^2 ''' you would obtain the sum of squared errors.
Note: the suggested solution requires no allocation of memory beyond that needed for storing the result.
I have a pull request, but I am stuck for a day now on how to handle the masked arrays.
I made some progress by calling the MaskedArray methods, but in some cases those methods call back the ndarray methods via their super class. The method _mean_var for ndarray need to resize the produced mean to align the shape of the mean and variance or standard deviation, but if the incoming and therefore the outgoing object is a MaskedArray that is not allowed.
Also I sometimes see some uppredictable behavior which gives me the feeling I am looking at pointer problems.
python runtests.py t numpy/core/tests/test_numeric.py passes now
python runtests.py t numpy/ma/tests/ is fialing with weird erros on complex masked arrays, particularly: test_varstd test_complex
Issue #23896 is the cause of these two failing tests.
With CFLAGS="NPY_DISABLE_OPTIMIZATION=1" the tests pass.
Second attempt after the triage review of last week: ENH: add mean keyword to std and var #24126 (https://github.com/numpy/numpy/pull/24126)
On a somewhat related note, I usually find I need to compute stats incrementally. To do this, a stat object is created so batches of samples can be fed to it sequentially.
I used to use an implementation based on boost::accumulator for this. More recently I'm using my own c++ code based on xtensor, exposed to python with xtensorpython and pybind11.
The basic technique to find 2nd order stats is to keep 2 running sums, sum(x) and sum(x**2).
It would be useful to have functionality for incremental stats like this in numpy, as well as other incremental operations (e.g., histogram). I frequently find I need to process large amounts of data in small batches at a time, generated by iterative montecarlo simulations, for example.
participants (5)

Jerome Kieffer

Matti Picus

Neal Becker

Ralf Gommers

Ronald van Elburg