
Hi all, and happy new year!

I'm new to NumPy and searching for a way to compute, from a set of points (x,y), the mean of the y values associated with each distinct x value. Each point corresponds to a measure in a benchmark (x = parameter, y = computation time) and I'd like to plot the graph of mean computation time with respect to parameter values. (I know how to plot, but not how to compute the mean values.)

My points are stored as two arrays X, Y (same size). In pure Python, I'd do as follows:

s = {}  # sum of y values for each distinct x (as keys)
n = {}  # number of summed values (same keys)
for x, y in zip(X, Y):
    s[x] = s.get(x, 0.0) + y
    n[x] = n.get(x, 0) + 1
new_x = array(list(sorted(s)))
new_y = array([s[x]/n[x] for x in sorted(s)])

Unfortunately, this code is much too slow because my arrays have millions of elements. But I'm pretty sure that NumPy offers a way to handle this more elegantly and much faster.

As a bonus, I'd be happy if the solution also allowed me to compute standard deviation, min, max, etc.

Thanks in advance for any help!
Franck

On Tuesday 06 January 2009, Franck Pommereau wrote:
[...]
The following would do the trick:

In [92]: x = np.random.randint(100, size=100)
In [93]: y = np.random.rand(100)
In [94]: u = np.unique(x)
In [95]: means = [ y[x == i].mean() for i in u ]
In [96]: stds = [ y[x == i].std() for i in u ]
In [97]: maxs = [ y[x == i].max() for i in u ]
In [98]: mins = [ y[x == i].min() for i in u ]

and the data you want will be in the means, stds, maxs and mins lists. This approach has the drawback that you have to process the array each time you want to extract the desired info. If what you want is to always retrieve the same set of statistics, you can do it in one single loop:

In [99]: means, stds, maxs, mins = [], [], [], []
In [100]: for i in u:
   .....:     g = y[x == i]
   .....:     means.append(g.mean())
   .....:     stds.append(g.std())
   .....:     maxs.append(g.max())
   .....:     mins.append(g.min())
   .....:

which has the same effect as above, but is much faster.

Hope that helps,

--
Francesc Alted

Francesc Alted wrote:
[...]
If you use Knuth's one-pass approach (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#III._On-lin...) you can write a function to get the min, max, mean and variance/standard deviation in a single pass through the array rather than one pass for each. I do not know if this will provide any advantage, as that will probably depend on the size of the arrays.

Also, please use the highest precision possible (i.e., float128) for your arrays to minimize numerical error due to the size of your arrays.

Bruce
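For concreteness, a minimal sketch of such a one-pass function (the name and return convention are illustrative, not from the linked page):

import numpy

def one_pass_stats(a):
    """Min, max, mean and population variance of a in a single pass,
    using the on-line (Welford) update for the variance."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    lo = numpy.inf
    hi = -numpy.inf
    for v in a:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    var = m2 / n if n else 0.0
    return lo, hi, mean, var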

On Tuesday 06 January 2009, Franck Pommereau wrote:
s = {}  # sum of y values for each distinct x (as keys)
n = {}  # number of summed values (same keys)
for x, y in zip(X, Y):
    s[x] = s.get(x, 0.0) + y
    n[x] = n.get(x, 0) + 1
Maybe this is not so bad with a couple of changes?

from collections import defaultdict
from itertools import izip

s = defaultdict(float)  # sum of y values for each distinct x (as keys)
n = defaultdict(int)    # number of summed values (same keys)
for x, y in izip(X, Y):
    s[x] += y
    n[x] += 1

fwiw,
Alan Isaac

Hello,

Just thinking: if the parameters are limited, you may be able to use the histogram feature? Doing one histogram with Y as weights, then one without weights, and calculating the mean from this yourself should be pretty speedy, I imagine.

Other than that, maybe sort the whole thing and then do some searchsorted with side='right' and work on those slices. I mean something like this:

def spam(x, y, work_on_copy=False):
    """Take the arrays x and y and return unique_x_values, means,
    stds, maxs, mins as lists. means, stds, maxs and mins are those
    of the corresponding y values.

    If work_on_copy is true, x and y are copied to ensure that they
    are not sorted in place.
    """
    u, means, stds, maxs, mins = [], [], [], [], []
    s = x.argsort()
    if work_on_copy:
        x = x[s]
        y = y[s]
    else:
        x[:] = x[s]
        y[:] = y[s]
    start = 0
    value = x[0]
    while True:
        next = x.searchsorted(value, side='right')
        u.append(value)
        means.append(y[start:next].mean())
        stds.append(y[start:next].std())
        maxs.append(y[start:next].max())
        mins.append(y[start:next].min())
        if next == len(x):
            break
        value = x[next]
        start = next
    return u, means, stds, maxs, mins

This is of course basically the same as what Francesc suggested, but a quick test shows that it seems to scale better. I didn't try the speed of histogram.

Sebastian

On Tue, 2009-01-06 at 10:35 +0100, Franck Pommereau wrote:
[...]

Hi all,

First, let me say that I'm impressed: this mailing list is probably the most reactive I've ever seen. I asked my first question and immediately got more solutions than time to test them... Many thanks to all the answerers.

Using the various proposals, I ran two performance tests:

- test 1: 2000000 random values
- test 2: 1328724 values from my real use case

Here are the various functions and how they perform:

import numpy
import matplotlib.mlab
from collections import defaultdict
from itertools import izip

def f0(x, y):
    """Initial version.

    test 1 CPU times: 13.37s
    test 2 CPU times: 5.92s
    """
    s, n = {}, {}
    for a, b in zip(x, y):
        s[a] = s.get(a, 0.0) + b
        n[a] = n.get(a, 0) + 1
    return (numpy.array([a for a in sorted(s)]),
            numpy.array([s[a]/n[a] for a in sorted(s)]))

def f1(x, y):
    """Alan G Isaac <aisaac@american.edu>
    Modified in order to sort the result only once.

    test 1 CPU times: 10.86s
    test 2 CPU times: 2.78s

    defaultdict indeed speeds things up, and avoiding one of the
    two sorts probably helps as well.
    """
    s, n = defaultdict(float), defaultdict(int)
    for a, b in izip(x, y):
        s[a] += b
        n[a] += 1
    new_x = numpy.array([a for a in sorted(s)])
    return (new_x, numpy.array([s[a]/n[a] for a in new_x]))

def f2(x, y):
    """Francesc Alted <faltet@pytables.org>
    Modified with preallocation of the result array (it appeared
    faster).

    test 1: killed after more than 10 minutes
    test 2 CPU times: 22.01s

    This result is not surprising, as I guess a quadratic
    complexity: one pass for each unique value in x, and presumably
    one nested pass to compute y[x == i].
    """
    u = numpy.unique(x)
    m = numpy.empty(len(u))  # float array, so means are not truncated
    for pos, i in enumerate(u):
        g = y[x == i]
        m[pos] = g.mean()
    return u, m

def f3(x, y):
    """Sebastian Stephan Berg <sebastian@sipsolutions.net>
    Modified because I can always work in place.

    test 1 CPU times: 17.43s
    test 2 CPU times: 0.21s

    Adopted! This is definitely the fastest one when using real
    values. I tried to preallocate the arrays by setting
    u = numpy.unique(x) and then looping on u, but the result was
    slower, probably because of unique().

    Compared with f1, it's slower on large arrays of random values.
    This may be explained by a complexity argument: f1 has linear
    complexity (two passes in sequence) while f3 is probably
    N log N (one sort, then two passes to set x[:] and y[:], then
    one loop over each distinct value with a nested searchsorted
    that is probably logarithmic). But real values are far from
    random, so the sort is probably more efficient, and the while
    loop is shorter because there are fewer distinct values.
    """
    s = x.argsort()
    x[:] = x[s]
    y[:] = y[s]
    u, means, start, value = [], [], 0, x[0]
    while True:
        next = x.searchsorted(value, side='right')
        u.append(value)
        means.append(y[start:next].mean())
        if next == len(x):
            break
        value = x[next]
        start = next
    return numpy.array(u), numpy.array(means)

def f4(x, y):
    """Jean-Baptiste Rudant <boogaloojb@yahoo.fr>

    test 1 CPU times: 111.21s
    test 2 CPU times: 13.48s

    As Jean-Baptiste noticed, this solution is not very efficient
    (but works almost off-the-shelf).
    """
    recXY = numpy.rec.fromarrays((x, x), names='x, y')
    return matplotlib.mlab.rec_groupby(recXY, ('x',),
                                       (('y', numpy.mean, 'y_avg'),))

A few more remarks.

Sebastian Stephan Berg wrote:
Just thinking. If the parameters are limited, you may be able to use the histogram feature? Doing one histogram with Y as weights, then one without weights and calculating the mean from this yourself should be pretty speedy I imagine.
I'm afraid I don't know what the histogram function computes. But this may be worth investigating, because I think I'll need it later on in order to smooth my graphs (by plotting mean values over intervals).
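A minimal sketch of the trick Sebastian describes, assuming integer parameter values (the bin choice and names here are illustrative):

import numpy as np

# A histogram of x weighted by y gives per-bin sums of y; an
# unweighted histogram gives per-bin counts. Their ratio is the
# per-bin mean of y.
bins = np.arange(x.min(), x.max() + 2)  # one bin per integer value
sums, _ = np.histogram(x, bins=bins, weights=y)
counts, _ = np.histogram(x, bins=bins)
mask = counts > 0  # keep only the values that actually occur
mean_y = sums[mask] / counts[mask]
x_vals = bins[:-1][mask]

Bruce Southey wrote: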
If you use Knuth's one pass approach
(http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#III._On-lin...)
you can write a function to get the min, max, mean and variance/standard deviation in a single pass through the array rather than one pass for each. I do not know if this will provide any advantage, as that will probably depend on the size of the arrays.
If I understood well, this algorithm computes the variance of a whole array. I can see how to adapt it to compute the mean (already done by the algorithm), max, min, etc., but I did not see how it can be adapted to my case.
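Thinking aloud, one way it might be adapted is to keep one running state per distinct x value in a dict (an untested sketch, with illustrative names):

from collections import defaultdict

# per-key running state: [count, mean, M2, min, max]
state = defaultdict(lambda: [0, 0.0, 0.0, float('inf'), float('-inf')])
for a, b in zip(X, Y):
    st = state[a]
    st[0] += 1
    delta = b - st[1]
    st[1] += delta / st[0]        # running mean
    st[2] += delta * (b - st[1])  # Welford update of M2
    st[3] = min(st[3], b)
    st[4] = max(st[4], b)
# for each key a: mean = state[a][1], variance = state[a][2] / state[a][0]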
Also, please use the highest precision possible (ie float128) for your arrays to minimize numerical error due to the size of your arrays.
Thanks for the advice!

So, thank you again, everybody.

Cheers,
Franck

On Wed, Jan 7, 2009 at 6:37 AM, Franck Pommereau <pommereau@univ-paris12.fr> wrote:
def f4(x, y):
    """Jean-Baptiste Rudant <boogaloojb@yahoo.fr>

    test 1 CPU times: 111.21s
    test 2 CPU times: 13.48s

    As Jean-Baptiste noticed, this solution is not very efficient
    (but works almost off-the-shelf).
    """
    recXY = numpy.rec.fromarrays((x, x), names='x, y')
    return matplotlib.mlab.rec_groupby(recXY, ('x',),
                                       (('y', numpy.mean, 'y_avg'),))
This probably will have no impact on your tests, but this looks like a bug. You probably mean:

recXY = numpy.rec.fromarrays((x, y), names='x, y')

Could you post the code you use to generate your inputs (i.e., what is x)?

I will look into trying some of the suggestions here to improve the performance of rec_groupby. One thing that slows it down is that it supports an arbitrary number of keys -- e.g. groupby ('year', 'month') -- whereas the examples above use a single value lookup.

JDH
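To illustrate the multi-key case, a sketch of such a call (the field names and data here are made up for illustration):

import numpy
import matplotlib.mlab

# hypothetical records: one row per benchmark measurement
r = numpy.rec.fromarrays(
    ([2008, 2008, 2009, 2009], [1, 1, 1, 2], [0.5, 0.7, 0.6, 0.9]),
    names='year, month, time')

# group by (year, month), averaging time within each group
summary = matplotlib.mlab.rec_groupby(
    r, ('year', 'month'), (('time', numpy.mean, 'time_avg'),))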

This probably will have no impact on your tests, but this looks like a bug. You probably mean:
recXY = numpy.rec.fromarrays((x, y), names='x, y')
Sure! Thanks.
Could you post the code you use to generate you inputs (ie what is x?)
My code is probably not usable by anyone but me, and I'm presently too busy to clean it up and add comments. But as soon as I am able to do so, I'll send you a usable version.

Cheers,
Franck
participants (6)

- Alan G Isaac
- Bruce Southey
- Francesc Alted
- Franck Pommereau
- John Hunter
- Sebastian Stephan Berg