[Tutor] Change datatype for specific columns in a 2D array & computing the mean
Oscar Benjamin
oscar.j.benjamin at gmail.com
Mon Jan 25 09:49:26 EST 2016
On 25 January 2016 at 13:14, Peter Otten <__peter__ at web.de> wrote:
>> What do you mean by "group rows"?
>
> Given a table you can specify columns as keys and in the simplest case one
> column where you apply an aggregate function over the sets of rows with the
> same key.
>
> If I understand you correctly you are doing that for a single known key,
> whereas I considered finding the keys part of the task. In SQL you'd spell
> that
>
> [prob 1]
> select key1, key2, sum(value) from some_table group by key1, key2;
>
>> I thought the OP's problem is really to filter rows which I already
>> showed how to do in numpy.
>
> You solve
>
> [prob 2]
> select sum(value) from some_table where key1=? and key2=?;
>
> You'll eventually get from [prob 2] to [prob 1], but you need a few lines of
> Python.
Oh okay. It wasn't clear to me that was what the OP wanted.
You can do it in numpy by combining a few pieces. First we define an array:
In [1]: import numpy as np
In [2]: a = np.array([[5, 1.0], [1, 3.0], [5, 2.0], [1, 4.0], [5, -1.0]])
In [3]: a
Out[3]:
array([[ 5.,  1.],
       [ 1.,  3.],
       [ 5.,  2.],
       [ 1.,  4.],
       [ 5., -1.]])
Now we need to sort the rows by the key column. (Note that np.sort with
axis=0 would sort each column independently and break the pairing
between keys and values, so we reorder whole rows with argsort
instead.)
In [4]: a = a[a[:, 0].argsort()]
In [5]: a
Out[5]:
array([[ 1.,  3.],
       [ 1.,  4.],
       [ 5.,  1.],
       [ 5.,  2.],
       [ 5., -1.]])
Now we can access the 2nd column easily:
In [6]: a[:, 1]
Out[6]: array([ 3.,  4.,  1.,  2., -1.])
But we want to split that column according to the first column. We can
use the split function if we know the indices, and we can get them with
diff:
In [7]: np.diff(a[:, 0])
Out[7]: array([ 0.,  4.,  0.,  0.])
In [14]: np.nonzero(np.diff(a[:, 0]))[0]
Out[14]: array([1])
In [15]: indices = np.nonzero(np.diff(a[:, 0]))[0] + 1
In [16]: indices
Out[16]: array([2])
Now we can use these to split the second column of the sorted array:
In [17]: grouped = np.split(a[:, 1], indices)
In [18]: grouped
Out[18]: [array([ 3.,  4.]), array([ 1.,  2., -1.])]
In [19]: list(map(np.mean, grouped))
Out[19]: [3.5, 0.6666666666666666]
It's not exactly straightforward, but numpy has all the primitives to
make this reasonably efficient. If we also want the list of keys then:
In [23]: a[np.concatenate([[0], indices]), 0]
Out[23]: array([ 1., 5.])
Altogether:
import numpy as np
a = np.array([[5, 1.0], [1, 3.0], [5, 2.0], [1, 4.0], [5, -1.0]])
a = a[a[:, 0].argsort()]  # sort whole rows by the key column
indices = np.nonzero(np.diff(a[:, 0]))[0] + 1  # group boundaries
means = list(map(np.mean, np.split(a[:, 1], indices)))
keys = a[np.concatenate([[0], indices]), 0]
group_means = dict(zip(keys, means))
print(group_means)  # {1.0: 3.5, 5.0: 0.6666666666666666}
If you want to key on multiple columns, use lexsort instead of a plain
sort and look for rows where any of the key columns changes in the diff
array, but otherwise it's the same principle.
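Here is a minimal sketch of that multi-key variant, using a made-up
array whose first two columns are keys and whose last column is the
value:

import numpy as np
# Made-up data: columns are key1, key2, value.
a = np.array([[1, 0, 10.0],
              [2, 1, 20.0],
              [1, 0, 30.0],
              [2, 0, 40.0],
              [2, 1, 50.0]])
# lexsort sorts by the last key given first, so pass key2 then key1
# to order rows by (key1, key2).
a = a[np.lexsort((a[:, 1], a[:, 0]))]
# A group boundary is any row where at least one key column changes.
changed = np.any(np.diff(a[:, :2], axis=0) != 0, axis=1)
indices = np.nonzero(changed)[0] + 1
means = list(map(np.mean, np.split(a[:, 2], indices)))
keys = a[np.concatenate([[0], indices]), :2]
print(dict(zip(map(tuple, keys), means)))
# {(1.0, 0.0): 20.0, (2.0, 0.0): 40.0, (2.0, 1.0): 35.0}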
Looks easier with pandas :)
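For comparison, a minimal pandas sketch of the same single-key grouping
(the DataFrame construction and column names are just one way to set it
up):

import pandas as pd
# The single-key data from the example above, with made-up column names.
df = pd.DataFrame({'key': [5, 1, 5, 1, 5],
                   'value': [1.0, 3.0, 2.0, 4.0, -1.0]})
group_means = df.groupby('key')['value'].mean()
print(group_means.to_dict())  # {1: 3.5, 5: 0.6666666666666666}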
--
Oscar