[Tutor] Change datatype for specific columns in a 2D array & computing the mean
Oscar Benjamin
oscar.j.benjamin at gmail.com
Mon Jan 25 09:49:26 EST 2016
On 25 January 2016 at 13:14, Peter Otten <__peter__ at web.de> wrote:
>> What do you mean by "group rows"?
>
> Given a table you can specify columns as keys and in the simplest case one
> column where you apply an aggregate function over the sets of rows with the
> same key.
>
> If I understand you correctly you are doing that for a single known key,
> whereas I considered finding the keys part of the task. In SQL you'd spell
> that
>
> [prob 1]
> select key1, key2, sum(value) from some_table group by key1, key2;
>
>> I thought the OP's problem is really to filter rows which I already
>> showed how to do in numpy.
>
> You solve
>
> [prob 2]
> select sum(value) from some_table where key1=? and key2=?;
>
> You'll eventually get from [prob 2] to [prob 1], but you need a few lines of
> Python.
Oh okay. It wasn't clear to me that was what the OP wanted.
You can do it in numpy by combining a few pieces. First we define an array:
In [1]: import numpy as np
In [2]: a = np.array([[5, 1.0], [1, 3.0], [5, 2.0], [1, 4.0], [5, -1.0]])
In [3]: a
Out[3]:
array([[ 5.,  1.],
       [ 1.,  3.],
       [ 5.,  2.],
       [ 1.,  4.],
       [ 5., -1.]])
Now we need to sort the rows by the key column. (Note that np.sort with
axis=0 would sort each column independently and break the pairing
between keys and values, so we reorder whole rows with argsort
instead.)
In [4]: a = a[a[:, 0].argsort()]
In [5]: a
Out[5]:
array([[ 1.,  3.],
       [ 1.,  4.],
       [ 5.,  1.],
       [ 5.,  2.],
       [ 5., -1.]])
Now we can access the 2nd column easily:
In [6]: a[:, 1]
Out[6]: array([ 3.,  4.,  1.,  2., -1.])
But we want to split that column according to the first column. We can
use the split function if we know the indices, and we can get them with
diff:
In [7]: np.diff(a[:, 0])
Out[7]: array([ 0.,  4.,  0.,  0.])
In [14]: np.nonzero(np.diff(a[:, 0]))[0]
Out[14]: array([1])
In [15]: indices = np.nonzero(np.diff(a[:, 0]))[0] + 1
In [16]: indices
Out[16]: array([2])
Now we can use these to split the second column of the sorted array:
In [17]: grouped = np.split(a[:, 1], indices)
In [18]: grouped
Out[18]: [array([ 3.,  4.]), array([ 1.,  2., -1.])]
In [19]: list(map(np.mean, grouped))
Out[19]: [3.5, 0.6666666666666666]
It's not exactly straightforward, but numpy has all the primitives to
make this reasonably efficient. If we also want the list of keys then:
In [23]: a[np.concatenate([[0], indices]), 0]
Out[23]: array([ 1., 5.])
Altogether:
import numpy as np
a = np.array([[5, 1.0], [1, 3.0], [5, 2.0], [1, 4.0], [5, -1.0]])
a = a[a[:, 0].argsort()]  # sort whole rows by the key column
indices = np.nonzero(np.diff(a[:, 0]))[0] + 1  # group boundaries
means = list(map(np.mean, np.split(a[:, 1], indices)))
keys = a[np.concatenate([[0], indices]), 0]
group_means = dict(zip(keys, means))
print(group_means)  # {1.0: 3.5, 5.0: 0.6666666666666666}
If you want to key on multiple columns, use lexsort instead of a plain
sort and look for rows where any of the key columns changes in the diff
array, but otherwise it's the same principle.
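Here is a minimal sketch of that multi-key variant, using a made-up
array whose first two columns are keys and whose last column is the
value:

import numpy as np
# Made-up data: columns are key1, key2, value.
a = np.array([[1, 0, 10.0],
              [2, 1, 20.0],
              [1, 0, 30.0],
              [2, 0, 40.0],
              [2, 1, 50.0]])
# lexsort sorts by the last key given first, so pass key2 then key1
# to order rows by (key1, key2).
a = a[np.lexsort((a[:, 1], a[:, 0]))]
# A group boundary is any row where at least one key column changes.
changed = np.any(np.diff(a[:, :2], axis=0) != 0, axis=1)
indices = np.nonzero(changed)[0] + 1
means = list(map(np.mean, np.split(a[:, 2], indices)))
keys = a[np.concatenate([[0], indices]), :2]
print(dict(zip(map(tuple, keys), means)))
# {(1.0, 0.0): 20.0, (2.0, 0.0): 40.0, (2.0, 1.0): 35.0}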
Looks easier with pandas :)
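For comparison, a minimal pandas sketch of the same single-key grouping
(the DataFrame construction and column names are just one way to set it
up):

import pandas as pd
# The single-key data from the example above, with made-up column names.
df = pd.DataFrame({'key': [5, 1, 5, 1, 5],
                   'value': [1.0, 3.0, 2.0, 4.0, -1.0]})
group_means = df.groupby('key')['value'].mean()
print(group_means.to_dict())  # {1: 3.5, 5: 0.6666666666666666}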
--
Oscar