Date: Sat, 13 Feb 2016 22:41:13 -0500
From: Allan Haldane <allanhaldane@gmail.com>
To: numpy-discussion@scipy.org
Subject: Re: [Numpy-discussion] [Suggestion] Labelled Array
Message-ID: <56BFF759.7010505@gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Possibly there's still a case for including a 'groupby' function in
numpy itself since it's a generally useful operation, but I do see less
of a need given the nice pandas functionality.
At least, next time someone asks a stackoverflow question like the ones
below, someone should tell them to use pandas!

(Copied from my gist for future list reference.)
In [10]: pd.options.display.max_rows=10
In [13]: np.random.seed(1234)
In [14]: c = np.random.randint(0,32,size=100000)
In [15]: v = np.arange(100000)
In [16]: df = DataFrame({'v' : v, 'c' : c})
In [17]: df
Out[17]:
        c      v
0      15      0
1      19      1
2       6      2
3      21      3
4      12      4
...    ..    ...
99995   7  99995
99996   2  99996
99997  27  99997
99998  28  99998
99999   7  99999

[100000 rows x 2 columns]
In [19]: df.groupby('c').count()
Out[19]:
         v
c
0     3136
1     3229
2     3093
3     3121
4     3041
..     ...
27    3128
28    3063
29    3147
30    3073
31    3090

[32 rows x 1 columns]
In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop
In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop
In [22]: df.groupby('c').mean()
Out[22]:
               v
c
0   49883.384885
1   50233.692165
2   48634.116069
3   50811.743992
4   50505.368629
..           ...
27  49715.349425
28  50363.501469
29  50485.395933
30  50190.155223
31  50691.041748

[32 rows x 1 columns]
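For comparison, the same per-group count and mean can be sketched in plain numpy with bincount (an illustration only, not a full groupby; variable names are mine):

```python
import numpy as np

np.random.seed(1234)
c = np.random.randint(0, 32, size=100000)  # group labels, as in the session above
v = np.arange(100000, dtype=float)         # values

counts = np.bincount(c)            # number of rows per label
sums = np.bincount(c, weights=v)   # sum of v per label
means = sums / counts              # mean of v per label
```

This only covers reductions expressible as weighted counts, which is part of why a general groupby keeps coming up.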
On Sat, Feb 13, 2016 at 1:29 PM, <josef.pktd@gmail.com> wrote:

On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
Sorry to reply to myself here, but looking at it with fresh eyes, maybe
the performance of the naive version isn't too bad. Here's a comparison
of the naive version vs. a better implementation:
    def split_classes_naive(c, v):
        return [v[c == u] for u in unique(c)]

    def split_classes(c, v):
        perm = c.argsort()
        csrt = c[perm]
        div = where(csrt[1:] != csrt[:-1])[0] + 1
        return [v[x] for x in split(perm, div)]
>>> c = randint(0,32,size=100000)
>>> v = arange(100000)
>>> %timeit split_classes_naive(c,v)
100 loops, best of 3: 8.4 ms per loop
>>> %timeit split_classes(c,v)
100 loops, best of 3: 4.79 ms per loop
The use cases I recently started to target for similar things are 1
million or more rows and 10,000 uniques in the labels.

The second version should be faster for a large number of uniques, I
guess.
Overall numpy is falling far behind pandas in terms of simple
groupby operations. bincount and histogram (IIRC) worked for some
cases but are rather limited.

reduceat looks nice for cases where it applies.
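As a concrete illustration of where reduceat applies (a sketch of mine, not from the thread): once the values are sorted by label, np.add.reduceat gives all per-group sums in a single call:

```python
import numpy as np

c = np.array([2, 0, 1, 0, 2, 1])           # labels
v = np.array([10., 1., 5., 3., 20., 7.])   # values

perm = c.argsort(kind='stable')   # bring equal labels together
csrt, vsrt = c[perm], v[perm]
# index where each run of equal labels begins
starts = np.concatenate(([0], np.where(csrt[1:] != csrt[:-1])[0] + 1))
group_sums = np.add.reduceat(vsrt, starts)
# group_sums -> array([ 4., 12., 30.])  (sums for labels 0, 1, 2)
```

The limitation is that reduceat only covers ufunc reductions over contiguous runs, hence "where it applies".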
In contrast to the full-sized labels in the original post, I only
know of applications where the labels are 1-D, corresponding to rows
or columns.
Josef
In any case, maybe it is useful to Sergio or others.
Allan
On 02/13/2016 12:11 PM, Allan Haldane wrote:
I've had a pretty similar idea for a new indexing function
'split_classes' which would help in your case, and which essentially does

    def split_classes(c, v):
        return [v[c == u] for u in unique(c)]
Your example could be coded as
>>> [sum(c) for c in split_classes(label, data)]
[9, 12, 15]
I feel I've come across the need for such a function often enough
that it might be generally useful to people as part of numpy. The
implementation of split_classes above has pretty poor performance
because it creates many temporary boolean arrays, so my plan for a
PR was to have a speedy version of it that uses a single pass
through v. (I often wanted to use this function on large datasets.)

If anyone has any comments on the idea (good idea? bad idea?) I'd
love to hear.
I have some further notes and examples here:
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
Allan
On 02/12/2016 09:40 AM, Sérgio wrote:
Hello,
This is my first e-mail; I will try to keep the idea simple.
Similar to a masked array, it would be interesting to use a label
array to guide operations.
Ex.:
>>> x
labelled_array(data =
 [[0 1 2]
  [3 4 5]
  [6 7 8]],
 label =
 [[0 1 2]
  [0 1 2]
  [0 1 2]])

>>> sum(x)
array([ 9, 12, 15])
The operations would create a new axis for label indexing. You
could think of it as a collection of masks, one for each label.

I don't know a way to make something like this efficiently without
a loop. Just wondering...
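(One loop-free way to get that particular result, sketched for illustration: np.bincount with weights sums the data per label directly.)

```python
import numpy as np

data = np.arange(9).reshape(3, 3)        # the example's data array
label = np.tile([0, 1, 2], (3, 1))       # the example's label array

# sum data entries sharing each label, no Python loop
sums = np.bincount(label.ravel(), weights=data.ravel())
# sums -> array([ 9., 12., 15.])
```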
Sérgio.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion