Picking rows with the first (or last) occurrence of each key

(I'm probably going to botch the description...) Suppose I have a 2D array of Python objects, the first n elements of each row form a key, the rest of the elements form the value. Each key can (and generally does) occur multiple times. I'd like to generate a new array consisting of just the first (or last) row for each key occurrence. Rows retain their relative order on output. For example, suppose I have this array with key length 2:

    [ 'a', 27, 14.5 ]
    [ 'b', 12, 99.0 ]
    [ 'a', 27, 15.7 ]
    [ 'a', 17, 100.3 ]
    [ 'b', 12, -329.0 ]

Selecting the first occurrence of each key would return this array:

    [ 'a', 27, 14.5 ]
    [ 'b', 12, 99.0 ]
    [ 'a', 17, 100.3 ]

while selecting the last occurrence would return this array:

    [ 'a', 27, 15.7 ]
    [ 'a', 17, 100.3 ]
    [ 'b', 12, -329.0 ]

In real life, my array is a bit larger than this example, with the input being on the order of a million rows and the output being around 5000 rows. Avoiding processing all those extra rows at the Python level would speed things up. I don't know what this filter might be called (though I'm sure I haven't thought of something new), so searching Google or Bing for it would seem to be fruitless. It strikes me as something which numpy or Pandas might already have in their bag(s) of tricks.

Pointers appreciated,

Skip

Hey Skip,

Any way that you can make your keys numeric? Then you can run np.diff on that first column, and use the indices of nonzero entries (np.flatnonzero) to know where values change. With a +1/-1 offset (that I am too lazy to figure out right now ;) you can then index into the original rows to get either the first or last occurrence of each run.

Juan.

On 2 July 2016 at 10:10:16 PM, Skip Montanaro (skip.montanaro@gmail.com) wrote:
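A minimal sketch of this recipe, assuming the key column has already been made numeric (the key values below are invented). Note that it selects the first/last row of each *run* of equal keys, a distinction that comes up later in the thread:

```python
import numpy as np

# Hypothetical numeric key column, one integer per row.
keys = np.array([3, 3, 7, 7, 7, 2, 2])

# Indices i where keys[i] != keys[i+1], i.e. the run boundaries.
change = np.flatnonzero(np.diff(keys))

# First row of each run: position 0, plus one past each boundary (the +1 offset).
first_idx = np.concatenate(([0], change + 1))
# Last row of each run: each boundary, plus the final position.
last_idx = np.concatenate((change, [len(keys) - 1]))
```

`first_idx`/`last_idx` can then index the original row array directly.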

This is trivial in pandas, a simple groupby:

    In [6]: data = [['a', 27, 14.5], ['b', 12, 99.0], ['a', 17, 100.3], ['b', 12, -329.0]]

    In [7]: df = DataFrame(data, columns=list('ABC'))

    In [8]: df
    Out[8]:
       A   B      C
    0  a  27   14.5
    1  b  12   99.0
    2  a  17  100.3
    3  b  12 -329.0

    In [9]: df.groupby('A').first()
    Out[9]:
        B     C
    A
    a  27  14.5
    b  12  99.0

    In [10]: df.groupby('A').last()
    Out[10]:
        B      C
    A
    a  17  100.3
    b  12 -329.0

On Mon, Jul 4, 2016 at 7:27 PM, Skip Montanaro <skip.montanaro@gmail.com> wrote:
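For the two-column key in the original example, a hedged alternative sketch is pandas' drop_duplicates, which keeps the key as ordinary columns and preserves the input row order (the column names here are made up):

```python
import pandas as pd

# The five example rows from the original post; the first two columns form the key.
df = pd.DataFrame(
    [['a', 27, 14.5], ['b', 12, 99.0], ['a', 27, 15.7],
     ['a', 17, 100.3], ['b', 12, -329.0]],
    columns=['k1', 'k2', 'val'])

# keep='first'/'last' picks one row per distinct key, retaining original order.
first = df.drop_duplicates(subset=['k1', 'k2'], keep='first')
last = df.drop_duplicates(subset=['k1', 'k2'], keep='last')
```

This reproduces both expected outputs from the original post, rows in their original relative order.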

> This is trivial in pandas. a simple groupby.

Oh, cool. Precisely the sort of solution I was hoping would turn up. Straightforward, easy for a novice data slinger like me to understand, and likely a bazillion times faster than the straightforward version.

Skip

On 4 July 2016 at 7:38:48 PM, Skip Montanaro (skip.montanaro@gmail.com) wrote:

> Oh, cool. Precisely the sort of solution I was hoping would turn up.

Except it doesn’t seem to meet your original spec, which retrieved the first item of each *run* of an index value?

> Except it doesn’t seem to meet your original spec, which retrieved the first item of each *run* of an index value?

No, just what I was looking for. As I indicated in my original post, "I'm probably going to botch the description..." It's quite likely that my problem specification wasn't as crisp as it should have been.

S

On 4 July 2016 at 7:27:47 PM, Skip Montanaro (skip.montanaro@gmail.com) wrote:

> Hashing it probably wouldn't work, too great a chance for collisions.

If the string is ASCII, you can always interpret the bytes as part of an 8-byte integer. Or, you can map unique values to consecutive integers.
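The "map unique values to consecutive integers" idea can be sketched with np.unique (the key values below are invented):

```python
import numpy as np

# Hypothetical string key column.
keys = np.array(['a', 'b', 'a', 'a', 'b'])

# return_inverse gives, for each element, the index of its unique value --
# a collision-free integer encoding, unlike hashing.
_, codes = np.unique(keys, return_inverse=True)
```

`codes` is then a numeric key column suitable for the np.diff/np.flatnonzero approach suggested earlier.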

On Tue, Jul 5, 2016 at 1:03 AM, Juan Nunez-Iglesias <jni.soma@gmail.com> wrote:
IIUC, np.nonzero(a[1:] != a[:-1]) gives all change points independent of dtype. Add or remove a 1 to adjust which element is indexed. (IIRC from a long time ago, arraysetops used/uses something like this.)

Josef
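A sketch of this comparison trick, with the slice written as a[1:] != a[:-1] and extended (as an assumption, not part of the original suggestion) to a multi-column key by reducing over columns:

```python
import numpy as np

# Key columns as a 2-D object array; the elementwise comparison works for any
# comparable dtype, so no numeric conversion is needed.
keys = np.array([['a', 27], ['b', 12], ['b', 12], ['a', 17]], dtype=object)

# True at position i when any key column of row i+1 differs from row i.
changed = (keys[1:] != keys[:-1]).any(axis=1)

# Prepend/append True to pick the first/last row of each run.
first_idx = np.flatnonzero(np.concatenate(([True], changed)))
last_idx = np.flatnonzero(np.concatenate((changed, [True])))
```

As with the np.diff version, this selects per *run* of equal keys, not per distinct key.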

1. This is not a NumPy question; StackExchange would be more appropriate.

2. Do some bookkeeping:

    def initialKeyFilter(data, length, invert=False):
        result = list()
        seen = set()
        if invert:
            data = reversed(data)
        for datum in data:
            k = tuple(datum[:length])
            if k not in seen:
                result.append(datum)
                seen.add(k)
        if invert:
            result.reverse()
        return result

Cheers,
Alan Isaac
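Run against the thread's example rows, this bookkeeping approach reproduces both expected outputs (the function is repeated verbatim here so the snippet is self-contained):

```python
def initialKeyFilter(data, length, invert=False):
    result = list()
    seen = set()
    if invert:
        data = reversed(data)
    for datum in data:
        k = tuple(datum[:length])
        if k not in seen:
            result.append(datum)
            seen.add(k)
    if invert:
        result.reverse()
    return result

rows = [['a', 27, 14.5], ['b', 12, 99.0], ['a', 27, 15.7],
        ['a', 17, 100.3], ['b', 12, -329.0]]

firsts = initialKeyFilter(rows, 2)                # first occurrence of each key
lasts = initialKeyFilter(rows, 2, invert=True)    # last occurrence, order preserved
```

It is O(n) in the number of rows, but each row is visited at the Python level, which is the cost the original poster was hoping to avoid.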

> 1. This is not a NumPy question; StackExchange would be more appropriate.

Thanks, that is the fairly straightforward -- but slow -- solution, which I have already implemented. I was asking if numpy had some filtering functions which might speed things up (it's a huge library, with which I'm not terribly familiar). It's fine if the answer is "no".

participants (5)

- Alan Isaac
- Jeff Reback
- josef.pktd@gmail.com
- Juan Nunez-Iglesias
- Skip Montanaro