Mailman 3 Multiplicity of an entry - NumPy-Discussion

Multiplicity of an entry

Nils Wagner

26 Oct 2009

8:04 a.m.

Hi all,

how can I obtain the multiplicity of an entry in a list

a = ['abc','def','abc','ghij']

The multiplicity of
'abc' is 2.
'def' is 1.
'ghij' is 1.

Nils

Alan G Isaac

26 Oct 2009

12:25 p.m.

On 10/26/2009 4:04 AM, Nils Wagner wrote:
> how can I obtain the multiplicity of an entry in a list
> a = ['abc','def','abc','ghij']

That's a Python question, not a NumPy question, so comp.lang.python would be a better forum. But here's a simple solution::

    a = ['abc','def','abc','ghij']
    for item in set(a):
        # a.count() rescans the whole list for every distinct item
        print(item, a.count(item))

This is horribly inefficient, of course. If you have a big list, it would be *much* better to use defaultdict::

    from collections import defaultdict

    myct = defaultdict(int)  # missing keys start at 0
    for item in a:
        myct[item] += 1      # a single pass over the list
    print(list(myct.items()))

fwiw,
Alan Isaac
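For readers on a newer Python than the one used in this thread: the standard library's `collections.Counter` (added after this message was written, in Python 2.7 and 3.1) packages the same single-pass counting as the defaultdict loop. A minimal sketch, using the list from the original question:

```python
from collections import Counter

a = ['abc', 'def', 'abc', 'ghij']

# Counter makes one pass over the list and maps each item to its multiplicity.
counts = Counter(a)

for item, n in counts.items():
    print(item, n)

# Items that never occur simply report a count of zero.
missing = counts['xyz']
```

`counts.most_common()` returns the same pairs sorted by decreasing multiplicity, which is handy when only the most frequent entries matter.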

Christopher Barker

6:12 p.m.

Alan G Isaac wrote:
> That's a Python question, not a NumPy question.

but we can make it a numpy question!

In [15]: a = np.array(['abc','def','abc','ghij'])

In [16]: a
Out[16]:
array(['abc', 'def', 'abc', 'ghij'],
      dtype='|S4')

In [17]: for item in set(a): print item, (a == item).sum()
abc 2
ghij 1
def 1

I'll leave profiling to the OP.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
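Later NumPy releases (1.9 and newer) can do this count in a single call, via the `return_counts` argument of `np.unique`. That option is not mentioned in this thread; the following is only a sketch of how the same question could be answered with it today:

```python
import numpy as np

a = np.array(['abc', 'def', 'abc', 'ghij'])

# values holds the sorted distinct entries, counts their multiplicities.
values, counts = np.unique(a, return_counts=True)

# Pair them up as a plain dict for easy lookup.
multiplicity = dict(zip(values.tolist(), counts.tolist()))

for item, n in multiplicity.items():
    print(item, n)
```

Unlike the loop over `set(a)`, this does not build one boolean array per distinct item.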

josef.pktd＠gmail.com

6:26 p.m.

On Mon, Oct 26, 2009 at 2:12 PM, Christopher Barker wrote:
> but we can make it a numpy question!
>
> In [17]: for item in set(a): print item, (a == item).sum()

It's *very* slow when there are a large number of items, because numpy creates the full boolean array for each item.

See also http://projects.scipy.org/scipy/ticket/905

Josef

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
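One way to avoid a full boolean array per item is to sort once and measure the length of each run of equal values. The helper below is a hypothetical illustration (the name `run_length_counts` does not come from the thread or from NumPy), offered as a sketch rather than as what anyone here proposed:

```python
import numpy as np


def run_length_counts(a):
    """Return (distinct values, counts) using one sort instead of one mask per item."""
    s = np.sort(np.asarray(a))
    if s.size == 0:
        return s, np.zeros(0, dtype=np.intp)
    # A new run starts wherever a sorted element differs from its predecessor.
    starts = np.concatenate(([0], np.flatnonzero(s[1:] != s[:-1]) + 1))
    # Run lengths are the gaps between consecutive starts and the array end.
    counts = np.diff(np.concatenate((starts, [s.size])))
    return s[starts], counts


values, counts = run_length_counts(['abc', 'def', 'abc', 'ghij'])
for item, n in zip(values, counts):
    print(item, n)
```

The cost is dominated by the sort, so it grows as n log n in the array length and does not depend on how many distinct items there are.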

Nadav Horesh

27 Oct 2009

2:27 a.m.

In principle you could use:

np.equal(a,a).sum(0)

but, for an unknown reason, np.equal operates only on "normal" arrays.

Maybe you can transform the array to arrays of numbers, for example by hash.

  Nadav

-----Original Message-----
From: numpy-discussion-bounces@scipy.org on behalf of josef.pktd@gmail.com
Sent: Mon 26-October-09 20:26
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Multiplicity of an entry

On Mon, Oct 26, 2009 at 2:12 PM, Christopher Barker wrote:
> In [17]: for item in set(a): print item, (a == item).sum()

It's *very* slow, when there are a large number of items. numpy creates the full boolean array for each item.

see also http://projects.scipy.org/scipy/ticket/905

Josef
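As written, `np.equal(a, a)` compares each element only with itself, so even where it works the column sums would all be 1. Presumably an all-pairs comparison was intended; that reading is an assumption, not something stated in the message. A sketch of that idea, using `==` with broadcasting since `==` is the comparison reported to work on string arrays in this thread:

```python
import numpy as np

a = np.array(['abc', 'def', 'abc', 'ghij'])

# a[:, None] has shape (4, 1); comparing it with a (shape (4,)) broadcasts
# to a (4, 4) boolean matrix whose entry [i, j] is a[i] == a[j].
pairwise = a[:, None] == a

# Summing over axis 0 gives, for each position, the multiplicity of its entry.
multiplicity = pairwise.sum(axis=0)

for item, n in zip(a, multiplicity):
    print(item, n)
```

This needs memory proportional to the square of the array length, so it is only practical for short arrays.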

Christopher Barker

4:09 p.m.

Nadav Horesh wrote:
> np.equal(a,a).sum(0)
>
> but, for unknown reason, np.equal operates only on "normal" arrays.

true:

In [25]: a
Out[25]:
array(['abc', 'def', 'abc', 'ghij'],
      dtype='|S4')

In [27]: np.equal(a,a)
Out[27]: NotImplemented

however:

In [28]: a == a
Out[28]: array([ True,  True,  True,  True], dtype=bool)

don't they use the same code? or is "==" reverting to plain old generic python sequence comparison, which would partly explain why it is so slow.

> maybe you can transform the array to arrays of numbers, for example by hash.

or even easier:

In [32]: a2 = a.view(dtype=np.int32)

In [33]: a2
Out[33]: array([1633837824, 1684366848, 1633837824, 1734895978])

In [34]: np.equal(a2, a2[0])
Out[34]: array([ True, False,  True, False], dtype=bool)

though that only works if your strings are a handy length like 4 bytes...

-Chris
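The int32 view depends on the item size being exactly 4 bytes. A variation that should work for any fixed-width byte-string dtype is to view the data as a matrix of single bytes and compare whole rows. This is a sketch under that assumption (a contiguous 1-D array of dtype `S`), not something proposed in the thread:

```python
import numpy as np

a = np.array(['abc', 'def', 'abc', 'ghij'], dtype='S4')

# One row per string, one column per byte (shorter strings are NUL-padded).
raw = a.view(np.uint8).reshape(a.size, a.dtype.itemsize)

# np.equal works on the numeric bytes; a row matches only if every byte matches.
matches = np.equal(raw, raw[0]).all(axis=1)

print(matches, int(matches.sum()))
```

Because it compares individual bytes, the result does not depend on the machine's byte order, unlike the integer values shown in the int32 view.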

Michael Droettboom

7:31 p.m.

Christopher Barker wrote:
> In [27]: np.equal(a,a)
> Out[27]: NotImplemented
>
> however:
>
> In [28]: a == a
> Out[28]: array([ True,  True,  True,  True], dtype=bool)
>
> don't they use the same code? or is "==" reverting to plain old generic python sequence comparison, which would partly explain why it is so slow.

It looks as if "a == a" (that is array_richcompare) is triggering special case code for strings, so it is fast. However, IMHO np.equal should be made to work as well. Can you file a bug and assign it to me? (I'm dealing with a number of other string-related things, so I might as well take this too.)

Mike

--
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA

Travis Oliphant

8:04 p.m.

On Oct 27, 2009, at 2:31 PM, Michael Droettboom wrote:
> It looks as if "a == a" (that is array_richcompare) is triggering special case code for strings, so it is fast. However, IMHO np.equal should be made to work as well. Can you file a bug and assign it to me (I'm dealing with a number of other string-related things, so I might as well take this too).

The array_richcompare special-cased strings not for speed but for actual functionality.

Making np.equal work with strings requires changes to the ufunc code itself, which was never written to work with "variable-length" data-types (like strings, unicode, and records). There are several things that would have to be fixed. Some of the changes we made to allow for date-time data-types also made it possible to support variable-length strings, but this is non-trivial to implement.

It's certainly possible, but I would want to look at any changes you make before committing them to make sure all the issues are being understood.

Thanks,

-Travis

--
Travis Oliphant
Enthought Inc.
1-512-536-1057
http://www.enthought.com
oliphant@enthought.com
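Separately from the ufunc machinery discussed above, NumPy has a string-aware element-wise comparison in its `np.char` module. The sketch below uses `np.char.equal`; it is offered as an illustration of a function-style comparison on string arrays, not as the fix being discussed, and the `target` array is an assumption added so that both inputs have the same shape:

```python
import numpy as np

a = np.array(['abc', 'def', 'abc', 'ghij'])

# An array of the same shape filled with the entry being counted.
target = np.full(a.shape, 'abc')

# np.char.equal compares string arrays element by element.
mask = np.char.equal(a, target)
count = int(mask.sum())

print(mask, count)
```

Note that `np.char.equal` strips trailing whitespace before comparing, so it is not a drop-in equivalent of `==` when trailing spaces are significant.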

Michael Droettboom

9:07 p.m.

Travis Oliphant wrote:
> The array_richcompare special-cased strings not for speed but for actual functionality.
>
> It's certainly possible, but I would want to look at any changes you make before committing them to make sure all the issues are being understood.

Yeah -- I'm realizing this is a bigger project than I initially suspected. I'll keep you posted if I find the time to do this right.

Mike
