I have a recarray -- the first column is date.

I have the following function to compute the number of unique dates in my data set:

def byName(): return(len(list(set(d['Date'])) ))

Question: is the string 'Date' looked up at each iteration? If so, this is dumb, but explains my horrible performance. Or, is there a better way to code the above? Can I convert this to something indexed by column number, and convert 'Date' to column number "0" upfront? Would this help with speed?

W
On Wed, Jul 21, 2010 at 15:12, wheres pythonmonks <wherespythonmonks@gmail.com> wrote:
I have a recarray -- the first column is date.
I have the following function to compute the number of unique dates in my data set:
def byName(): return(len(list(set(d['Date'])) ))
Question: is the string 'Date' looked up at each iteration? If so, this is dumb, but explains my horrible performance. Or, is there a better way to code the above?
len(np.unique(d['Date']))

If you can come up with a self-contained example that we can benchmark, it would help. In my examples, I don't see any hideous performance, but my examples may be missing some crucially important detail about your data that is causing your performance problems.

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
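A self-contained example of the kind Robert asks for might look like the sketch below. The field names match the original post; the sizes and date values are made up for illustration:

```python
import numpy as np

# Hypothetical recarray with a 'Date' field, as in the original post.
n = 100_000
dates = np.random.choice(
    np.array(['2010-07-19', '2010-07-20', '2010-07-21']), size=n)
d = np.rec.fromarrays([dates, np.random.rand(n)], names=['Date', 'Value'])

# Pure-Python version from the original post:
def by_name_set():
    return len(set(d['Date']))

# NumPy version suggested by Robert:
def by_name_unique():
    return len(np.unique(d['Date']))

print(by_name_set(), by_name_unique())  # both report 3 unique dates
```

Both functions give the same count; the `np.unique` version avoids boxing each element into a Python object.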
Wed, 21 Jul 2010 15:12:14 -0400, wheres pythonmonks wrote:
I have a recarray -- the first column is date.
I have the following function to compute the number of unique dates in my data set:
def byName(): return(len(list(set(d['Date'])) ))
What this code does is:

1. d['Date']

   Extract an array slice containing the dates. This is fast.

2. set(d['Date'])

   Make copies of each array item, and box them into Python objects. This is slow. Insert each of the objects in the set. Also this is somewhat slow.

3. list(set(d['Date']))

   Get each item in the set, and insert them into a new list. This is somewhat slow, and unnecessary if you only want to count.

4. len(list(set(d['Date'])))

So the slowness arises because the code is copying data around, and boxing it into Python objects. You should try using Numpy functions (these don't re-box the data) to do this:

http://docs.scipy.org/doc/numpy/reference/routines.set.html

-- Pauli Virtanen
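The steps Pauli lists can be seen directly at the interpreter; a toy array (not the poster's data) makes the point:

```python
import numpy as np

# A tiny stand-in for the poster's recarray.
d = np.rec.fromarrays(
    [np.array(['2010-07-20', '2010-07-20', '2010-07-21']), np.arange(3)],
    names=['Date', 'Value'])

col = d['Date']   # step 1: a slice of the array, no per-element boxing yet
s = set(col)      # step 2: boxes each element into a Python object (slow)
lst = list(s)     # step 3: copies the set into a list (unneeded for counting)
n = len(lst)      # step 4: the count, here 2

# The NumPy set routine avoids the boxing entirely:
print(n, len(np.unique(d['Date'])))
```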
Thank you very much.... better crack open a numpy reference manual instead of relying on my python "intuition".

On Wed, Jul 21, 2010 at 3:44 PM, Pauli Virtanen <pav@iki.fi> wrote:
Wed, 21 Jul 2010 15:12:14 -0400, wheres pythonmonks wrote:
I have a recarray -- the first column is date.
I have the following function to compute the number of unique dates in my data set:
def byName(): return(len(list(set(d['Date'])) ))
What this code does is:
1. d['Date']
Extract an array slice containing the dates. This is fast.
2. set(d['Date'])
Make copies of each array item, and box them into Python objects. This is slow.
Insert each of the objects in the set. Also this is somewhat slow.
3. list(set(d['Date']))
Get each item in the set, and insert them into a new list. This is somewhat slow, and unnecessary if you only want to count.
4. len(list(set(d['Date'])))
So the slowness arises because the code is copying data around, and boxing it into Python objects.
You should try using Numpy functions (these don't re-box the data) to do this. http://docs.scipy.org/doc/numpy/reference/routines.set.html
-- Pauli Virtanen
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
However: is there an automatic way to convert a named index to a position?

What about looping over tuples of my recarray:

for t in d:
    date = t['Date']
    ....

I guess that the above does have to look up 'Date' each time. But the following does not need the hash lookup for each tuple:

for t in d:
    date = t[0]
    ....

Should I create a map from dtype.names(), and use that to look up the index based on the name in advance? (if I really really want to factor out the lookup of 'Date')

On Wed, Jul 21, 2010 at 3:47 PM, wheres pythonmonks <wherespythonmonks@gmail.com> wrote:
Thank you very much.... better crack open a numpy reference manual instead of relying on my python "intuition".
On Wed, Jul 21, 2010 at 3:44 PM, Pauli Virtanen <pav@iki.fi> wrote:
Wed, 21 Jul 2010 15:12:14 -0400, wheres pythonmonks wrote:
I have a recarray -- the first column is date.
I have the following function to compute the number of unique dates in my data set:
def byName(): return(len(list(set(d['Date'])) ))
What this code does is:
1. d['Date']
Extract an array slice containing the dates. This is fast.
2. set(d['Date'])
Make copies of each array item, and box them into Python objects. This is slow.
Insert each of the objects in the set. Also this is somewhat slow.
3. list(set(d['Date']))
Get each item in the set, and insert them into a new list. This is somewhat slow, and unnecessary if you only want to count.
4. len(list(set(d['Date'])))
So the slowness arises because the code is copying data around, and boxing it into Python objects.
You should try using Numpy functions (these don't re-box the data) to do this. http://docs.scipy.org/doc/numpy/reference/routines.set.html
-- Pauli Virtanen
On Jul 21, 2010, at 4:22 PM, wheres pythonmonks wrote:
However: is there an automatic way to convert a named index to a position?
What about looping over tuples of my recarray:
for t in d:
    date = t['Date']
    ....
Why don't you use zip ?
for (date, t) in zip(d['Date'], d):
That way, you save repetitive calls to __getitem__....
Should I create a map from dtype.names(), and use that to look up the index based on the name in advance? (if I really really want to factor out the lookup of 'Date')
Meh. I have a bad feeling that it won't really be performant.
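Pierre's zip idea, written out on toy data (the field names are the poster's; the values are made up), resolves the field name once instead of once per row:

```python
import numpy as np

d = np.rec.fromarrays(
    [np.array(['2010-07-20', '2010-07-21']), np.array([1.0, 2.0])],
    names=['Date', 'Value'])

# One field lookup before the loop, instead of one __getitem__ per row:
pairs = [(date, row['Value']) for date, row in zip(d['Date'], d)]
print(pairs)
```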
What about:

idx_by_name = dict(enumerate(d.dtype.names))

Then I can look up the index of the columns I want before the loop, and then access by the index during the loop.

- W

On Wed, Jul 21, 2010 at 4:29 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
On Jul 21, 2010, at 4:22 PM, wheres pythonmonks wrote:
However: is there an automatic way to convert a named index to a position?
What about looping over tuples of my recarray:
for t in d:
    date = t['Date']
    ....
Why don't you use zip ?
for (date, t) in zip(d['Date'], d):
That way, you save repetitive calls to __getitem__....
Should I create a map from dtype.names(), and use that to look up the index based on the name in advance? (if I really really want to factor out the lookup of 'Date')
Meh. I have a bad feeling that it won't really be performant.
On Jul 21, 2010, at 4:35 PM, wheres pythonmonks wrote:
What about:
idx_by_name = dict(enumerate(d.dtype.names))
Then I can look up the index of the columns I want before the loop, and then access by the index during the loop.
Sure. Why don't you try both approaches, time them and document it? I still bet that manipulating tuples of numbers might be easier and more performant than juggling w/ the fields of a numpy.void, but that's a gut feeling only...
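A minimal timeit comparison of the two loop styles under discussion (field-name access per row vs. positional access), on made-up data, could look like:

```python
import timeit
import numpy as np

d = np.rec.fromarrays(
    [np.array(['2010-07-20'] * 10_000), np.arange(10_000.0)],
    names=['Date', 'Value'])

def by_field():
    total = 0.0
    for row in d:          # row is a record; lookup by field name each time
        total += row['Value']
    return total

def by_index():
    total = 0.0
    for row in d:          # same rows, lookup by position
        total += row[1]
    return total

print("by field name:", timeit.timeit(by_field, number=5))
print("by position:  ", timeit.timeit(by_index, number=5))
```

Both loops compute the same sum; only the timings differ, and as Pierre says, the result is worth measuring rather than guessing.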
Wed, 21 Jul 2010 16:22:37 -0400, wheres pythonmonks wrote:
However: is there an automatic way to convert a named index to a position?
It's not really a named index -- it's a field name. Since the fields of an array element can be of different size, they cannot be referred to with an array index (in the sense that Numpy understands the concept).
What about looping over tuples of my recarray:
for t in d: date = t['Date'] ....
I guess that the above does have to lookup 'Date' each time.
As Pierre said, you can move the lookups outside the loop:

for date in d['Date']:
    ...

If you want to iterate over multiple fields, it may be best to use itertools.izip so that you unbox a single element at a time.

However, I'd be quite surprised if the hash lookups actually took a significant part of the run time:

1) Python dictionaries are ubiquitous, and the implementation appears heavily optimized to be fast with strings.

2) The hash of a Python string is cached, and computed only once.

3) String literals are interned, and represented by a single object only:

>>> 'Date' is 'Date'
True

So when running the above Python code, the hash of 'Date' is computed exactly once.

4) For small dictionaries containing strings, such as the fields dictionary, I'd expect 1-3) to be dwarfed by the overhead involved in making Python function calls (PyArg_*) and interpreting the bytecode.

So the usual optimization mantra applies here: measure first :)

Of course, if you measure and show that the expectations in 1-4) are actually wrong, that's fine.

-- Pauli Virtanen
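Pauli's interning point, and the cost of the dict lookup itself, are easy to check at the interpreter. A rough sketch (not a rigorous benchmark; the `fields` dict is a stand-in for the dtype's fields dictionary):

```python
import timeit

# On CPython, identical string literals compiled together are interned,
# so they are the same object and their hash is computed once:
a = 'Date'
b = 'Date'
print(a is b)

# Measuring a million lookups in a small string-keyed dict:
fields = {'Date': 0, 'Value': 1}
t = timeit.timeit("fields['Date']", globals={'fields': fields},
                  number=1_000_000)
print(f"1e6 dict lookups: {t:.3f} s")
```

On typical hardware the lookups take a small fraction of a second, which supports the "measure first" advice.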
My code had a bug. It should be:

idx_by_name = dict((n, i) for i, n in enumerate(d.dtype.names))

On Wed, Jul 21, 2010 at 4:49 PM, Pauli Virtanen <pav@iki.fi> wrote:
Wed, 21 Jul 2010 16:22:37 -0400, wheres pythonmonks wrote:
However: is there an automatic way to convert a named index to a position?
It's not really a named index -- it's a field name. Since the fields of an array element can be of different size, they cannot be referred to with an array index (in the sense that Numpy understands the concept).
What about looping over tuples of my recarray:
for t in d:
    date = t['Date']
    ....
I guess that the above does have to lookup 'Date' each time.
As Pierre said, you can move the lookups outside the loop.
for date in d['Date']:
    ...
If you want to iterate over multiple fields, it may be best to use itertools.izip so that you unbox a single element at a time.
However, I'd be quite surprised if the hash lookups would actually take a significant part of the run time:
1) Python dictionaries are ubiquitous and the implementation appears heavily optimized to be fast with strings.
2) The hash of a Python string is cached, and computed only once.
3) String literals are interned, and represented by a single object only:
>>> 'Date' is 'Date'
True
So when running the above Python code, the hash of 'Date' is computed exactly once.
4) For small dictionaries containing strings, such as the fields dictionary, I'd expect 1-3) to be dwarfed by the overhead involved in making Python function calls (PyArg_*) and interpreting the bytecode.
So the usual optimization mantra applies here: measure first :)
Of course, if you measure and show that the expectations 1-4) are actually wrong, that's fine.
-- Pauli Virtanen
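The corrected mapping from the last message, in use on made-up data (the field names are the poster's):

```python
import numpy as np

d = np.rec.fromarrays(
    [np.array(['2010-07-20', '2010-07-21']), np.array([1.5, 2.5])],
    names=['Date', 'Value'])

# Corrected version: maps each field name to its position.
idx_by_name = dict((n, i) for i, n in enumerate(d.dtype.names))
print(idx_by_name)

i_date = idx_by_name['Date']   # resolve the name once, before the loop
for row in d:
    print(row[i_date])         # positional access inside the loop
```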
participants (4)
- Pauli Virtanen
- Pierre GM
- Robert Kern
- wheres pythonmonks