Hi all-

I'm new to numpy and I'm slowly weaning myself off of IDL. So far the experience has been very positive.

I use recarrays a lot. I often read these recarrays from FITS files using pyfits (modified to work with numpy). I find myself doing the following more than I would like:

    if 'TAG' in rec._coldefs.names: ...

It seems messy to be accessing this "hidden" attribute in this way. Is there a plan to add methods to more transparently do such things? e.g.

    def fieldnames():
        return _coldefs.names

    def field_exists(fieldname):
        return fieldname.upper() in _coldefs.names

    def field_index(fieldname):
        if field_exists(fieldname):
            return _coldefs.names.index(fieldname.upper())
        else:
            return -1  # or None maybe

Thanks, Erin
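A sketch of the helpers Erin proposes, written against the public numpy dtype API rather than pyfits' private _coldefs attribute. The function names and signatures here are illustrative only — they are Erin's suggestions, not an existing pyfits or numpy API:

```python
import numpy as np

def fieldnames(arr):
    """Return the field names of a structured array (empty tuple if none)."""
    return arr.dtype.names or ()

def field_exists(arr, fieldname):
    """Case-insensitive check, matching the FITS upper-case column convention."""
    return fieldname.upper() in (n.upper() for n in fieldnames(arr))

def field_index(arr, fieldname):
    """Return the positional index of a field, or -1 if absent."""
    upper = [n.upper() for n in fieldnames(arr)]
    try:
        return upper.index(fieldname.upper())
    except ValueError:
        return -1

rec = np.zeros(3, dtype=[('X', '>f4'), ('Y', '>i4')])
print(fieldnames(rec))         # ('X', 'Y')
print(field_exists(rec, 'x'))  # True
print(field_index(rec, 'y'))   # 1
print(field_index(rec, 'z'))   # -1
```

Passing the array explicitly keeps the helpers usable on any structured array, not just pyfits results.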
On Mar 15, 2006, at 3:20 PM, Erin Sheldon wrote:
Hi all-
I'm new to numpy and I'm slowly weaning myself off of IDL. So far the experience has been very positive.
I use recarrays a lot. I often read these recarrays from fits files using pyfits (modified to work with numpy). I find myself doing the following more than I would like:
if 'TAG' in rec._coldefs.names: ....
It seems messy to be accessing this "hidden" attribute in this way. Is there a plan to add methods to more transparently do such things?
e.g.
    def fieldnames():
        return _coldefs.names

    def field_exists(fieldname):
        return fieldname.upper() in _coldefs.names

    def field_index(fieldname):
        if field_exists(fieldname):
            return _coldefs.names.index(fieldname.upper())
        else:
            return -1  # or None maybe
Thanks, Erin
You are right that this is messy. We would like to change this sometime. But we'd like to complete the transition to numpy first before doing that so it may be some months before we can (and it may not look quite like what you suggest). But your point is very valid. Thanks, Perry
On 3/15/06, Perry Greenfield <perry@stsci.edu> wrote:
You are right that this is messy. We would like to change this sometime. But we'd like to complete the transition to numpy first before doing that so it may be some months before we can (and it may not look quite like what you suggest). But your point is very valid.
Thanks, Perry
OK, fair enough.

Incidentally, I realized that this attribute _coldefs is not part of recarray anyway, but something added by pyfits. I see now that the names, and the formats with a greater-than sign concatenated on the front, can be extracted from dtype:

    In [247]: t.dtype
    Out[247]: [('x', '>f4'), ('y', '>i4')]

I could write my own function to extract what I need, but I thought I would ask: is there already a simpler way? And is there a function to compare this '>f4' stuff to the named types such as Float32 ('f')?

Erin

P.S. If it is manpower that is preventing some of the simple things like this from being implemented, I could volunteer some time.
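A minimal sketch answering Erin's question about comparing the '>f4' format strings to named types, assuming a current numpy where dtype.fields values are (dtype, byte-offset) tuples. The '>' prefix is the big-endian byte order FITS uses, so direct equality against a native scalar type can fail on little-endian machines; normalizing the byte order first makes the comparison work:

```python
import numpy as np

t = np.zeros(1, dtype=[('x', '>f4'), ('y', '>i4')])

fdt = t.dtype.fields['x'][0]    # dtype('>f4')
print(fdt.kind, fdt.itemsize)   # f 4

# Direct equality is byte-order sensitive; convert to native order ('=')
# before comparing against the named scalar type:
print(fdt.newbyteorder('=') == np.float32)   # True
```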
Erin Sheldon wrote:
On 3/15/06, Perry Greenfield <perry@stsci.edu> wrote:
You are right that this is messy. We would like to change this sometime. But we'd like to complete the transition to numpy first before doing that so it may be some months before we can (and it may not look quite like what you suggest). But your point is very valid.
Thanks, Perry
OK, fair enough.
Incidentally, I realized that this attribute _coldefs is not part of recarray anyway, but something added by pyfits. I see now that the names and the formats with a greater than sign concatenated on the front can be extracted from dtype:
In [247]: t.dtype
Out[247]: [('x', '>f4'), ('y', '>i4')]
I could write my own function to extract what I need, but I thought I would ask: is there already a simpler way? And is there a function to compare this '>f4' stuff to the named types such as Float32 ('f')?
The dtype object does contain what you want. In fact, it's the fields attribute of the dtype object that is a dictionary accessed by field name. Thus, to see if a field is a valid field identifier,

    if name in t.dtype.fields:

would work (well, there is a slight problem in that -1 is a special key to the dictionary that returns a list of field names ordered by offset, and so would work also), but if you know that name is already a string, then no problem.

-Travis
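Travis's membership test as a runnable sketch. Note that in numpy releases after this thread, the special -1 key was removed, so the fields dict holds only genuine field names, each mapped to a (dtype, byte-offset) tuple:

```python
import numpy as np

t = np.zeros(2, dtype=[('x', '>f4'), ('y', '>i4')])

print('x' in t.dtype.fields)   # True
print('z' in t.dtype.fields)   # False

# Each value is a (dtype, byte offset) tuple:
print(t.dtype.fields['y'])     # (dtype('>i4'), 4)
```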
Travis Oliphant wrote:
The dtype object does contain what you want. In fact, it's the fields attribute of the dtype object that is a dictionary accessed by field name. Thus, to see if a field is a valid field identifier,
if name in t.dtype.fields:
would work (well, there is a slight problem in that -1 is a special key to the dictionary that returns a list of field names ordered by offset, and so would work also), but if you know that name is already a string, then no problem.
Mmh, just curious: I wonder about the wisdom of that overloading of a 'magic' key (-1). It will make things like

    for name in t.dtype.fields:

return a spurious entry (the -1), and having the sorted list accessed as

    for name in t.dtype.fields[-1]:

reads weird. I'm sure there was a good reason behind this, but I wonder if it wouldn't be better to provide this particular functionality (the list currently associated with the special -1 key) via a different mechanism, and guarantee that t.dtype.fields.keys() == [ list of valid names ]. It just sounds like enforcing a bit of API orthogonality here would be a good thing in the long run, but perhaps I'm missing something (I don't claim to know the reasoning that went behind today's implementation).

Best, f
Fernando Perez wrote:
Mmh, just curious: I wonder about the wisdom of that overloading of a 'magic' key (-1). It will make things like
for name in t.dtype.fields:
No real wisdom. More organic growth. Initially I didn't have an ordered list of fields, but as more complicated data-type descriptors were supported, this became an important thing to have. I should have probably added an additional element to the PyArray_Descr structure. Remember, it was growing out of the old PyArray_Descr already, and I was a little concerned about changing it too much (there are ramifications of changing this structure in several places).

So, instead of adding an "ordered_names" tuple to the dtype object, I just stuck it in the fields dictionary. I agree it's kind of odd sitting there.

It probably deserves a refactoring, pulling that out into a new attribute of the dtype object.

This would mean that the PyArray_Descr structure would need a new object (names, perhaps), and it would need to be tracked.

Not a huge change and probably worth it before the next release.

-Travis
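The separate ordered-names attribute Travis proposes here is what numpy eventually shipped as dtype.names: an ordered tuple of field names, kept apart from the fields dictionary. A quick sketch on a current numpy:

```python
import numpy as np

t = np.zeros(1, dtype=[('x', '>f4'), ('y', '>i4')])

# Ordered tuple of field names, separate from the fields dict:
print(t.dtype.names)   # ('x', 'y')

# And the fields dict now contains exactly the valid names -- the API
# orthogonality Fernando asks for above:
print(sorted(t.dtype.fields) == sorted(t.dtype.names))   # True
```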
It was suggested that I put off this discussion until we were closer to the 1.0 release. Perhaps now is a good time to bring it up once again? The quick summary: accessing field names has some oddness that needs cleaning up.

On 3/15/06, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Fernando Perez wrote:
Mmh, just curious: I wonder about the wisdom of that overloading of a 'magic' key (-1). It will make things like
for name in t.dtype.fields:
No real wisdom. More organic growth. Initially I didn't have an ordered list of fields, but as more complicated data-type descriptors were supported, this became an important thing to have. I should have probably added an additional element to the PyArray_Descr structure. Remember, it was growing out of the old PyArray_Descr already, and I was a little concerned about changing it too much (there are ramifications of changing this structure in several places).
So, instead of adding an "ordered_names" tuple to the dtype object, I just stuck it in the fields dictionary. I agree it's kind of odd sitting there.
It probably deserves a refactoring, pulling that out into a new attribute of the dtype object.
This would mean that the PyArray_Descr structure would need a new object (names perhaps), and it would need to be tracked.
Not a huge change and probably worth it before the next release.
-Travis
On 3/15/06, Travis Oliphant <oliphant@ee.byu.edu> wrote:
The dtype object does contain what you want. In fact, it's the fields attribute of the dtype object that is a dictionary accessed by field name. Thus, to see if a field is a valid field identifier,
if name in t.dtype.fields:
would work (well, there is a slight problem in that -1 is a special key to the dictionary that returns a list of field names ordered by offset, and so would work also), but if you know that name is already a string, then no problem.
Yes, I see, but I think you meant

    if name in t.dtype.fields.keys():

which contains the -1 as a key.

    In [275]: t.dtype.fields.keys()
    Out[275]: ('tag1', 'tag2', -1)

So, since the last key points to the names, one can also do:

    In [279]: t.dtype.fields[-1]
    Out[279]: ('tag1', 'tag2')

which is not transparent, but does what you need. For now, I'll probably just write a simple function to wrap this.

Thanks, Erin
Erin Sheldon wrote:
On 3/15/06, Travis Oliphant <oliphant@ee.byu.edu> wrote:
The dtype object does contain what you want. In fact, it's the fields attribute of the dtype object that is a dictionary accessed by field name. Thus, to see if a field is a valid field identifier,
if name in t.dtype.fields:
would work (well, there is a slight problem in that -1 is a special key to the dictionary that returns a list of field names ordered by offset, and so would work also), but if you know that name is already a string, then no problem.
Yes, I see, but I think you meant
if name in t.dtype.fields.keys():
Actually, you can use this with the dictionary itself (no need to get the keys...)

    name in t.dtype.fields

is equivalent to

    name in t.dtype.fields.keys()

-Travis
Erin Sheldon wrote:
Yes, I see, but I think you meant
if name in t.dtype.fields.keys():
No, he really meant:

    if name in t.dtype.fields:

dictionaries are iterators, so you don't need to construct the list of keys separately. It's just a redundant waste of time and memory in most cases, unless you intend to modify the dict in your loop, in which case the iterator approach won't work and you /do/ need the explicit keys() call.

In addition,

    if name in t.dtype.fields

is faster than:

    if name in t.dtype.fields.keys()

While the keys() version is an O(N) list scan, the dict lookup is O(1) on average: the first requires a single call to the hash function on 'name' and then a C lookup in the dict's internal hash table, while the second is a direct walkthrough of a list with python-level equality testing.

    In [15]: nkeys = 1000000
    In [16]: dct = dict(zip(keys,[None]*len(keys)))
    In [17]: time bool(-1 in keys)
    CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
    Wall time: 0.01
    Out[17]: False
    In [18]: time bool(-1 in dct)
    CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
    Wall time: 0.00
    Out[18]: False

In realistic cases for your original question you are not likely to see the difference, but it's always a good idea to be aware of the performance characteristics of various approaches. For a different problem, there may well be a real difference.

Cheers, f
Fernando Perez wrote:
In addition
if name in t.dtype.fields
is faster than:
if name in t.dtype.fields.keys()
While the keys() version is an O(N) list scan, the dict lookup is O(1) on average: the first requires a single call to the hash function on 'name' and then a C lookup in the dict's internal hash table, while the second is a direct walkthrough of a list with python-level equality testing.
[ sorry, copy-pasted wrong timing run ]

    In [1]: nkeys = 5000000
    In [2]: keys = range(nkeys)
    In [3]: dct = dict(zip(keys,[None]*len(keys)))
    In [4]: time bool(-1 in dct)
    CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
    Wall time: 0.00
    Out[4]: False
    In [5]: time bool(-1 in keys)
    CPU times: user 0.32 s, sys: 0.00 s, total: 0.32 s
    Wall time: 0.33
    Out[5]: False

Cheers, f
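Fernando's measurement can be reproduced with the stdlib timeit module; a small sketch (the absolute numbers are machine-dependent — only the dict-vs-list ratio matters):

```python
import timeit

# -1 is absent from both containers, so the list scan is worst-case
# O(N) while the dict lookup is a single O(1) hash probe.
nkeys = 100000
keys = list(range(nkeys))
dct = dict.fromkeys(keys)

t_dict = timeit.timeit('-1 in dct', globals=globals(), number=100)
t_list = timeit.timeit('-1 in keys', globals=globals(), number=100)

print(t_dict < t_list)   # the dict lookup wins
```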
Nice. Python decides to compare with the keys and not the values. The possibilities for obfuscation are endless.

On 3/15/06, Fernando Perez <Fernando.Perez@colorado.edu> wrote:
Erin Sheldon wrote:
Yes, I see, but I think you meant
if name in t.dtype.fields.keys():
No, he really meant:
if name in t.dtype.fields:
dictionaries are iterators, so you don't need to construct the list of keys separately. It's just a redundant waste of time and memory in most cases, unless you intend to modify the dict in your loop, in which case the iterator approach won't work and you /do/ need the explicit keys() call.
In addition
if name in t.dtype.fields
is faster than:
if name in t.dtype.fields.keys()
While the keys() version is an O(N) list scan, the dict lookup is O(1) on average: the first requires a single call to the hash function on 'name' and then a C lookup in the dict's internal hash table, while the second is a direct walkthrough of a list with python-level equality testing.
In [15]: nkeys = 1000000
In [16]: dct = dict(zip(keys,[None]*len(keys)))
In [17]: time bool(-1 in keys)
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01
Out[17]: False

In [18]: time bool(-1 in dct)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00
Out[18]: False
In realistic cases for your original question you are not likely to see the difference, but it's always a good idea to be aware of the performance characteristics of various approaches. For a different problem, there may well be a real difference.
Cheers,
f
Erin Sheldon wrote:
Nice. Python decides to compare with the keys and not the values.
Sure. It is ridiculously common to ask a dictionary if it has a record for a particular key. It is much, much rarer to ask one if it has a particular value. Lists, tuples, and sets, on the other hand, only have one kind of interesting data, the values, so the __contains__ method operates on values with them. Practicality beats purity, in this case.
The possibilities for obfuscation are endless.
Not in my experience. -- Robert Kern robert.kern@gmail.com "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
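Robert's point about __contains__ in runnable form: for a dict, `in` tests keys; for lists, tuples, and sets it tests values.

```python
d = {'a': 1, 'b': 2}

print('a' in d)          # True  -- key lookup
print(1 in d)            # False -- values are not searched
print(1 in d.values())   # True  -- explicit value search
print(1 in [1, 2])       # True  -- list membership tests values
```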
On Mar 15, 2006, at 6:41 PM, Erin Sheldon wrote:
On 3/15/06, Perry Greenfield <perry@stsci.edu> wrote:
You are right that this is messy. We would like to change this sometime. But we'd like to complete the transition to numpy first before doing that so it may be some months before we can (and it may not look quite like what you suggest). But your point is very valid.
Thanks, Perry
Erin

P.S. If it is manpower that is preventing some of the simple things like this from being implemented, I could volunteer some time.
Ultimately, yes. But there's a little more to it. We've felt there are a number of improvements that can be made, and that these might benefit from thinking more globally about how to do them rather than adding a few convenience functions. Then there is the issue of whether these changes are added to the numarray version, etc. But help would be welcomed. I'd say wait a little bit until we put out an alpha and then start some discussion on how it should be improved. The astropy mailing list is a better forum for that than this one, though.

Thanks, Perry
participants (5)

- Erin Sheldon
- Fernando Perez
- Perry Greenfield
- Robert Kern
- Travis Oliphant