Pierre (or anyone else who cares to chime in),

I'm using stack_arrays to combine data from two different files into a single array. In one of these files, the data for one entire field comes back missing, which, thanks to your recent change, ends up having a boolean dtype. There is actual data for this same field in the second file, so there it ends up with a dtype of float64. When I try to combine the two arrays, I end up with the following traceback:

data = stack_arrays((old_data, data))
  File "/home/rmay/.local/lib64/python2.5/site-packages/metpy/cbook.py", line 260, in stack_arrays
    output = ma.masked_all((np.sum(nrecords),), newdescr)
  File "/home/rmay/.local/lib64/python2.5/site-packages/numpy/ma/extras.py", line 79, in masked_all
    a = masked_array(np.empty(shape, dtype),
ValueError: two fields with the same name

Which is unsurprising. Do you think there is any reasonable way to get stack_arrays() to find a common dtype for fields with the same name? Or another suggestion on how to approach this? If you think coercing one/both of the fields to a common dtype is the way to go, just point me to a function that could figure out the dtype and I'll try to put together a patch.

Thanks,
Ryan

P.S. Thanks so much for your work on putting those utility functions in recfunctions.py. It makes it so much easier to have these functions available in the library itself rather than needing to reinvent the wheel over and over.

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
[Some background: we're talking about numpy.lib.recfunctions, a set of functions to manipulate structured arrays]

Ryan,

If the two files have the same structure, you can use that fact and specify the dtype of the output directly with the dtype parameter of mafromtxt. That way, you're sure that the two arrays will have the same dtype. If you don't know the structure beforehand, you could load one array first and use its dtype as the dtype input of mafromtxt when loading the second one.

Now, we could also try to modify stack_arrays so that it takes the largest dtype when several fields have the same name. I'm not completely satisfied by this approach, as it makes dtype conversions under the hood. Maybe we could provide the functionality as an option (with a forced_conversion boolean input parameter)?
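For instance, something along these lines (the file names are just placeholders; mafromtxt is simply genfromtxt with usemask=True, so the same idea works with either function):

import numpy as np

# Load the first file, letting mafromtxt infer the dtype (dtype=None).
first = np.mafromtxt('first_file.txt', dtype=None)
# Reuse that dtype for the second file, so an all-missing column cannot
# fall back to bool and the two arrays stay compatible for stack_arrays.
second = np.mafromtxt('second_file.txt', dtype=first.dtype)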
I'm a bit surprised by the error message you get. If I try:

a = ma.array([(1, 2, 3)], mask=[(0, 1, 0)], dtype=[('a', int), ('b', bool), ('c', float)])
b = ma.array([(4, 5, 6)], dtype=[('a', int), ('b', float), ('c', float)])
test = np.stack_arrays((a, b))
I get a TypeError instead (the field 'b' doesn't have the same type in a and b). Now, I get the 'two fields with the same name' error when I use np.merge_arrays (with the flatten option). Could you send a small example?
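For reference, a small sketch of the merge_arrays situation I mean (the exact error text may differ between numpy versions):

import numpy as np
from numpy.lib.recfunctions import merge_arrays

a = np.array([(1,)], dtype=[('x', int)])
b = np.array([(2.0,)], dtype=[('x', float)])
# Flattening concatenates the field lists of the inputs, so the name 'x'
# shows up twice in the output dtype and building it raises a ValueError
# (reported as "two fields with the same name" on older numpy).
merge_arrays((a, b), flatten=True)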
P.S. Thanks so much for your work on putting those utility functions in recfunctions.py. It makes it so much easier to have these functions available in the library itself rather than needing to reinvent the wheel over and over.
Indeed. Note that most of the job had been done by John Hunter and the matplotlib developers in their matplotlib.mlab module, so you should thank them and not me. I just cleaned up some of the functions.
Pierre GM wrote:
[Some background: we're talking about numpy.lib.recfunctions, a set of functions to manipulate structured arrays]
Ryan, If the two files have the same structure, you can use that fact and specify the dtype of the output directly with the dtype parameter of mafromtxt. That way, you're sure that the two arrays will have the same dtype. If you don't know the structure beforehand, you could try to load one array and use its dtype as input of mafromtxt to load the second one.
I could force the dtype. However, since the flexibility is there in mafromtxt, I'd like to avoid hard coding the dtype, so I don't have to worry about updating the code if the file format ever changes (this parses live data).
Now, we could also try to modify stack_arrays so that it would take the largest dtype when several fields have the same name. I'm not completely satisfied by this approach, as it makes dtype conversions under the hood. Maybe we could provide the functionality as an option (w/ a forced_conversion boolean input parameter) ?
I definitely wouldn't advocate magic by default, but I think it would be nice to be able to get the functionality if one wanted to. There is one problem I noticed, however. I found common_type and lib.mintypecode, but both raise errors when trying to find a dtype to match both bool and float. I don't know if there's another function somewhere that would work for what I want.
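(For what it's worth, newer numpy releases, after this thread, grew np.promote_types, which does find a common dtype for bool and float; a minimal sketch:)

import numpy as np

# promote_types returns the smallest dtype to which both inputs can be
# safely cast; bool and float64 promote to float64.
common = np.promote_types(np.bool_, np.float64)
print(common)  # float64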
I'm a bit surprised by the error message you get. If I try:
a = ma.array([(1, 2, 3)], mask=[(0, 1, 0)], dtype=[('a', int), ('b', bool), ('c', float)])
b = ma.array([(4, 5, 6)], dtype=[('a', int), ('b', float), ('c', float)])
test = np.stack_arrays((a, b))
I get a TypeError instead (the field 'b' doesn't have the same type in a and b). Now, I get the 'two fields with the same name' error when I use np.merge_arrays (with the flatten option). Could you send a small example?
Apparently, I get my error as a result of my use of titles in the dtype to store an alternate name for the field. (If you're not familiar with titles, they're nice because you can get fields by either name, so for the following example, a['a'] and a['A'] both return array([1]).) The following version of your case gives me the ValueError:
from numpy.lib.recfunctions import stack_arrays
import numpy.ma as ma

a = ma.array([(1, 2, 3)], mask=[(0, 1, 0)],
             dtype=[(('a', 'A'), int), (('b', 'B'), bool), (('c', 'C'), float)])
b = ma.array([(4, 5, 6)],
             dtype=[(('a', 'A'), int), (('b', 'B'), float), (('c', 'C'), float)])
stack_arrays((a, b))

ValueError: two fields with the same name
As a side question, do you have some local mods to your numpy SVN so that some of the functions in recfunctions are available in numpy's top level? On mine, I can't get to them except by importing them from numpy.lib.recfunctions. I don't see any mention of recfunctions in lib/__init__.py.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
On Jan 27, 2009, at 4:23 PM, Ryan May wrote:
I definitely wouldn't advocate magic by default, but I think it would be nice to be able to get the functionality if one wanted to.
OK. Put on the TODO list.
There is one problem I noticed, however. I found common_type and lib.mintypecode, but both raise errors when trying to find a dtype to match both bool and float. I don't know if there's another function somewhere that would work for what I want.
I'm not familiar with these functions, I'll check that.
Apparently, I get my error as a result of my use of titles in the dtype to store an alternate name for the field. (If you're not familiar with titles, they're nice because you can get fields by either name, so for the following example, a['a'] and a['A'] both return array([1]).) The following version of your case gives me the ValueError:
Ah OK. You found a bug. There's a frustrating feature of dtypes: dtype.names doesn't always match [_[0] for _ in dtype.descr].
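A quick illustration of the mismatch with a titled dtype (a sketch; the field names and titles here are arbitrary):

import numpy as np

# Each field is given as ((title, name), type). dtype.names lists only the
# plain names, while dtype.descr keeps the (title, name) pairs, so the two
# don't line up entry-for-entry.
dt = np.dtype([(('a', 'A'), int), (('b', 'B'), float)])
print(dt.names)                         # ('A', 'B')
print([item[0] for item in dt.descr])   # [('a', 'A'), ('b', 'B')]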
As a side question, do you have some local mods to your numpy SVN so that some of the functions in recfunctions are available in numpy's top level?
Probably. I used the develop option of setuptools to install numpy in a virtual environment.
On mine, I can't get to them except by importing them from numpy.lib.recfunctions. I don't see any mention of recfunctions in lib/__init__.py.
Well, till some problems are ironed out, I'm not really in favor of advertising them too much...