Howdy,
we recently had a discussion about being able to do some common things like reductions and binary operations on recarrays, and there didn't seem to be much consensus on it being needed in the core of numpy.
Since we do actually need this quite pressingly for everyday tasks, we wrote a very simple version of this today. I'm attaching it here in case it proves useful to others.
Basically it lets you do reductions and binary operations on record arrays whose dtype is a simple composite of native ones (example at the end). For our needs it's quite useful, so it may be useful to others as well.
Cheers,
f
Example:
import numpy as np
import recarrutil as ru
dt = np.dtype(dict(names=['x','y'],formats=[float,float]))
x = np.arange(6,dtype=float).reshape(2,3)
y = np.arange(10,16,dtype=float).reshape(2,3)
z = np.empty( (2,3), dt).view(np.recarray)
z.x = x
z.y = y
z
rec.array([[(0.0, 10.0), (1.0, 11.0), (2.0, 12.0)], [(3.0, 13.0), (4.0, 14.0), (5.0, 15.0)]], dtype=[('x', '<f8'), ('y', '<f8')])
ru.mean(z)
rec.array((2.5, 12.5), dtype=[('x', '<f8'), ('y', '<f8')])
ru.mean(z,0)
rec.array([(1.5, 11.5), (2.5, 12.5), (3.5, 13.5)], dtype=[('x', '<f8'), ('y', '<f8')])
ru.mean(z,1)
rec.array([(1.0, 11.0), (4.0, 14.0)], dtype=[('x', '<f8'), ('y', '<f8')])
ru.add(z,z)
rec.array([[(0.0, 20.0), (2.0, 22.0), (4.0, 24.0)], [(6.0, 26.0), (8.0, 28.0), (10.0, 30.0)]], dtype=[('x', '<f8'), ('y', '<f8')])
ru.subtract(z,z)
rec.array([[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]], dtype=[('x', '<f8'), ('y', '<f8')])
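The attached recarrutil module is not reproduced in this thread. As a rough sketch of how a field-by-field reduction of this kind could be written, assuming every field is floating point (so the means fit the input dtype), the hypothetical helper `fieldwise_mean` below reduces each field separately and packs the results back into the same record dtype. It is an illustration only, not the recarrutil implementation.

```python
import numpy as np

def fieldwise_mean(arr, axis=None):
    """Hypothetical helper: mean of every field of a structured array.

    Assumes all fields are floating point, so the means fit the input dtype.
    """
    names = arr.dtype.names
    # Reduce each field as an ordinary ndarray.
    parts = [np.asarray(arr[name]).mean(axis=axis) for name in names]
    # Pack the per-field results into a new array with the same record dtype.
    out = np.empty(np.shape(parts[0]), dtype=arr.dtype)
    for name, part in zip(names, parts):
        out[name] = part
    return out

dt = np.dtype(dict(names=['x', 'y'], formats=[float, float]))
z = np.empty((2, 3), dt)
z['x'] = np.arange(6, dtype=float).reshape(2, 3)
z['y'] = np.arange(10, 16, dtype=float).reshape(2, 3)

total = fieldwise_mean(z)      # a single record: mean over all elements
by_col = fieldwise_mean(z, 0)  # three records: mean over the first axis
```

For the array `z` of the example, `total` holds 2.5 and 12.5 in the `x` and `y` fields, in line with the `ru.mean(z)` output shown above.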
2009/7/30 Fernando Perez fperez.net@gmail.com:
> we recently had a discussion about being able to do some common things like reductions and binary operations on recarrays, and there didn't seem to be much consensus on it being needed in the core of numpy.
> Since we do actually need this quite pressingly for everyday tasks, we wrote a very simple version of this today. I'm attaching it here in case it proves useful to others.
I'm in favour of such a patch, but I'd like to see whether we can't do it at the C level for structured arrays in general.
Regards Stéfan
Are these functions really for a relevant use case of structured arrays? I haven't seen any examples of multidimensional structured arrays, but from a quick reading the code doesn't seem to handle mixed types (it raises an error) or nested structured arrays (I'm not sure), for which I have seen a lot more examples.
I was looking for, or writing, something similar, but only for 1d structured arrays, i.e. a 2d dataset.
For homogeneous dtypes it is relatively easy to create a view on which standard array operations can be applied.
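A minimal sketch of the view approach for the homogeneous case, assuming a 1d structured array whose fields are all float64 and packed without padding. The field names and values are made up for illustration.

```python
import numpy as np

dt = np.dtype([('x', np.float64), ('y', np.float64)])
data = np.zeros(4, dtype=dt)
data['x'] = [1.0, 2.0, 3.0, 4.0]
data['y'] = [10.0, 20.0, 30.0, 40.0]

# All fields share one dtype and the records are packed, so the same memory
# can be reinterpreted as a plain (nobs, nfields) float array without copying.
plain = data.view(np.float64).reshape(len(data), len(dt.names))

col_means = plain.mean(axis=0)  # ordinary array operations now apply

# Because `plain` is a view, writing to it also changes `data`.
plain[0, 0] = 100.0
```

Since no copy is made, this only works when every field has the same dtype; mixed dtypes need an explicit conversion.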
For mixed dtypes, however, I wanted a version that upcasts all numerical types to the highest dtype; in my case the fields are integers and floats, so everything is upcast to float. The integers were categorical data that I want to keep as integers for e.g. np.bincount.
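One way this upcasting conversion might look, as a sketch only: the helper name `to_float_array` and the field names are hypothetical, and the helper assumes a 1d structured array whose fields are all numeric scalars.

```python
import numpy as np

def to_float_array(arr):
    """Hypothetical helper: copy a 1d structured array with numeric
    fields into a 2d float64 array, one column per field."""
    names = arr.dtype.names
    out = np.empty((len(arr), len(names)), dtype=np.float64)
    for j, name in enumerate(names):
        out[:, j] = arr[name]  # integers are upcast to float on assignment
    return out

dt = np.dtype([('group', np.intp), ('value', np.float64)])
data = np.zeros(5, dtype=dt)
data['group'] = [0, 1, 1, 2, 1]
data['value'] = [1.5, 2.0, 4.0, 3.5, 6.0]

floats = to_float_array(data)        # float copy for numerical work
counts = np.bincount(data['group'])  # the original keeps its integer field
```

The structured array itself is left untouched, so the categorical column stays available as integers while the float copy is used for calculations.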
Temporary/conversion array reuse: when many array operations have to be applied to the data of the structured array, it is better to keep a converted copy of the structured array around instead of doing the conversion each time, although, since it's a copy, I used it read-only. For example, calculating in sequence the mean, variance, correlation, deviations from the mean and so on requires only one conversion.
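The reuse pattern could be sketched as follows, under the assumption that all fields are numeric. Marking the copy as non-writeable is one possible way to enforce the read-only use, not something the original code is known to do; the field names and values are illustrative.

```python
import numpy as np

dt = np.dtype([('a', np.int64), ('b', np.float64)])
data = np.zeros(4, dtype=dt)
data['a'] = [1, 2, 3, 4]
data['b'] = [2.0, 4.0, 6.0, 9.0]

# Convert once; column_stack copies the fields and upcasts integers to float.
conv = np.column_stack([data[name] for name in dt.names])
conv.flags.writeable = False  # guard against edits that would not reach `data`

# Every statistic below reuses the same converted array.
mean = conv.mean(axis=0)
dev = conv - mean
var = conv.var(axis=0, ddof=1)
corr = np.corrcoef(conv, rowvar=False)
```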
For me it would have been more useful to have better documentation and helper functions to convert structured arrays to standard arrays for the calculations.
I looked at this mostly for statistical use, where I didn't want the results to be structured arrays, so the recarrutil functions might not be of much use in this case, and a large part of their functionality, e.g. the multidimensional overhead, won't be needed.
The recarray helper functions are useful, and built-in support as Stéfan proposes would be nice. However, since I only recently started to use them, I'm not sure what the relevant structure (dimensionality and dtypes) of structured/rec arrays is. But nested and mixed dtypes seem to be more important than multidimensionality in the examples I have seen.
For example, when we don't have a balanced panel, so that the structured array cannot be reshaped into a rectangular shape according to some variables, reductions and operations like groupby are more useful for data analysis: http://matplotlib.sourceforge.net/api/mlab_api.html#matplotlib.mlab.rec_grou...
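As an illustration of a grouped reduction on an unbalanced panel, here is a small sketch built on np.unique and np.bincount. It is not the mlab function linked above, and the field names `unit` and `value` are made up.

```python
import numpy as np

dt = np.dtype([('unit', np.int64), ('value', np.float64)])
panel = np.zeros(6, dtype=dt)
panel['unit'] = [1, 1, 1, 5, 5, 9]  # unbalanced: 3, 2 and 1 observations
panel['value'] = [1.0, 2.0, 3.0, 10.0, 20.0, 7.0]

# Map the arbitrary unit labels to consecutive codes 0..k-1.
labels, codes = np.unique(panel['unit'], return_inverse=True)

# Grouped count and sum via bincount, then the per-unit mean.
counts = np.bincount(codes)
sums = np.bincount(codes, weights=panel['value'])
group_means = sums / counts
```

Because the groups have different sizes, the data cannot be reshaped to (units, periods), but the grouped reduction still gives one value per unit.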
my 2c, after a brief look at the code
Josef
On Thu, Jul 30, 2009 at 7:55 AM, josef.pktd@gmail.com wrote:
> Are these functions really for a relevant use case of structured arrays? I haven't seen any examples of multidimensional structured arrays, but from a quick reading the code doesn't seem to handle mixed types (it raises an error) or nested structured arrays (I'm not sure), for which I have seen a lot more examples.
In our work, multidimensional record arrays with homogeneous types are a regular occurrence. This code was written for *our* problems, not to be completely general, and it was posted as "if it's useful to you, feel free to use it". I don't have the time/bandwidth to work on this idea for core numpy inclusion.
Your other comments are all equally valid ideas, and all those would be necessary considerations for someone who wants to develop something like this to have full generality. Other things that would need to be done:
- Proper support for broadcasting
- mixed binary ops with scalars or normal arrays
- masked array support
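To make the first two items concrete, here is a rough sketch of a field-wise binary operation that accepts a scalar or a plain array as the second operand and relies on NumPy's ordinary broadcasting. The helper name `fieldwise_binop` is hypothetical, it assumes the results fit the field dtypes of the first operand, and masked arrays are not handled.

```python
import numpy as np

def fieldwise_binop(op, a, b):
    """Hypothetical helper: apply a binary ufunc to each field of `a`.

    `b` may be a structured array with the same field names, a scalar,
    or a plain array that broadcasts against `a`.
    """
    b = np.asarray(b)
    names = a.dtype.names
    parts = []
    for name in names:
        # Use the matching field of a structured operand, else `b` itself.
        other = b[name] if b.dtype.names else b
        parts.append(op(a[name], other))
    # The broadcast result shape is taken from the first field's result.
    out = np.empty(parts[0].shape, dtype=a.dtype)
    for name, part in zip(names, parts):
        out[name] = part
    return out

dt = np.dtype(dict(names=['x', 'y'], formats=[float, float]))
z = np.empty((2, 3), dt)
z['x'] = np.arange(6, dtype=float).reshape(2, 3)
z['y'] = np.arange(10, 16, dtype=float).reshape(2, 3)

shifted = fieldwise_binop(np.add, z, 10.0)                            # scalar
scaled = fieldwise_binop(np.multiply, z, np.array([1.0, 2.0, 3.0]))   # plain array
doubled = fieldwise_binop(np.add, z, z)                               # structured
```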
Perhaps some enterprising soul will come back later with a robust implementation of all this... But I'm afraid that won't be me :) This discussion could serve as a good starting point of simple code and points to keep in mind for such a task, so thanks for the feedback.
Cheers,
f
On Thu, Jul 30, 2009 at 2:41 PM, Fernando Perez fperez.net@gmail.com wrote:
> On Thu, Jul 30, 2009 at 7:55 AM, josef.pktd@gmail.com wrote:
>> Are these functions really for a relevant use case of structured arrays? I haven't seen any examples of multidimensional structured arrays, but from a quick reading the code doesn't seem to handle mixed types (it raises an error) or nested structured arrays (I'm not sure), for which I have seen a lot more examples.
> In our work, multidimensional record arrays with homogeneous types are a regular occurrence. This code was written for *our* problems, not to be completely general, and it was posted as "if it's useful to you, feel free to use it". I don't have the time/bandwidth to work on this idea for core numpy inclusion.
> Your other comments are all equally valid ideas, and all those would be necessary considerations for someone who wants to develop something like this to have full generality. Other things that would need to be done:
> - Proper support for broadcasting
> - mixed binary ops with scalars or normal arrays
> - masked array support
> Perhaps some enterprising soul will come back later with a robust implementation of all this... But I'm afraid that won't be me :) This discussion could serve as a good starting point of simple code and points to keep in mind for such a task, so thanks for the feedback.
> Cheers,
> f
Thanks for the example. I tried to read through the code of the different recarray utility functions to see how you actually use structured arrays for data analysis, instead of just as a storage dictionary. Code examples are the most useful documentation that I have found for this.
Josef
2009/7/30 Stéfan van der Walt stefan@sun.ac.za:
> I'm in favour of such a patch, but I'd like to see whether we can't do it at the C level for structured arrays in general.
That would indeed be ideal. But I should add that I was not proposing it as a patch, rather as a utility others might find useful to keep around. I simply don't have the bandwidth right now to develop this idea to the level where it can be pushed upstream, yet we find these tools very useful for our daily work, so I figured they could also be useful to someone else.
Cheers,
f