I'm working on some two-dimensional tables of data, where some data are numerical, while others aren't. I'd like to use numarray's numerical capabilities with the numerical parts (columns) while keeping the data in each row together. (I'm sure this generalizes to more dimensions, and to sub-arrays in general, not just rows.)

It's not a hard problem, really, but the obvious solution--to keep the other rows in separate arrays/lists and just juggle things around--seems a bit clunky. I was just wondering if anyone had other ideas (would it be practical to include all the data in a single array somehow--I seem to recall that Numeric could have arbitrary object arrays, but I'm not sure whether numarray supports this?) or perhaps some hints on how to organize code around this? I wrote a small class that wraps things up and works a bit like R/S-plus's data frames; is there some other more standard code out there for this sort of thing? (It's a problem that crops up often in data processing of various kinds...)

Thanks,

Magnus

-- Magnus Lie Hetland http://hetland.org
On Fri, 2002-12-27 at 11:29, Magnus Lie Hetland wrote:
I'm working on some two-dimensional tables of data, where some data are numerical, while others aren't. I'd like to use numarray's numerical capabilities with the numerical parts (columns) while keeping the data in each row together. (I'm sure this generalizes to more dimensions, and to sub-arrays in general, not just rows.)
It's not a hard problem, really, but the obvious solution--to keep the other rows in separate arrays/lists and just juggle things around--seems a bit clunky. I was just wondering if anyone had other ideas (would it be practical to include all the data in a single array somehow--I seem to recall that Numeric could have arbitrary object arrays, but I'm not sure whether numarray supports this?) or perhaps some hints on how to organize code around this? I wrote a small class that wraps things up and works a bit like R/S-plus's data frames; is there some other more standard code out there for this sort of thing? (It's a problem that crops up often in data processing of various kinds...)
Have a look at the discussion on RecordArrays in this overview of Numarray: http://stsdas.stsci.edu/numarray/DesignOverview.html

However, in the meantime, as you note, it's not too hard to write a class which emulates R/S-Plus data frames. Just store each column in its own Numeric array of the appropriate type (which might be the PyObject types, which can hold any Python object type), and have the wrapper class implement __getitem__ etc. to collect the relevant "rows" from each column and return them as a complete row, as a dict or a sequence. Not that fast, but not slow either. You can implement a generator to allow cursor-like traversal of all the rows if you like.

Happy to collaborate on furthering this idea.

By memory-mapping disc-based versions of the Numeric arrays, and using the BsdDb3 record number database format for the string columns, you can even make a disc-based "record array" which can be larger than available RAM+swap. I hope to release code written under contract by Dave Cole (see http://www.object-craft.com.au ) which illustrates this idea in the next month or so (but I've been saying that to myself for a year or more...).

Tim C
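The wrapper class described above could look roughly like the following. This is only a minimal sketch, not code from the thread: the class name ColumnTable and its rows() method are hypothetical, and the standard library's array module plus plain lists stand in for typed Numeric arrays and PyObject arrays.

```python
from array import array


class ColumnTable:
    """Hypothetical data-frame-like wrapper: one container per column."""

    def __init__(self, **columns):
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._columns = dict(columns)
        self._nrows = lengths.pop() if lengths else 0

    def __len__(self):
        return self._nrows

    def __getitem__(self, key):
        # A string selects a whole column; an integer collects one row.
        if isinstance(key, str):
            return self._columns[key]
        if not -self._nrows <= key < self._nrows:
            raise IndexError("row index out of range")
        return {name: col[key] for name, col in self._columns.items()}

    def rows(self):
        # Generator for cursor-like traversal of all rows.
        for i in range(self._nrows):
            yield self[i]


table = ColumnTable(
    name=["ann", "bob", "cy"],           # non-numeric column: plain list
    age=array("i", [31, 45, 27]),        # typed integer column
    score=array("d", [1.5, 2.5, 4.0]),   # typed double column
)
first = table[0]                                  # one row, as a dict
mean_score = sum(table["score"]) / len(table)     # work on a numeric column
names = [row["name"] for row in table.rows()]     # cursor-like traversal
```

Each column keeps its own type, so numerical work happens on the typed columns while a row lookup gathers one value from every column.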
Tim Churches <tchur@optushome.com.au>: [snip]
Have a look at the discussion on RecordArrays in this overview of Numarray: http://stsdas.stsci.edu/numarray/DesignOverview.html
Sounds interesting.
However, in the meantime, as you note, it's not too hard to write a class which emulates R/S-Plus data frames. Just store each column in its own Numeric array of the appropriate type
Yeah -- it's just that I'd like to keep a set of columns collected as a two-dimensional array, to allow horizontal summing and the like. (Not much more complicated, but an extra issue to address.)
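As a rough illustration of the "horizontal summing" mentioned here, the numeric columns can be grouped and summed row by row. This is a sketch under assumptions: the column names q1..q3 and the helper row_sums are made up, and the standard library's array module stands in for a real two-dimensional numarray array.

```python
from array import array

# Hypothetical numeric columns that belong together as one 2-D block.
numeric_block = {
    "q1": array("d", [1.0, 2.0, 3.0]),
    "q2": array("d", [10.0, 20.0, 30.0]),
    "q3": array("d", [100.0, 200.0, 300.0]),
}


def row_sums(block):
    """Sum across the columns for each row (the "horizontal" direction)."""
    return array("d", (sum(values) for values in zip(*block.values())))


def column_sums(block):
    """Sum down each column (the "vertical" direction)."""
    return {name: sum(col) for name, col in block.items()}


totals = row_sums(numeric_block)
per_column = column_sums(numeric_block)
```

With a true two-dimensional array the same result would come from summing along one axis; the sketch only shows the bookkeeping needed when the columns are stored separately.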
(which might be the PyObject types, which can hold any Python object type),
Hm. Yes. I can't seem to find these anymore. I seem to recall using type='o' or something in Numeric, but I can't find the right type objects now... (Guess I'm just reading the docs and dir(numeric) poorly...) It would be nice if array(['foo']) just worked. Oh, well. [snip]
Happy to collaborate on furthering this idea.
That would be great (even though I don't really have any time to use for this -- it's just a really tiny part of a small project I'm working on :)
By memory-mapping disc-based versions of the Numeric arrays, and using the BsdDb3 record number database format for the string columns, you can even make a disc-based "record array" which can be larger than available RAM+swap.
Sounds quite useful, although quite similar to MetaKit. (I suppose I could use some functions from numarray on columns in MetaKit... But that might just be too weird -- and it would still just be a collection of columns :]) [snip] Thanks for your input. -- Magnus Lie Hetland http://hetland.org
On Fri, 2002-12-27 at 12:55, Magnus Lie Hetland wrote:
Tim Churches <tchur@optushome.com.au>: [snip]
Have a look at the discussion on RecordArrays in this overview of Numarray: http://stsdas.stsci.edu/numarray/DesignOverview.html
Sounds interesting.
However, in the meantime, as you note, it's not too hard to write a class which emulates R/S-Plus data frames. Just store each column in its own Numeric array of the appropriate type
Yeah -- it's just that I'd like to keep a set of columns collected as a two-dimensional array, to allow horizontal summing and the like. (Not much more complicated, but an extra issue to address.)
(which might be the PyObject types, which can hold any Python object type),
Hm. Yes. I can't seem to find these anymore. I seem to recall using type='o' or something in Numeric, but I can't find the right type objects now... (Guess I'm just reading the docs and dir(numeric) poorly...) It would be nice if array(['foo']) just worked. Oh, well.
Just like this:
>>> import Numeric
>>> a = Numeric.array(['a','b','c'],typecode=Numeric.PyObject)
>>> a
array([a , b , c ],'O')
By memory-mapping disc-based versions of the Numeric arrays, and using the BsdDb3 record number database format for the string columns, you can even make a disc-based "record array" which can be larger than available RAM+swap.
Sounds quite useful, although quite similar to MetaKit. (I suppose I could use some functions from numarray on columns in MetaKit... But that might just be too weird -- and it would still just be a collection of columns :])
I really like MetaKit's column-based storage, but it just doesn't scale well (on the author's admission, and verified empirically) - beyond a few 10**5 records, it bogs down terribly, whereas memory-mapped NumPy plus a BsdDb3 recno database for strings scales well to many tens of millions of records (or more, but that's as far as I have tested).

Tim C
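The disc-based layout described above can be sketched with the standard library alone. The assumptions are significant: mmap and struct stand in for memory-mapped Numeric arrays, dbm.dumb stands in for the BsdDb3 recno format, and the file names and the read_record helper are invented for illustration. It says nothing about how the real code performs.

```python
import dbm.dumb
import mmap
import os
import struct
import tempfile

workdir = tempfile.mkdtemp()
numeric_path = os.path.join(workdir, "age.f8")   # hypothetical file names
strings_path = os.path.join(workdir, "name")

names = ["ann", "bob", "cy"]
ages = [31.0, 45.0, 27.0]

# Numeric column: raw native doubles on disc, one per record.
with open(numeric_path, "wb") as f:
    f.write(struct.pack(f"{len(ages)}d", *ages))

# String column: a key/value database keyed by record number.
with dbm.dumb.open(strings_path, "c") as db:
    for recno, name in enumerate(names):
        db[str(recno)] = name.encode("utf-8")


def read_record(recno):
    """Fetch one logical row without loading either column into memory."""
    with open(numeric_path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
            dbm.dumb.open(strings_path, "r") as db:
        offset = recno * struct.calcsize("d")
        (age,) = struct.unpack_from("d", mm, offset)
        return db[str(recno)].decode("utf-8"), age


record = read_record(1)
```

Only the pages of the numeric file that are actually touched get read, which is what lets such a table grow beyond available memory.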
Tim Churches <tchur@optushome.com.au>: [snip]
Just like this:
>>> import Numeric
>>> a = Numeric.array(['a','b','c'],typecode=Numeric.PyObject)
>>> a
array([a , b , c ],'O')
As you may have noticed from my previous descriptions, I'm using numarray, not Numeric. I've used this in Numeric before -- I just can't find the equivalent functionality in numarray :) [snip]
I really like MetaKit's column-based storage,
Me too.
but it just doesn't scale well (on the author's admission, and verified empirically)
Yes, you're right.
- beyond a few 10**5 records, it bogs down terribly, whereas memory-mapped NumPy plus a BsdDb3 recno database for strings scales well to many tens of millions of records (or more, but that's as far as I have tested).
Impressive! Now this *does* sound interesting... The project I originally posted about only has a few hundred records, so I'm only considering numarray for expressiveness/readability there -- performance is not an issue. But using bsddb and numarray (or Numeric) together like this seems useful in many applications.
-- Magnus Lie Hetland http://hetland.org
Magnus Lie Hetland writes:
Tim Churches <tchur@optushome.com.au>: [snip]
Just like this:
>>> import Numeric
>>> a = Numeric.array(['a','b','c'],typecode=Numeric.PyObject)
>>> a
array([a , b , c ],'O')
As you may have noticed from my previous descriptions, I'm using numarray, not Numeric. I've used this in Numeric before -- I just can't find the equivalent functionality in numarray :)
At the moment, PyObject arrays are not supported (mainly because it hasn't been a priority for our needs yet). But if all one needs is an array that holds PyObjects for associative purposes, with the ability to set and get array values and nothing more (we had envisioned more sophisticated uses, such as applying object methods to the array and returning arrays of the results), it should be quite easy to add this capability. In fact, one could subclass NDArray, define the _get and _setitem methods (I forget the exact names), probably customize __init__, and have the functionality that you need. I can take a look at it next week (or if you feel bold, look at NDArray yourself).

As with Numeric, speed is sacrificed when using such arrays. The presumption is that one is using Numeric or numarray on such things mainly for the convenience of the array manipulations, not the kind of efficiency that bulk numerical operations provide.

Combining that with RecordArrays may be a bit trickier, in the sense that RecordArrays presume that records use the same buffer for all data. If one doesn't mind storing PyObject pointers in that data array, it is probably also fairly simple to extend it (but I frankly haven't thought this through, so I may be wrong about how easy it is). Doing this may require some thought about how to pickle such arrays.

Of course, one may have a set of arrays, as Tim suggests, which also acts like a record array where there is no single data buffer. Our RecordArrays were intended to map to data files closely, but other variants are certainly possible.

Perry Greenfield
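To make the idea of a get/set-only object array concrete, here is a small standalone sketch. It does not subclass NDArray and does not use numarray's real method names (which the message above leaves uncertain); the class ObjectArray and its call_method helper are hypothetical.

```python
class ObjectArray:
    """Hypothetical fixed-shape array holding arbitrary Python objects."""

    def __init__(self, shape, fill=None):
        self.shape = tuple(shape)
        size = 1
        for n in self.shape:
            size *= n
        self._data = [fill] * size      # flat, row-major storage

    def _offset(self, index):
        # Translate an N-dimensional index into a flat row-major offset.
        if not isinstance(index, tuple):
            index = (index,)
        if len(index) != len(self.shape):
            raise IndexError("need one index per dimension")
        offset = 0
        for i, n in zip(index, self.shape):
            if not 0 <= i < n:
                raise IndexError("index out of bounds")
            offset = offset * n + i
        return offset

    def __getitem__(self, index):
        return self._data[self._offset(index)]

    def __setitem__(self, index, value):
        self._data[self._offset(index)] = value

    def call_method(self, name, *args):
        """Apply a method to every element; return a same-shape array."""
        result = ObjectArray(self.shape)
        result._data = [getattr(obj, name)(*args) for obj in self._data]
        return result


cells = ObjectArray((2, 2), fill="")
cells[0, 0] = "foo"
cells[1, 1] = "bar"
upper = cells.call_method("upper")   # the "more sophisticated" use
```

Element access goes through ordinary Python calls, which is why such arrays trade speed for convenience.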
Quoting Magnus Lie Hetland <magnus@hetland.org>:
I'm working on some two-dimensional tables of data, where some data are numerical, while other aren't. I'd like to use numarray's numerical capabilities with the numerical parts (columns) while keeping the data in each row together. (I'm sure this generalizes to more dimensions, and to sub-arrays in general, not just rows.)
You may want to have a look at PyTables (http://pytables.sourceforge.net). It's designed to be used in scenarios similar to the one you describe. It supports Numeric objects, and although columns in tables are not automatically converted to Numeric or numarray objects, you can build them on the fly easily using its powerful selection capabilities.

It uses the HDF5 (http://hdf.ncsa.uiuc.edu/HDF5/) format to save its data, so you can read PyTables files in a variety of languages and platforms.

Cheers,

Francesc Alted
Quoting Perry Greenfield <perry@stsci.edu>:
Our RecordArrays were intended to map to data files closely, but other variants are certainly possible.
In fact, I'm thinking of adopting numarray for my pytables project, but I don't like the fact that data is not natively aligned inside recarrays, i.e. there is no gap between the different fields even if the datatypes don't match the "native" architecture alignment. IMO this can greatly affect read/write efficiency when one wants to work with data rows or columns of recarray objects.

Are there any plans to support this "natural" alignment in addition to the non-aligned scheme present right now?

Francesc Alted
Francesc Alted <falted@openlc.org>: [snip]
You may want to have a look at PyTables (http://pytables.sourceforge.net). It's designed to be used in scenarios similar to the one you describe. [snip]
Sounds interesting. I'll look into it. -- Magnus Lie Hetland http://hetland.org
In fact, I'm thinking of adopting numarray for my pytables project, but I don't like the fact that data is not natively aligned inside recarrays, i.e. there is no gap between the different fields even if the datatypes don't match the "native" architecture alignment. IMO this can greatly affect read/write efficiency when one wants to work with data rows or columns of recarray objects.
Are there any plans to support this "natural" alignment in addition to the non-aligned scheme present right now?
Are you asking for an option to create record arrays with aligned fields (in the sense that the addresses of all values are consistent with their type)? Or are you arguing that non-aligned columns must be prohibited? The former is certainly possible (and not very difficult to implement; basically it requires that record sizes must be a multiple of the largest numerical type, and that padding is placed within records to ensure that all fields have offsets that are a multiple of their size). We cannot accept the latter since we need to access data that are stored in such a non-aligned manner in data files.
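The padding rule stated in parentheses above can be worked through in a few lines. This is a sketch of that rule only, not of numarray's record arrays; the function names and the example field sizes are assumptions made for illustration.

```python
def packed_layout(field_sizes):
    """Offsets and record size with no padding at all."""
    offsets, position = [], 0
    for size in field_sizes:
        offsets.append(position)
        position += size
    return offsets, position


def aligned_layout(field_sizes):
    """Offsets and record size where every offset is a multiple of its
    field size and the record size is a multiple of the largest field."""
    offsets, position = [], 0
    for size in field_sizes:
        position += -position % size     # padding before this field
        offsets.append(position)
        position += size
    position += -position % max(field_sizes)   # trailing padding
    return offsets, position


# Hypothetical record: int16, float64, int8, int32 (sizes in bytes).
sizes = [2, 8, 1, 4]
packed = packed_layout(sizes)     # fields butt up against each other
aligned = aligned_layout(sizes)   # fields start on multiples of their size
```

For this record the aligned form needs 24 bytes against 15 for the packed form, which is the space cost being weighed against access speed.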
Thanks, Perry Greenfield
Quoting Perry Greenfield <perry@stsci.edu>:
Are you asking for an option to create record arrays with aligned fields (in the sense that the addresses of all values are consistent with their type)?
Yes, I'm advocating for that
Or are you arguing that non-aligned columns must be prohibited? The former is certainly possible (and not very difficult to implement; basically it requires that record sizes must be a multiple of the largest numerical type, and that padding is placed within records to ensure that all fields have offsets that are a multiple of their size).
Well, for the sake of keeping the size of the dataset to a minimum, I don't think it's necessary to adjust all the record field sizes to the largest data type, because the padding needed can be shorter or longer depending on the type of the field. For example, short ints only need to be aligned on a two-byte boundary, while doubles need 4 bytes (or 8, I don't remember well). In any case, this depends on the architecture. But it is still possible to figure out safely what the minimum required alignment is for each type. Look at Python's struct module for a good example of how you can reduce the padding to a minimum without sacrificing performance.

Francesc Alted
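The struct module mentioned above makes the difference easy to inspect. In this short sketch the field list is a made-up example; the "=" prefix asks for standard sizes with no padding, "@" asks for the platform's native sizes and alignment, and the native figures vary by architecture, so none are quoted here.

```python
import struct

fields = "hdbi"   # hypothetical record: short, double, signed char, int

# Standard sizes, no padding between fields.
packed_size = struct.calcsize("=" + fields)

# Native sizes with the padding the platform's C compiler would insert.
native_size = struct.calcsize("@" + fields)

# Native mode adds no trailing padding on its own; ending the format with
# a zero-count code pads the record to that type's alignment, so that
# consecutive records in a buffer stay aligned.
native_record_size = struct.calcsize("@" + fields + "0d")

print(packed_size, native_size, native_record_size)
```

Comparing the three numbers on a given machine shows how much padding native alignment actually costs for a particular record layout.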
participants (4)
- Francesc Alted
- Magnus Lie Hetland
- Perry Greenfield
- Tim Churches