Hi there,
We are writing to announce the release of "Tabular", a package of Python modules for working with tabular data.
Tabular is a package of Python modules for working with tabular data. Its main object is the tabarray class, a data structure for holding and manipulating tabular data. By putting data into a tabarray object, you’ll get a representation of the data that is more flexible and powerful than a native Python representation. More specifically, tabarray provides:
-- ultra-fast filtering, selection, and numerical analysis methods, using convenient Matlab-style matrix operation syntax
-- spreadsheet-style operations, including row & column operations, 'sort', 'replace', 'aggregate', 'pivot', and 'join'
-- flexible load and save methods for a variety of file formats, including delimited text (CSV), binary, and HTML
-- helpful inference algorithms for determining formatting parameters and data types of input files
-- support for hierarchical groupings of columns, both as data structures and file formats
You can download Tabular from PyPI (http://pypi.python.org/pypi/tabular/) or alternatively clone our hg repository from bitbucket (http://bitbucket.org/elaine/tabular/). We also have posted tutorial-style Sphinx documentation (http://www.parsemydata.com/tabular/).
The tabarray object is based on the record array object (http://docs.scipy.org/doc/numpy/reference/generated/numpy.recarray.html?highlight=recarray#numpy.recarray) from the Numerical Python package (NumPy, http://numpy.scipy.org/), and Tabular is built to interface well with NumPy in general. Our intended audience is two-fold: (1) Python users who, though they may not be familiar with NumPy, are in need of a way to work with tabular data, and (2) NumPy users who would like to do spreadsheet-style operations on top of their more "numerical" work.
We hope that some of you find Tabular useful!
Best,
Elaine and Dan
Ciao Elaine,
I just quickly browsed through your code. Say, what's the reason behind using np.recarrays instead of just standard ndarrays (with flexible dtype)? Do you really need the overhead of accessing fields as attributes? It looks like you're always accessing fields as items...
Cheers
P.
On Oct 5, 2009, at 5:22 PM, Elaine Angelino wrote:
> We are writing to announce the release of "Tabular", a package of Python modules for working with tabular data.
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
hey pierre -- good question. this is something we debated a while ago (we actually sent a couple of emails over the numpy list about this very topic) when coming up with our design. at the time, there did not seem to be strong opinions either way about using ndarray vs. recarray.
the main reason we went with the recarray over the ndarray is because the recarray has a couple of useful construction functions (e.g. np.rec.fromrecords and np.rec.fromarrays). not only are these functions convenient to use, they have nice data type inference properties which we'd have to rebuild ourselves if we wanted to avoid recarrays entirely.
It would be fairly straightforward to switch from recarray to ndarray if this were really an important thing to do (e.g. if recarray were being deprecated or if most NumPy people have strong feelings about this), and doing so wouldn't modify anything about the tabarray API.
elaine
On Mon, Oct 5, 2009 at 5:47 PM, Pierre GM pgmdevlist@gmail.com wrote:
> Ciao Elaine, I just quickly browsed through your code. Say, what's the reason behind using np.recarrays instead of just standard ndarrays (with flexible dtype). Do you really need the overhead of accessing fields as attributes ? It looks like you're always accessing fields as items... Cheers P.
On Mon, Oct 5, 2009 at 17:16, Elaine Angelino elaine.angelino@gmail.com wrote:
> the main reason we went with the recarray over the ndarray is because the recarray has a couple of useful construction functions (e.g. np.rec.fromrecords and np.rec.fromarrays). not only are these functions convenient to use, they have nice data type inference properties which we'd have to rebuild ourselves if we wanted to avoid recarrays entirely.
Try np.rec.fromrecords(...).view(np.ndarray).
Most likely, we should have versions of those functions that return plain ndarrays. They are quite useful.
Perhaps:

    def fromarrays(..., type=None):
        ...
        if type is not None:
            _array = _array.view(type)
        return _array
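
A runnable variant of the sketch above, for illustration only. `fromarrays_as` and its `view_type` parameter are hypothetical names, not part of NumPy; the helper simply wraps the existing `np.rec.fromarrays` and applies one `.view()` at the end.

```python
import numpy as np

def fromarrays_as(arrays, names, view_type=np.ndarray):
    """Hypothetical helper: build with np.rec.fromarrays, then view once."""
    # np.rec.fromarrays infers a structured dtype from the column data.
    rec = np.rec.fromarrays(arrays, names=names)
    # A single .view() changes the Python-level class of the result.
    return rec.view(view_type)

table = fromarrays_as([[1, 2, 3], ['x', 'y', 'z']], names='id,label')
print(type(table).__name__, table.dtype.names)
```

Passing `view_type=np.recarray` would keep attribute access, so one constructor could serve both camps.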
On Mon, Oct 5, 2009 at 6:36 PM, Robert Kern robert.kern@gmail.com wrote:
> > the main reason we went with the recarray over the ndarray is because the recarray has a couple of useful construction functions (e.g. np.rec.fromrecords and np.rec.fromarrays). not only are these functions convenient to use, they have nice data type inference properties which we'd have to rebuild ourselves if we wanted to avoid recarrays entirely.
> Try np.rec.fromrecords(...).view(np.ndarray).
Hi Robert, thanks for your email. We definitely understand this use of .view(). However, our question is: should we have implemented tabular this way, e.g. in the tabarray constructor, first make a recarray and then view it as an ndarray (and then of course view it as a tabarray)? This would have the effect of eliminating the extra recarray functionality, and some of its overhead as well. Is this the desirable design, or should we stick with recarrays?
(Also, is first casting to recarrays and then viewing as ndarrays more expensive than if we went through ndarray directly?)
> Most likely, we should have versions of those functions that return plain ndarrays. They are quite useful.
> Perhaps
>
>     def fromarrays(..., type=None):
>         ...
>         if type is not None:
>             _array = _array.view(type)
>         return _array
Yes, we definitely agree with you that there should be plain ndarray versions of the fromarrays and fromrecords constructors. The only reason we didn't include a function like your "fromarrays" function in tabular is that we thought it might be a bit hackish for our package, and it seemed like something to be addressed by numpy directly, perhaps at a later time, especially since nobody seemed to dislike recarrays all that much.
In the event that people really think we should switch "tabular" from using recarrays to ndarrays, we would definitely support a discussion of adding these kinds of constructors directly to ndarrays.
Thanks,
Elaine
On Mon, Oct 5, 2009 at 17:52, Elaine Angelino elaine.angelino@gmail.com wrote:
> On Mon, Oct 5, 2009 at 6:36 PM, Robert Kern robert.kern@gmail.com wrote:
> > > the main reason we went with the recarray over the ndarray is because the recarray has a couple of useful construction functions (e.g. np.rec.fromrecords and np.rec.fromarrays). not only are these functions convenient to use, they have nice data type inference properties which we'd have to rebuild ourselves if we wanted to avoid recarrays entirely.
> > Try np.rec.fromrecords(...).view(np.ndarray).
> Hi Robert, thanks your email. We definitely understand this use of .view(). However, our question is, should we have implemented tabular this way, e.g. in the tabarray constructor, first make a recarray and then view it as an ndarray? (and then of course view it as a tabarray).
Do the minimum number of .view()s that you can get away with.
> This would have the effect of eliminating the extra recarray functionality, and some if its overhead as well. Is this the desirable design, or should we stick with recarrays?
Well, what other recarray functionality are you using? I addressed the from*() functions because you said it was the main reason. What are your other reasons?
> (Also, is first casting to recarrays and then viewing as ndarrays more expensive than if we went through ndarray directly?)
The overhead should be minuscule. No data is converted.
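
The "no data is converted" point can be checked directly. This is a small sketch with made-up sample records; it relies only on `np.rec.fromrecords`, `.view()`, and `np.shares_memory`.

```python
import numpy as np

rec = np.rec.fromrecords([(1, 2.5), (2, 3.5)], names='a,b')
plain = rec.view(np.ndarray)

# Same buffer, different Python class: nothing is copied or converted.
print(np.shares_memory(rec, plain))

# A write through the plain view is visible through the recarray.
plain['a'][0] = 99
print(rec.a[0])
```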
> Do the minimum number of .view()s that you can get away with.
I guess our bottom line is that we're still not 100% clear as to the recommendation of the NumPy community regarding whether we should use recarray or ndarray. It seems like recarray has some advantages (e.g. the nice inference functions/constructors, and the fact that some people like the ability to access fields as attributes) as well as some disadvantages (e.g. the overhead).
It definitely wouldn't be difficult to convert tabular to using ndarrays, but is it really desirable? Of course, if we were to do this, having recarray-style constructors for ndarrays directly in NumPy would seem to be a "cleaner" way to do things than either writing our own ndarray versions or casting from recarray to ndarray, but we're happy to do either if changing tabular to ndarray is really desirable.
> Well, what other recarray functionality are you using?
None, in our code. We also thought that since at least some people like using the attribute reference property, perhaps users of tabarrays might too (though we don't personally in our own work). Recarrays still seemed to be supported by NumPy, so it seemed to make sense to use them. But the only functional things in our code are those constructors.
> > (Also, is first casting to recarrays and then viewing as ndarrays more expensive than if we went through ndarray directly?)
But if NumPy decided to include ndarray versions of the from*() constructors in the distribution, would this be achieved by first using the recarray constructor and then viewing as ndarray? Or would something more "direct" be done?
thanks,
e
On Mon, Oct 5, 2009 at 18:15, Elaine Angelino elaine.angelino@gmail.com wrote:
> > Well, what other recarray functionality are you using?
> None, in our code. We also thought that since at least some people like using the attribute reference property, perhaps users of tabarrays might too (though we don't personally in our own work) Recarrays still seemed to be being supported by NumPy, so it seemed to make sense to use them. but the only functional thing in our code are those constructors.
Then I would suggest making tabarrays subclass from ndarray. If you like, provide a tabrecarray that subclasses from both recarray and tabarray so that people who like attribute access can .view() to their heart's content.
> > > (Also, is first casting to recarrays and then viewing as ndarrays more expensive than if we went through ndarray directly?)
> But if NumPy decided to include ndarray versions of the from*() constructors in the distribution, would this be achieved by first using the recarray constructor and then viewing as ndarray? Or would something more "direct" be done?
We would fix the functions to not do any unnecessary .view()s.
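
The class layout suggested above might look roughly like the following. This is a toy sketch, not Tabular's actual implementation: the `colnames` method and the sample data are invented for illustration, and only the two class names come from the thread.

```python
import numpy as np

class tabarray(np.ndarray):
    """Toy stand-in: a structured ndarray with one helper method."""

    def colnames(self):
        # Column names come straight from the structured dtype.
        return list(self.dtype.names)

class tabrecarray(np.recarray, tabarray):
    """Same data and helpers, plus recarray-style attribute access."""

data = np.array([(1, 2.5), (2, 3.5)], dtype=[('id', 'i4'), ('score', 'f8')])
t = data.view(tabarray)      # item access only: t['score']
r = t.view(tabrecarray)      # attribute access as well: r.score

print(t.colnames(), r.score.tolist(), isinstance(r, tabarray))
```

Users who never want attribute access only ever touch `tabarray`; the recarray behaviour is opt-in through a `.view()`.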
On 10/05/2009 06:20 PM, Robert Kern wrote:
> On Mon, Oct 5, 2009 at 18:15, Elaine Angelino elaine.angelino@gmail.com wrote:
> > > Well, what other recarray functionality are you using?
> > None, in our code. We also thought that since at least some people like using the attribute reference property, perhaps users of tabarrays might too (though we don't personally in our own work) Recarrays still seemed to be being supported by NumPy, so it seemed to make sense to use them. but the only functional thing in our code are those constructors.
> Then I would suggest making tabarrays subclass from ndarray. If you like, provide a tabrecarray that subclasses from both recarray and tabarray so that people who like attribute access can .view() to their heart's content.
> > > > (Also, is first casting to recarrays and then viewing as ndarrays more expensive than if we went through ndarray directly?)
> > But if NumPy decided to include ndarray versions of the from*() constructors in the distribution, would this be achieved by first using the recarray constructor and then viewing as ndarray? Or would something more "direct" be done?
> We would fix the functions to not do any unnecessary .view()s.
Hi Elaine, I do want to look more at what you have done as some of the features are very interesting.
This discussion raises the question: what did you find missing in numpy that you have included in the tabular package? In particular, is there a set of functions that you think could be added to numpy, or that could even form a 'better' recarray class? There are real advantages to having at least the core components in numpy.
Bruce
On Mon, Oct 5, 2009 at 7:20 PM, Robert Kern robert.kern@gmail.com wrote:
> On Mon, Oct 5, 2009 at 18:15, Elaine Angelino elaine.angelino@gmail.com wrote:
> Then I would suggest making tabarrays subclass from ndarray.
Ok, done. We did it using the from*() function design you suggested. In the future, if there are more direct from*() functions working directly on ndarrays we'd want to switch to those of course.
While implementing the change, we were reminded of another difference between ndarray and recarray, namely that the constructor of ndarray doesn't accept "names" or "formats" parameters while the recarray constructor does (i.e. you have to specify `dtype` in the ndarray constructor). This feature of the recarray constructor was useful for our purposes, since one of the goals of tabular is providing 'easy' construction methods. We've retained this feature, even though we've switched to subclassing ndarray.
There must be a good reason why ndarray does not accept "names" or "formats" parameters and forces the use of the more explicit and unambiguous "dtype". I guess it's "cleaner" in some sense, since the formats parameter is necessarily more limited. It does make sense to have a strongly unambiguous interface for a cornerstone method like np.ndarray.__new__.
That said, I think it also makes sense to have more flexible interfaces too, even if they're sometimes more ambiguous (this is part of the purpose of tabular, see http://www.parsemydata.com/tabular/reference/organization.html#design-philos... ).
Thanks for the help,
elaine
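
The constructor difference described in this message can be seen side by side. The column names and formats below are made up for illustration; the dict form of `np.dtype` is one standard way to turn separate "names" and "formats" lists into an explicit dtype.

```python
import numpy as np

names = ['id', 'score']
formats = ['i4', 'f8']

# recarray's constructor takes names/formats directly.
rec = np.recarray((2,), formats=formats, names=names)

# ndarray's constructor only takes dtype, so the same layout has to be
# spelled out as an explicit dtype first.
dt = np.dtype({'names': names, 'formats': formats})
plain = np.ndarray((2,), dtype=dt)

print(rec.dtype.names, plain.dtype.names)
```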
On Mon, Oct 5, 2009 at 5:22 PM, Elaine Angelino elaine.angelino@gmail.com wrote:
> We are writing to announce the release of "Tabular", a package of Python modules for working with tabular data.
I briefly looked at the sphinx docs and the code. Tabular looks pretty useful and the code can be partially read as recipes for working with recarrays or structured arrays. Thanks for the choice of license (it makes looking at the code "legal").
I didn't see any explicit nan handling. Are missing values allowed e.g. in the constructor?
I looked a bit closer at functions like tabular.fast.recarrayisin since I always have problems with these row operations. Are these functions supposed to work with arbitrary structured arrays? The tests are only for 1d integer arrays. With floats the default string representation doesn't sort correctly. Or am I misreading the function?
>>> arr = np.array([6,1,2,1e-13,0.5*1e-14,1,2e25,3,0,7]).view([('',float)]*2)
>>> arr
array([(6.0, 1.0), (2.0, 1e-013), (5e-015, 1.0), (2.0000000000000002e+025, 3.0), (0.0, 7.0)], dtype=[('f0', '<f8'), ('f1', '<f8')])
>>> np.sort([str(l) for l in arr])
array(['(0.0, 7.0)', '(2.0, 1e-013)', '(2.0000000000000002e+025, 3.0)', '(5e-015, 1.0)', '(6.0, 1.0)'], dtype='|S30')
Being able to do a searchsorted on rows of an array would be a useful feature in numpy. Is there a sortable 1d representation of the rows of a 2d float or mixed type array?
Thanks,
Josef
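
The ordering concern in this message can be reproduced with the same ten values. This sketch uses explicit field names and plain Python `str()` on each row tuple, which is an assumption that differs slightly from the `str(l)` call in the original session; it does not call Tabular itself.

```python
import numpy as np

values = [6, 1, 2, 1e-13, 0.5 * 1e-14, 1, 2e25, 3, 0, 7]
arr = np.array(values).view([('f0', float), ('f1', float)])

# Sorting the string form of each row compares characters, not magnitudes.
by_string = sorted(arr.tolist(), key=str)

# Sorting the structured array compares fields numerically, in field order.
by_value = np.sort(arr).tolist()

print([row[0] for row in by_string])
print([row[0] for row in by_value])
```

In the string ordering, 2e25 lands before 5e-15 because '2' sorts before '5'; the structured sort puts the rows in numerical order.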
On Tue, Oct 6, 2009 at 12:31 PM, josef.pktd@gmail.com wrote:
> With floats the default string representation doesn't sort correctly. Or am I misreading the function?
Maybe this doesn't matter for the purpose of this function. I will download and try the code before I make any more irrelevant comments.
Josef
> I didn't see any explicit nan handling. Are missing values allowed e.g. in the constructor?
No, this is a valid point. We don't handle this as explicitly as we should. Are you mostly talking about nan handling in loading from delimited text files? (Or are you talking about something more general, like integration of masked arrays?) In loading from delimited text files, you can use the "linefixer" and "valuefixer" arguments, which are for more general purposes, and which will get the job done, but slowly. We should do something more specialized for missing values that would be faster.
> Are these function supposed to work with arbitrary structured arrays?
Well, they're only really tested for working with strings, floats, and ints (though only the int tests are included in the test module; we should expand that). I imagine it's possible they'd work with more sophisticated things, but I'm not sure.
> >>> arr = np.array([6,1,2,1e-13,0.5*1e-14,1,2e25,3,0,7]).view([('',float)]*2)
> >>> arr
> array([(6.0, 1.0), (2.0, 1e-013), (5e-015, 1.0), (2.0000000000000002e+025, 3.0), (0.0, 7.0)], dtype=[('f0', '<f8'), ('f1', '<f8')])
> >>> np.sort([str(l) for l in arr])
> array(['(0.0, 7.0)', '(2.0, 1e-013)', '(2.0000000000000002e+025, 3.0)', '(5e-015, 1.0)', '(6.0, 1.0)'], dtype='|S30')
Well, on this example (as in the tests that we did), fast.recarrayisin performed as spec'd. ... But definitely write back again if you think it's failing somewhere.
In general, extending a number of the things in Tabular (e.g. loadSV and saveSV) to arbitrary structured dtypes, as opposed to more basic types, would be great.
Dan