Re: [Numpy-discussion] Fast Reading of ASCII files

NOTE: Let's keep this on the list.

On Tue, Dec 13, 2011 at 9:19 AM, denis <denis-bz-gg@t-online.de> wrote:
Chris, unified, consistent save / load is a nice goal
1) header lines with date, pwd etc.: "where'd this come from ?"
# (5, 5) svm.py bz/py/ml/svm 2011-12-13 Dec 11:56 -- automatic
# 80.6 % correct -- user info
245 39 4 5 26 ...
I'm not sure I understand what you are expecting here: what would be automatic? If it parses a datetime in the header, what would it do with it? But anyway, this seems to me:

- very application-specific -- this is for the user's code to write
- not what we are talking about at this point anyway -- I think this discussion is about a lower-level, does-the-simple-things-fast reader -- one that may or may not be able to form the basis of a higher-level, fuller-featured reader.
2) read any CSVs: comma or blank-delimited, with/without column names, a la loadcsv() below
yup -- though the column name reading would be part of a higher-level reader as far as I'm concerned.
3) sparse or masked arrays ?
sparse probably not; that seems pretty domain-dependent to me -- though hopefully one could build such a thing on top of the lower-level reader. Masked support would be good -- once we're convinced what the future of masked arrays in numpy is. I was thinking that the masked-array issue would really be a higher-level feature -- it certainly could be if you need to mask "special value"-style files (i.e. 9999), but we may have to build it into the lower-level reader for cases where the mask is specified by non-numerical values -- i.e. there are some met files that use "MM" or some other text, so you can't put it into a numerical array first.
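For what it's worth, today's genfromtxt can already mask a text sentinel like "MM" -- a minimal, untested sketch (the inline data is made up for illustration):

    import numpy as np
    from io import StringIO

    # two columns of floats; "MM" marks a missing value
    data = StringIO("1.0 2.0\n3.0 MM\nMM 6.0\n")

    arr = np.genfromtxt(data, missing_values="MM", usemask=True)
    # arr is a numpy masked array; the "MM" entries come back masked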
Long-term wishes -- beyond the scope of one file <-> one array, but essential for larger projects:

1) dicts / dotdicts: Dotdict( A=anysizearray, N=scalar ... ) <-> a directory of little files is easy, and better than np.savez. (Haven't used hdf5; I believe Matlab v7 does.)
2) workflows: has anyone there used visTrails ?
outside of the scope of this thread...
Anyway, it seems to me (old grey cynic) that Numpy/scipy developers prefer to code first, spec and doc later. Too pessimistic?
Well, I think many of us believe in a more agile-style approach -- incremental development. But really, as an open-source project, it's really about scratching an itch -- so there is usually a spec in mind for the itch at hand. In this case, however, that has been a weakness -- clearly a number of us have written small solutions to our particular problem at hand, but we haven't arrived at a more general-purpose solution yet. So a bit of spec-ing ahead of time may be called for.

On that: I've been thinking from the bottom up -- imagining what I need for the simple case, and how it might apply to more complex cases -- but maybe we should think about this another way:

What we're talking about here is really core software engineering -- optimization. It's easy to write a simple pure-Python file parser, and reasonable to write a complex one (genfromtxt) -- the issue is performance -- we need some more C (or Cython) code to really speed it up, but none of us wants to write the complex-case code in C. So:

genfromtxt is really nice for many of the complex cases. So perhaps another approach is to look at genfromtxt, and see what high-performance lower-level functionality we could develop that could make it fast -- then we are done.

This actually mirrors exactly what we all usually recommend for Python development in general -- write it in Python, then, if it's really not fast enough, write the bottleneck in C.

So where are the bottlenecks in genfromtxt? Are there self-contained portions that could be re-written in C/Cython?

-Chris
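P.S. A quick way to hunt for those bottlenecks is to profile a genfromtxt run on a biggish file and look at the top of the report -- a rough sketch (the test data here is made up):

    import cProfile
    import pstats
    from io import StringIO
    import numpy as np

    # 100,000 rows of whitespace-delimited floats
    text = "\n".join("1.0 2.0 3.0" for _ in range(100000))

    prof = cProfile.Profile()
    prof.enable()
    np.genfromtxt(StringIO(text))
    prof.disable()

    # the ten most expensive calls are the candidates for C/Cython
    pstats.Stats(prof).sort_stats("cumulative").print_stats(10)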

On 12/13/2011 12:08 PM, Chris Barker wrote:
[...]
Reading data is hard, and writing code that suits the diversity in the Numerical Python community is even harder!

Both the loadtxt and genfromtxt functions (other functions are perhaps less important) perhaps need an upgrade to incorporate the new NA object. I think that adding the NA object will simplify some of the process, because invalid data (missing, or a string in a numerical format) can be set to NA without requiring the creation of a new masked array or returning an error.

Here I think loadtxt is a better target than genfromtxt because, as I understand it, it assumes the user really knows the data, whereas genfromtxt can infer the appropriate format from the data.

So I agree that a new 'superfast custom CSV reader for well-behaved data' function would be rather useful, especially as a replacement for loadtxt. By that I mean reading data using a user-specified format that essentially follows the CSV format (http://en.wikipedia.org/wiki/Comma-separated_values) -- its needs are to allow for the NA object, skipping lines, and user-defined delimiters.

Bruce
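P.S. For the well-behaved case, the current loadtxt call is already close to that shape -- a minimal sketch (the inline data is made up):

    import numpy as np
    from io import StringIO

    # one header line, comma-delimited floats
    f = StringIO("x,y,z\n1.0,2.0,3.0\n4.0,5.0,6.0\n")

    arr = np.loadtxt(f, delimiter=",", skiprows=1)
    # arr.shape == (2, 3)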

On Tue, Dec 13, 2011 at 11:29 AM, Bruce Southey <bsouthey@gmail.com> wrote:
Reading data is hard, and writing code that suits the diversity in the Numerical Python community is even harder!
yup

Both the loadtxt and genfromtxt functions (other functions are perhaps less important) perhaps need an upgrade to incorporate the new NA object.
yes, if we are satisfied that the new NA object is, in fact, the way of the future.
Here I think loadtxt is a better target than genfromtxt because, as I understand it, it assumes the user really knows the data, whereas genfromtxt can infer the appropriate format from the data.
So I agree that a new 'superfast custom CSV reader for well-behaved data' function would be rather useful, especially as a replacement for loadtxt. By that I mean reading data using a user-specified format that essentially follows the CSV format (http://en.wikipedia.org/wiki/Comma-separated_values) -- its needs are to allow for the NA object, skipping lines, and user-defined delimiters.
I think that ideally there could be one interface to reading tabular data -- hopefully it would be easy for the user to specify what they want, and if they don't, the code tries to figure it out. Also, under the hood, the "easy" cases are special-cased to high-performing versions.

genfromtxt sure looks close for an API -- it just needs the "high performance special cases" under the hood. It may be that the way it's designed makes it very difficult to do that, though -- I haven't looked closely enough to tell.

At least that's what I'm thinking at the moment.

-Chris
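P.S. The front end I have in mind might look something like this -- purely a hypothetical sketch, with loadtxt standing in for a future optimized fast path:

    import numpy as np

    def read_table(fname, delimiter=None, skip_header=0):
        """Hypothetical single entry point: try the cheap, regular-data
        path first; fall back to the fully general reader if it fails.
        fname should be a path so the fallback can re-read the file."""
        try:
            # fast path: regular, purely numeric data
            return np.loadtxt(fname, delimiter=delimiter,
                              skiprows=skip_header)
        except ValueError:
            # general path: mixed dtypes, missing values, column names...
            return np.genfromtxt(fname, delimiter=delimiter,
                                 skip_header=skip_header, dtype=None)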

On Tue, Dec 13, 2011 at 10:07 PM, Chris Barker <chris.barker@noaa.gov> wrote:
[...]
genfromtxt sure looks close for an API
This I don't agree with. It has a huge number of keywords that just confuse or intimidate a beginning user. There should be a dead-simple interface; even the loadtxt API is on the heavy side.

Ralf

On Tue, Dec 13, 2011 at 1:21 PM, Ralf Gommers <ralf.gommers@googlemail.com> wrote:
genfromtxt sure looks close for an API
This I don't agree with. It has a huge number of keywords that just confuse or intimidate a beginning user. There should be a dead-simple interface; even the loadtxt API is on the heavy side.
well, yes, though it does do a lot -- do you have a simpler one in mind?

But anyway, the really simple cases are really simple, even with genfromtxt. I guess it's a matter of debate about what is a better API: a few functions, each adding a layer of sophistication, or one function, with layers of sophistication added with an array of keyword arguments.

In either case, though, I wish the multiple functionality were built on the same well-optimized core code.

-Chris

On 12/14/2011 01:03 AM, Chris Barker wrote:
[...]
I am not sure that you can even create a simple API here, as even Python's csv module is rather complex, especially when it just reads data as strings. It also 'hides' many arguments in the Dialect class, although these are just the collection of 7 'fmtparam' arguments. It also provides the Sniffer class that tries to find the correct format, which can then be passed to the reader function. Then you still have to convert the data into the required types -- another set of arguments, as well as yet another pass through the data.

In comparison, genfromtxt can perform sniffing, and both genfromtxt and loadtxt can read and convert the data. These also add some useful features like skipping rows (start, end and commented) and columns. However, it could be possible to create a sniffer function and a single data-reader function, leading to a 'simple' reader function -- but that probably would not change the API of the underlying data-reader function.

Bruce
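P.S. To make the csv-module comparison concrete, a small sketch of the Sniffer plus the extra conversion pass (the sample data is made up):

    import csv
    import io

    sample = "x;y;z\n1;2;3\n4;5;6\n"   # semicolon-delimited, with a header

    dialect = csv.Sniffer().sniff(sample)   # guesses delimiter, quoting, etc.
    rows = list(csv.reader(io.StringIO(sample), dialect))

    # csv hands back strings, so converting types is a second pass:
    data = [[float(v) for v in row] for row in rows[1:]]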

On Wed, Dec 14, 2011 at 4:11 PM, Bruce Southey <bsouthey@gmail.com> wrote:
[...]
well, yes, though it does do a lot -- do you have a simpler one in mind?
Just looking at what I normally wouldn't need for simple data files and/or what a beginning user won't understand at once, the `unpack` and `ndmin` keywords could certainly be left out. `converters` is also questionable. That's probably as simple as it can get.
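Something like this pared-down wrapper, say -- a hypothetical sketch of the reduced signature, just delegating to np.loadtxt:

    import numpy as np

    # only the keywords a beginner is likely to need;
    # unpack, ndmin and converters deliberately left out
    def simple_loadtxt(fname, dtype=float, comments="#", delimiter=None,
                       skiprows=0, usecols=None):
        return np.loadtxt(fname, dtype=dtype, comments=comments,
                          delimiter=delimiter, skiprows=skiprows,
                          usecols=usecols)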
Note that I don't think this should be changed now, that's not worth the trouble.
But anyway, the really simple cases are really simple, even with genfromtxt.
I guess it's a matter of debate about what is a better API:
a few functions, each adding a layer of sophistication
or
one function, with layers of sophistication added with an array of keyword arguments.
There's always a trade-off, but looking at the docstring for genfromtxt should make it an easy call in this case.
In either case, though, I wish the multiple functionality were built on the same well-optimized core code.
I wish that too, but I'm fairly certain that you can't write that core code with the ability to handle missing and irregular data and make it close to the same speed as an optimized reader for regular data.
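A rough way to put numbers on that today is to time the two existing readers on the same regular data -- a hypothetical micro-benchmark (exact numbers will vary by machine and NumPy version):

    import io
    import timeit
    import numpy as np

    text = "\n".join("1.0 2.0 3.0" for _ in range(10000))

    t_load = timeit.timeit(lambda: np.loadtxt(io.StringIO(text)), number=3)
    t_gen = timeit.timeit(lambda: np.genfromtxt(io.StringIO(text)), number=3)

    # the fully general reader typically comes out well behind loadtxt
    print(t_load, t_gen)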
[...]
In comparison, genfromtxt can perform sniffing
I assume you mean the `dtype=None` example in the docstring? That works to some extent, but you still need to specify the delimiter. I commented on that on the loadtable PR.
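For reference, the behavior in question -- `dtype=None` guesses a type per column, but comma-separated data still needs an explicit delimiter (inline data made up):

    import numpy as np
    from io import StringIO

    # whitespace-delimited: a type per column is inferred automatically
    a = np.genfromtxt(StringIO("1 2.5\n2 3.0\n"), dtype=None)
    # a.dtype is structured, e.g. [('f0', '<i8'), ('f1', '<f8')]

    # comma-separated: without delimiter="," each row would come back
    # as a single unsplit token rather than two numeric columns
    b = np.genfromtxt(StringIO("1,2.5\n2,3.0\n"), dtype=None, delimiter=",")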
and both genfromtxt and loadtxt can read and convert the data. These also add some useful features like skipping rows (start, end and commented) and columns. However, it could be possible to create a sniffer function and a single data-reader function, leading to a 'simple' reader function -- but that probably would not change the API of the underlying data-reader function.
Better auto-detection of things like delimiters would indeed be quite useful.

Ralf

On Wed, Dec 14, 2011 at 1:22 PM, Ralf Gommers <ralf.gommers@googlemail.com> wrote:
[...]
Just looking at what I normally wouldn't need for simple data files and/or what a beginning user won't understand at once, the `unpack` and `ndmin` keywords could certainly be left out. `converters` is also questionable. That's probably as simple as it can get.
Just my two cents (and I was one of those who championed its inclusion): the ndmin feature is designed to prevent unexpected results that users (particularly beginners) may encounter with their datasets. Now, maybe it might be difficult to tell a beginner *why* they might need to be aware of it, but it is very easy to describe *how* to use: "How many dimensions is your data? Two? Ok, just set ndmin=2 and you are good to go!"

Cheers!
Ben Root
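P.S. The classic surprise, concretely -- a single-row file collapses to 1-D unless ndmin pins the shape (inline data made up):

    import numpy as np
    from io import StringIO

    a = np.loadtxt(StringIO("1.0 2.0 3.0\n"))           # a.shape == (3,)
    b = np.loadtxt(StringIO("1.0 2.0 3.0\n"), ndmin=2)  # b.shape == (1, 3)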

On Wed, Dec 14, 2011 at 11:36 AM, Benjamin Root <ben.root@ou.edu> wrote:
well, yes, though it does do a lot -- do you have a simpler one in mind?
Just looking at what I normally wouldn't need for simple data files and/or what a beginning user won't understand at once, the `unpack` and `ndmin` keywords could certainly be left out. `converters` is also questionable. That's probably as simple as it can get.
this may be a function of a well-written docstring -- if it is clear to the newbie that "all the rest of this you don't need unless you have a weird data file", then extra keyword arguments don't really hurt. A few examples of the basic use-cases go a long way.

And yes, the core reader for the complex cases isn't going to be fast (it's going to be complex C code...), but we could still have a core reader that handled most cases.

Anyway, I think it's time to write code, and see if it can be rolled in somehow...

-Chris

On Wed, Dec 14, 2011 at 9:54 PM, Chris Barker <chris.barker@noaa.gov> wrote:
[...]
And yes, the core reader for the complex cases isn't going to be fast (it's going to be complex C code...), but we could still have a core reader that handled most cases.
Okay, now we're on the same page I think.
Anyway, I think it's time to write code, and see if it can be rolled in somehow...
Agreed.
Ralf
participants (4)
- Benjamin Root
- Bruce Southey
- Chris Barker
- Ralf Gommers