
On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker <Chris.Barker@noaa.gov> wrote:
Hi folks,
This is a continuation of a conversation already started, but I gave it a new, more appropriate, thread and subject.
On 12/6/11 2:13 PM, Wes McKinney wrote:
we should start talking about building a *high performance* flat file loading solution with good column type inference and sensible defaults, etc. ...
I personally don't believe in sacrificing an order of magnitude of performance in the 90% case for the 10% case-- so maybe it makes sense to have two functions around: a superfast custom CSV reader for well-behaved data, and a slower, but highly flexible, function like loadtable to fall back on.
I've wanted this for ages, and have done some work towards it, but like others, only had time for a solution specific to my own use case. A few thoughts:
* If we have a good, fast ascii (or unicode?) to array reader, hopefully it could be leveraged for use in the more complex cases. So that rather than genfromtxt() being written from scratch, it would be a wrapper around the lower-level reader.
You seem to be contradicting yourself here. The more complex cases are Wes' 10%, and they are why genfromtxt is so hairy internally. There's always a trade-off between speed and handling complex corner cases, and you want both. A very fast reader for well-behaved files would be very welcome, but I see it as a separate topic from genfromtxt/loadtable. The question for the loadtable pull request is whether it is different enough from genfromtxt that we need/want both, or whether loadtable should replace genfromtxt. Cheers, Ralf
* key to performance is to have the text-to-number-to-numpy-type conversion happening in C -- if you read the text with python, then convert to numbers, then to numpy arrays, it's simply going to be slow.
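As a rough illustration of that gap (a hypothetical micro-comparison, not a benchmark of any proposed reader): the pure-Python path pays for a Python function call and an object allocation per value, while numpy's fromstring does the whole text-to-float pass in C:

```python
import numpy as np

text = ",".join(str(x * 0.5) for x in range(1000))

# Pure-Python path: one float() call and one Python float object per value,
# then a copy into an array.
py_values = np.array([float(s) for s in text.split(",")])

# C path: numpy parses the whole buffer in a single call.
np_values = np.fromstring(text, sep=",")

assert np.allclose(py_values, np_values)
```

Both give the same array; the structural difference (per-value Python overhead vs. one C loop) is what an order of magnitude of performance comes down to.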
* I think we want a solution that can be adapted to arbitrary text files -- not just tabular, CSV-style data. I have a lot of those to read - and some thoughts about how.
Efforts I have made so far, and what I've learned from them:
1) fromfile(): fromfile (for text) is nice and fast, but buggy, and a bit too limited. I've posted various notes about this in the past (and, I'm pretty sure, a couple of tickets). The key missing features are:
a) no support for commented lines (this is a lesser need, I think)
b) there can be only one delimiter, and newlines are treated as generic whitespace. What this means is that if you have a whitespace-delimited file, you can read multiple lines, but if it is, for instance, comma-delimited, then you can only read one line at a time, killing performance.
c) there are various bugs if the text is malformed, or doesn't quite match what you're asking for (i.e. reading integers, but the text is float) -- mostly really limited error checking.
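For what it's worth, the usual workaround for (b) -- a sketch, not part of any proposal here -- is to slurp the whole file, rewrite the newlines as the delimiter before handing the buffer to the C parser, and then restore the row structure with a reshape:

```python
import io
import numpy as np

f = io.StringIO("1,2,3\n4,5,6\n")   # stands in for a comma-delimited file

# Turn the newlines into commas so the C parser can read everything at once,
# then put the rows back with reshape.
text = f.read().strip().replace("\n", ",")   # "1,2,3,4,5,6"
data = np.fromstring(text, sep=",").reshape(2, -1)

# data is now [[1., 2., 3.], [4., 5., 6.]]
```

It works, but you pay for an extra copy of the whole file in memory -- which is exactly why newline handling belongs in the C reader itself.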
I spent some time digging into the code, and found it to be really hard-to-follow C code -- and very hard to update. The core idea is pretty nice -- each dtype should know how to read itself from a text file -- but the implementation is painful. The key issue is that for floats and ints, anyway, it relies on the C atoi and atof functions. However, there have been patches to these that handle NaN better, etc., for numpy, and I think a python patch as well. So the code calls the numpy atoi, which does some checks, then calls the python atoi, which then calls the C lib atoi (I think all that...). In any case, the core bugs are due to the fact that atoi and friends don't return an error code, so you have to check whether the pointer has been incremented to see if the read was successful -- and this error checking is not propagated through all those levels of calls. It got really ugly to try to fix! Also, the use of the C atoi() means that locales can only be handled in the default way -- i.e. there is no way to read European-style floats on a system with a US locale.
My conclusion -- the current code is too much a mess to try to deal with and fix!
I also think it's a mistake to have text file reading be a special case of fromfile() -- it really should be a separate function, though that's a minor API question.
2) FileScanner:
FileScanner is some code I wrote years ago as a C extension -- it's limited, but does the job and is pretty fast. It essentially calls fscanf() as many times as it gets a successful scan, skipping all invalid text, then returning a numpy array. You can also specify how many numbers you want read from the file. It only supports floats. Travis O. asked if it could be included in Scipy way back when, but I suspect none of my code actually made it in.
If I had to do it again, I might write something similar in Cython, though I am still using it.
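A pure-Python sketch of the same idea -- scan a buffer for anything that looks like a float, skip everything else, stop after an optional count -- might look like this (the names here are mine, not the original FileScanner API):

```python
import re
import numpy as np

# Matches plain and exponent-style floats: 1.5, -0.25, 2e3, +4, etc.
FLOAT_RE = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")

def scan_floats(text, count=-1):
    """Return a float64 array of every number found in text, silently
    skipping any invalid text in between (count < 0 means read them all)."""
    matches = FLOAT_RE.findall(text)
    if count >= 0:
        matches = matches[:count]
    return np.array(matches, dtype=np.float64)

result = scan_floats("depth: 1.5m, temp 2e3 (offset -0.25)")
# result is [1.5, 2000., -0.25]
```

A C or Cython version would do the same loop with strtod-style scanning instead of a regex, which is where the speed comes from.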
My Conclusions:
I think what we need is something similar to MATLAB's fscanf():
what it does is take a C-style format string, and apply it to your file over and over again, as many times as it can, returning an array. What's nice about this is that it can be repurposed to efficiently read a wide variety of text files fast.
For numpy, I imagine something like:
fromtextfile(f, dtype=np.float64, comment=None, shape=None):
    """
    Read data from a text file, returning a numpy array.

    f: a filename or file-like object

    comment: a string giving the comment signifier. Anything on a
        line after this string will be ignored.

    dtype: the numpy dtype you want read from the file

    shape: the shape of the resulting array. If shape is None, the
        file will be read until EOF or until there is a read error.
        By default, if there are newlines in the file, a 2-d array
        will be returned, with a newline signifying a new row in
        the array.
    """
This is actually pretty straightforward. If it supports compound dtypes, then you can read a pretty complex CSV file, once you've determined the dtype for your "record" (row). It is also really simple to use for the simple cases.
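To make the proposal concrete, here is a slow pure-Python reference sketch of that signature (the real thing would do the parsing in C; the comma/whitespace splitting rule and the equal-length-rows assumption are mine):

```python
import io
import numpy as np

def fromtextfile(f, dtype=np.float64, comment=None, shape=None):
    """Pure-Python reference for the proposed reader: splits each line on
    commas and whitespace, drops comments, and returns a 2-d array when
    the file has multiple rows (unless an explicit shape overrides that)."""
    if isinstance(f, str):
        f = open(f)
    rows = []
    for line in f:
        if comment is not None:
            line = line.split(comment, 1)[0]      # strip trailing comment
        fields = line.replace(",", " ").split()
        if fields:
            rows.append(fields)
    flat = np.array([v for row in rows for v in row], dtype=dtype)
    if shape is not None:
        return flat.reshape(shape)
    # default: one row per line (assumes equal-length rows)
    return flat.reshape(len(rows), -1) if len(rows) > 1 else flat

buf = io.StringIO("1, 2, 3  # header\n4 5 6\n")
a = fromtextfile(buf, comment="#")    # a 2x3 float64 array
```

The C version would replace the split/convert loop with per-dtype scanning, but the signature and the defaults are what matter for the API discussion.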
But of course, the implementation could be a pain -- I've been thinking that you could get a lot of it by creating a mapping from numpy dtypes to fscanf() format strings, then simply use fscanf for the actual file reading. This would certainly be easy for the easy cases. (maybe you'd want to use sscanf, so you could have the same code scan strings as well as files)
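That mapping could be as simple as a lookup table plus a join over the fields of a compound dtype. A sketch (the table entries and the helper name are illustrative -- and note that the right conversion for int64 is platform-dependent in real C):

```python
import numpy as np

# Hypothetical mapping from numpy dtypes to C scanf conversion specifiers.
SCANF_FMT = {
    np.dtype(np.float64): "%lf",
    np.dtype(np.float32): "%f",
    np.dtype(np.int32):   "%d",
    np.dtype(np.int64):   "%ld",   # platform-dependent in real C code
}

def scanf_format(dtype, sep=","):
    """Build a per-row fscanf()/sscanf() format string for a
    (possibly compound) dtype."""
    dtype = np.dtype(dtype)
    if dtype.names:   # compound "record" dtype: one conversion per field
        fields = [SCANF_FMT[dtype.fields[name][0]] for name in dtype.names]
    else:
        fields = [SCANF_FMT[dtype]]
    return sep.join(fields)

row = np.dtype([("time", np.float64), ("count", np.int32)])
scanf_format(row)   # -> "%lf,%d"
```

Once you have that string, the inner loop really is just "call sscanf until it stops matching" -- which is why the easy cases would be easy.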
Ideally, each dtype would know how to read itself from a string, but as I said above, the code for that is currently pretty ugly, so it may be easier to keep it separate.
Anyway, I'd be glad to help with this effort.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959  voice
7600 Sand Point Way NE   (206) 526-6329  fax
Seattle, WA 98115        (206) 526-6317  main reception
Chris.Barker@noaa.gov

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion