[Numpy-discussion] Fast Reading of ASCII files
Chris.Barker
Chris.Barker at noaa.gov
Wed Dec 7 13:50:14 EST 2011
Hi folks,
This is a continuation of a conversation already started, but I gave it
a new, more appropriate, thread and subject.
On 12/6/11 2:13 PM, Wes McKinney wrote:
> we should start talking
> about building a *high performance* flat file loading solution with
> good column type inference and sensible defaults, etc.
...
> I personally don't
> believe in sacrificing an order of magnitude of performance in the 90%
> case for the 10% case-- so maybe it makes sense to have two functions
> around: a superfast custom CSV reader for well-behaved data, and a
> slower, but highly flexible, function like loadtable to fall back on.
I've wanted this for ages, and have done some work towards it, but like
others, only had the time for a my-use-case-specific solution. A few
thoughts:
* If we have a good, fast ascii (or unicode?) to array reader, hopefully
it could be leveraged for use in the more complex cases. So that rather
than genfromtxt() being written from scratch, it would be a wrapper
around the lower-level reader.
* The key to performance is to have the text-to-number-to-numpy-type
conversion happening in C -- if you read the text with Python, then
convert to numbers, then to numpy arrays, it's simply going to be slow.
* I think we want a solution that can be adapted to arbitrary text files
-- not just tabular, CSV-style data. I have a lot of those to read - and
some thoughts about how.
Efforts I have made so far, and what I've learned from them:
1) fromfile():
fromfile (for text) is nice and fast, but buggy, and a bit too
limited. I've posted various notes about this in the past (and, I'm
pretty sure, a couple of tickets). The key missing features are:
a) no support for commented lines (this is a lesser need, I think)
b) there can be only one delimiter, and newlines are treated as
generic whitespace. What this means is that if you have a
whitespace-delimited file, you can read multiple lines, but if it is,
for instance, comma-delimited, then you can only read one line at a
time, killing performance.
c) there are various bugs if the text is malformed, or doesn't quite
match what you're asking for (e.g. reading integers, but the text is
float) -- mostly really limited error checking.
I spent some time digging into the code, and found it to be really
hard-to-follow C code, and very hard to update. The core idea is pretty
nice -- each dtype should know how to read itself from a text file, but the
implementation is painful. The key issue is that for floats and ints,
anyway, it relies on the C atoi and atof functions. However, there have
been patches to these that handle NaN better, etc, for numpy, and I
think a python patch as well. So the code calls the numpy atoi, which
does some checks, then calls the python atoi, which then calls the C lib
atoi (I think all that...) In any case, the core bugs are due to the
fact that atoi and friends don't return an error code, so you have to
check if the pointer has been incremented to see if the read was
successful -- this error checking is not propagated through all those
levels of calls. It got really ugly to try to fix! Also, the use of the
C atoi() means that locales may only be handled in the default way --
i.e. no way to read European-style floats on a system with a US locale.
My conclusion -- the current code is too much of a mess to try to deal
with and fix!
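To make the error-checking point concrete: in C, the usual way to get that check is strtod(), which reports how far it read through an end pointer. Here is a minimal Python sketch of the same "did the position advance?" test. The regular expression only stands in for strtod(), and the names parse_float_at and read_floats_strict are made up for illustration; none of this is numpy's code.

```python
import re

# Optional leading whitespace, then a float, "nan" or "inf" (as strtod accepts).
_FLOAT_RE = re.compile(
    r"\s*([+-]?(?:nan|inf(?:inity)?|(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?))",
    re.IGNORECASE,
)


def parse_float_at(text, pos=0):
    """Parse one float starting at ``pos``; return (value, new_pos).

    If nothing could be parsed, new_pos == pos. The caller must check
    this, just as C code must check whether strtod() moved its end pointer.
    """
    match = _FLOAT_RE.match(text, pos)
    if match is None:
        return 0.0, pos  # no progress: this is the only failure signal
    return float(match.group(1)), match.end()


def read_floats_strict(text):
    """Read whitespace-separated floats, raising on the first bad token."""
    values, pos = [], 0
    while pos < len(text):
        value, new_pos = parse_float_at(text, pos)
        if new_pos == pos:
            if text[pos:].strip() == "":
                break  # only trailing whitespace is left
            raise ValueError(
                f"cannot parse a float at offset {pos}: {text[pos:pos + 10]!r}"
            )
        values.append(value)
        pos = new_pos
    return values
```

The point of the sketch is that the check lives in exactly one place (read_floats_strict) and turns into an exception right there, rather than having to be passed back up through several layers of wrappers.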
I also think it's a mistake to have text file reading a special case of
fromfile(), it really should be a separate issue, though that's a minor
API question.
2) FileScanner:
FileScanner is some code I wrote years ago as a C extension - it's
limited, but does the job and is pretty fast. It essentially calls
fscanf() as many times as it gets a successful scan, skipping all
invalid text, then returning a numpy array. You can also specify how
many numbers you want read from the file. It only supports floats.
Travis O. asked if it could be included in Scipy way back when, but I
suspect none of my code actually made it in.
If I had to do it again, I might write something similar in Cython,
though I am still using it.
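For anyone who hasn't seen that style of reader, the behaviour is roughly the following. This is a pure-Python sketch of the idea, not the FileScanner code: it works on a string rather than an open file, uses a regular expression where the C code uses fscanf(), and returns a stdlib array of doubles as a stand-in for a numpy array. The name scan_floats is made up.

```python
import re
from array import array

# A token that looks like a decimal float.
_NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def scan_floats(text, count=None):
    """Return every float found in ``text``, skipping anything else.

    count: optional maximum number of values to read.
    """
    values = array("d")  # compact C doubles; stands in for a numpy array
    for match in _NUMBER.finditer(text):
        if count is not None and len(values) >= count:
            break
        values.append(float(match.group()))
    return values


# Labels and punctuation are skipped; only the numbers come back.
print(list(scan_floats("x = 1.5, y = -2; z: 3e1")))
print(list(scan_floats("x = 1.5, y = -2; z: 3e1", count=2)))
```

Note that a scanner this forgiving will also pick up digits embedded in other text, which is fine for header-plus-numbers files but no good when you need strict validation.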
My Conclusions:
I think what we need is something similar to MATLAB's fscanf():
what it does is take a C-style format string, and apply it to your file
over and over again as many times as it can, and returns an array. What's
nice about this is that it can be purposed to efficiently read a wide
variety of text files fast.
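As a rough illustration of that "apply the format until it stops matching" behaviour, here is a small Python sketch. It supports only a made-up subset of conversions (%d, %f, %s), translates the format to a regular expression rather than calling any scanf, and the names compile_format and scan_repeatedly are hypothetical. It is not how MATLAB implements it.

```python
import re

# Regex fragment and converter for each supported conversion
# (a small, illustrative subset of what C's scanf accepts).
_CONVERSIONS = {
    "d": (r"([+-]?\d+)", int),
    "f": (r"([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)", float),
    "s": (r"(\S+)", str),
}


def compile_format(fmt):
    """Translate a scanf-like format into (compiled regex, converters)."""
    pattern, converters, i = [], [], 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%":
            if i + 1 >= len(fmt) or fmt[i + 1] not in _CONVERSIONS:
                raise ValueError(f"unsupported conversion in format {fmt!r}")
            fragment, convert = _CONVERSIONS[fmt[i + 1]]
            pattern.append(r"\s*" + fragment)  # conversions skip leading whitespace
            converters.append(convert)
            i += 2
        elif ch.isspace():
            pattern.append(r"\s*")  # whitespace matches any run of whitespace
            i += 1
        else:
            pattern.append(re.escape(ch))  # literal characters must match exactly
            i += 1
    return re.compile("".join(pattern)), converters


def scan_repeatedly(text, fmt):
    """Apply ``fmt`` over and over; stop at the first place it fails."""
    regex, converters = compile_format(fmt)
    records, pos = [], 0
    while True:
        match = regex.match(text, pos)
        if match is None or match.end() == pos:
            break
        records.append(tuple(c(g) for c, g in zip(converters, match.groups())))
        pos = match.end()
    return records


# Two good rows are read; scanning stops at the malformed third row.
print(scan_repeatedly("1, 2.5\n2, 3.5\n3, oops\n", "%d, %f"))
```

Because newlines are just whitespace to the format, the same call handles one record per line or several records on a line, which is what makes this approach adaptable to non-tabular files.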
For numpy, I imagine something like:
def fromtextfile(f, dtype=np.float64, comment=None, shape=None):
    """
    read data from a text file, returning a numpy array

    f: is a filename or file-like object

    comment: is a string of the comment signifier. Anything on a line
             after this string will be ignored.

    dtype: is a numpy dtype that you want read from the file

    shape: is the shape of the resulting array. If shape==None, the
           file will be read until EOF or until there is a read error.
           By default, if there are newlines in the file, a 2-d array
           will be returned, with the newline signifying a new row in
           the array.
    """
This is actually pretty straightforward. If it supports compound dtypes,
then you can read a pretty complex CSV file, once you've determined the
dtype for your "record" (row). It is also really simple to use for the
simple cases.
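To show what reading a "record" per row might look like, here is a pure-Python sketch of the semantics only (it says nothing about speed, which is the whole reason to do this in C). A list of (name, converter) pairs stands in for a compound dtype, and rows come back as dicts rather than a numpy record array. The function name read_records and its parameters are made up for illustration.

```python
import io


def read_records(f, record, delimiter=",", comment=None):
    """Read rows from a file-like object, one converter per column.

    record: sequence of (name, converter) pairs, a plain-Python stand-in
            for a compound numpy dtype.
    Reading stops at EOF or at the first row that fails to convert.
    """
    rows = []
    for line in f:
        if comment is not None:
            line = line.split(comment, 1)[0]  # ignore everything after the marker
        if not line.strip():
            continue  # blank or comment-only line
        fields = [field.strip() for field in line.split(delimiter)]
        if len(fields) != len(record):
            break  # wrong number of columns counts as a read error
        try:
            rows.append(
                {name: conv(field) for (name, conv), field in zip(record, fields)}
            )
        except ValueError:
            break  # a field that does not convert also ends the read
    return rows


# Usage: a string column, an integer column and a float column.
text = "# station, depth, temp\nA1, 10, 4.5\nB2, 20, 3.75  # deep\n\nbad, row\n"
record = [("station", str), ("depth", int), ("temp", float)]
rows = read_records(io.StringIO(text), record, comment="#")
print(rows)
```

Once the record layout is known, the caller's side is a single call, which is the appeal of driving the reader from the dtype.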
But of course, the implementation could be a pain -- I've been thinking
that you could get a lot of it by creating a mapping from numpy dtypes
to fscanf() format strings, then simply use fscanf for the actual file
reading. This would certainly be easy for the easy cases. (maybe you'd
want to use sscanf, so you could have the same code scan strings as well
as files)
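A sketch of what such a mapping could look like, written in Python just to build the format string. The table and the name record_format are hypothetical, and the integer entries assume the common platforms where short is 16 bits and int is 32 bits; portable C would take these from the SCN* macros in <inttypes.h> instead.

```python
# Hypothetical mapping from numpy dtype names to C scanf conversions.
# The C type each conversion fills is noted alongside.
SCANF_FORMATS = {
    "int16": "%hd",    # short (assumed 16 bits)
    "int32": "%d",     # int (assumed 32 bits)
    "int64": "%lld",   # long long
    "float32": "%f",   # float
    "float64": "%lf",  # double
}


def record_format(field_dtypes, delimiter=","):
    """Build one scanf format string for a row with the given field dtypes."""
    try:
        parts = [SCANF_FORMATS[name] for name in field_dtypes]
    except KeyError as exc:
        raise ValueError(
            f"no scanf format known for dtype {exc.args[0]!r}"
        ) from None
    # Numeric conversions already skip leading whitespace; the space placed
    # before the delimiter lets scanf also skip whitespace in front of it.
    return f" {delimiter}".join(parts)


# One int column followed by two double columns, comma-delimited.
print(record_format(["int32", "float64", "float64"]))
```

Strings, dates and anything locale-dependent are where a plain table like this stops being enough, so this really only covers the easy cases mentioned above.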
Ideally, each dtype would know how to read itself from a string, but as
I said above, the code for that is currently pretty ugly, so it may be
easier to keep it separate.
Anyway, I'd be glad to help with this effort.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov