[Numpy-discussion] Fast Reading of ASCII files

Wed Dec 7 13:50:14 EST 2011

Hi folks,

This is a continuation of a conversation already started, but I gave it 
a new, more appropriate thread and subject.

On 12/6/11 2:13 PM, Wes McKinney wrote:
> we should start talking
> about building a *high performance* flat file loading solution with
> good column type inference and sensible defaults, etc.
...

>  I personally don't
> believe in sacrificing an order of magnitude of performance in the 90%
> case for the 10% case-- so maybe it makes sense to have two functions
> around: a superfast custom CSV reader for well-behaved data, and a
> slower, but highly flexible, function like loadtable to fall back on.

I've wanted this for ages, and have done some work towards it, but like 
others, only had the time for a my-use-case-specific solution. A few 
thoughts:

* If we have a good, fast ascii (or unicode?) to array reader, hopefully 
it could be leveraged in the more complex cases, so that rather than 
genfromtxt() being written from scratch, it would be a wrapper around 
the lower-level reader.

* The key to performance is to have the text-to-number-to-numpy-type 
conversion happen in C -- if you read the text with Python, then convert 
to numbers, then to numpy arrays, it's simply going to be slow.

* I think we want a solution that can be adapted to arbitrary text files 
-- not just tabular, CSV-style data. I have a lot of those to read - and 
some thoughts about how.

Efforts I have made so far, and what I've learned from them:

1) fromfile():
     fromfile (for text) is nice and fast, but buggy, and a bit too 
limited. I've posted various notes about this in the past (and, I'm 
pretty sure, a couple of tickets). The key missing features are:
   a) no support for commented lines (this is a lesser need, I think)
   b) there can be only one delimiter, and newlines are treated as 
generic whitespace. What this means is that if you have a 
whitespace-delimited file, you can read multiple lines, but if it is, 
for instance, comma-delimited, then you can only read one line at a 
time, killing performance.
   c) there are various bugs if the text is malformed, or doesn't quite 
match what you're asking for (e.g. reading integers, but the text is 
float) -- mostly really limited error checking.
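
A small sketch of limitation (b). It uses np.fromstring(..., sep=...)
as a stand-in for text-mode fromfile() so that no file on disk is
needed; that the two apply the same separator rules is an assumption
here, and the sample data is made up:

```python
import numpy as np

# Whitespace-delimited text: newlines count as generic whitespace,
# so several lines can be parsed in a single call.
ws_text = "1 2 3\n4 5 6\n"
ws_array = np.fromstring(ws_text, sep=" ").reshape(2, 3)

# Comma-delimited text: only one delimiter is understood, so the
# workaround is one parsing call per line, which is what hurts speed.
csv_text = "1,2,3\n4,5,6\n"
csv_array = np.array(
    [np.fromstring(line, sep=",") for line in csv_text.splitlines()]
)
```

Both calls produce the same 2x3 float array; the difference is that the
second needs a Python-level loop over lines.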

I spent some time digging into the code, and found it to be really 
hard-to-follow C code, and very hard to update. The core idea is pretty 
nice -- each dtype should know how to read itself from a text file -- 
but the implementation is painful. The key issue is that for floats and ints, 
anyway, it relies on the C atoi and atof functions. However, there have 
been patches to these that handle NaN better, etc, for numpy, and I 
think a python patch as well. So the code calls the numpy atoi, which 
does some checks, then calls the python atoi, which then calls the C lib 
atoi (I think all that...) In any case, the core bugs are due to the 
fact that atoi and friends don't return an error code, so you have to 
check if the pointer has been incremented to see if the read was 
successful -- this error checking is not propagated through all those 
levels of calls. It got really ugly to try to fix! Also, the use of the 
C atoi() means that locales may only be handled in the default way -- 
i.e. no way to read European-style floats on a system with a US locale.
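
The "check whether the pointer advanced" pattern is the one C's
strtod() and strtol() expose through their end-pointer argument. A
minimal Python sketch of the same idea follows: the hypothetical helper
scan_float() reports how far it got, and the caller turns "no advance"
into an error instead of silently carrying on. The helper names and the
regex are illustrative assumptions, not numpy code:

```python
import re

# Optional leading whitespace, then a plain decimal float.
_FLOAT_PREFIX = re.compile(r"\s*[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")

def scan_float(text, pos=0):
    """Return (value, new_pos); new_pos == pos means nothing was parsed."""
    m = _FLOAT_PREFIX.match(text, pos)
    if m is None:
        return 0.0, pos
    return float(m.group()), m.end()

def scan_all(text):
    """Parse every float in text, raising as soon as a scan fails."""
    values, pos = [], 0
    end = len(text.rstrip())  # trailing whitespace is not an error
    while pos < end:
        value, new_pos = scan_float(text, pos)
        if new_pos == pos:
            # The failure is propagated to the caller rather than dropped.
            raise ValueError(f"cannot parse a float at offset {pos}")
        values.append(value)
        pos = new_pos
    return values
```

scan_all("1.5 2 -3e2\n") gives [1.5, 2.0, -300.0], while scan_all("1 x")
raises ValueError rather than returning a partial result.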

My conclusion -- the current code is too much of a mess to try to deal 
with and fix!

I also think it's a mistake to have text file reading be a special case 
of fromfile(); it really should be a separate issue, though that's a 
minor API question.

2) FileScanner:

FileScanner is some code I wrote years ago as a C extension - it's 
limited, but does the job and is pretty fast. It essentially calls 
fscanf() as many times as it gets a successful scan, skipping all 
invalid text, then returning a numpy array. You can also specify how 
many numbers you want read from the file. It only supports floats. 
Travis O. asked if it could be included in Scipy way back when, but I 
suspect none of my code actually made it in.

If I had to do it again, I might write something similar in Cython, 
though I am still using the C version.
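
A rough Python approximation of the behaviour described for
FileScanner: pull out every float it can find, skip everything else,
optionally stop after a given count, and return a numpy array. The name
scan_floats() and the regex are assumptions for illustration; the real
extension does this with fscanf() in C:

```python
import itertools
import re

import numpy as np

_NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")

def scan_floats(text, count=None):
    """Return the floats found in text, ignoring any other characters."""
    matches = _NUMBER.finditer(text)
    if count is not None:
        # Stop after `count` numbers, like asking for N values from a file.
        matches = itertools.islice(matches, count)
    return np.array([float(m.group()) for m in matches], dtype=np.float64)

everything = scan_floats("x=1.5, y=2; z=-3e1")
first_two = scan_floats("x=1.5, y=2; z=-3e1", count=2)
```

One caveat of this simple version: digits embedded in a label (e.g.
"col2") are picked up as numbers too.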

My Conclusions:

I think what we need is something similar to MATLAB's fscanf():

What it does is take a C-style format string and apply it to your file 
over and over again, as many times as it can, and return an array. 
What's nice about this is that it can be adapted to efficiently read a 
wide variety of text files.
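
To make the "apply the format over and over" idea concrete, here is a
small sketch in Python. It translates a scanf-like format into a regex
and applies it until it stops matching. Only %d and %f are handled, and
compile_format() and scan_repeated() are hypothetical names; this is not
how MATLAB implements it:

```python
import re

_CONVERSIONS = {
    "%d": (r"\s*([+-]?\d+)", int),
    "%f": (r"\s*([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)", float),
}

def compile_format(fmt):
    """Translate a tiny scanf-style format into (regex, converters)."""
    pattern, casts = "", []
    for token in re.split(r"(%[df]|\s+)", fmt):
        if token in _CONVERSIONS:
            regex, cast = _CONVERSIONS[token]
            pattern += regex
            casts.append(cast)
        elif token.isspace():
            pattern += r"\s*"  # whitespace in the format matches any whitespace
        else:
            pattern += re.escape(token)  # literal text must match exactly
    return re.compile(pattern), casts

def scan_repeated(fmt, text):
    """Apply fmt repeatedly; stop at the first place it no longer matches."""
    regex, casts = compile_format(fmt)
    rows, pos = [], 0
    while True:
        m = regex.match(text, pos)
        if m is None or m.end() == pos:
            break
        rows.append(tuple(cast(g) for cast, g in zip(casts, m.groups())))
        pos = m.end()
    return rows
```

For example, scan_repeated("%d,%f", "1,2.5\n2,3.5\n") returns
[(1, 2.5), (2, 3.5)], and scanning stops at the first line that does not
fit the format.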

For numpy, I imagine something like:

def fromtextfile(f, dtype=np.float64, comment=None, shape=None):
    """
    Read data from a text file, returning a numpy array.

    f: is a filename or file-like object

    comment: is a string of the comment signifier. Anything on a line
             after this string will be ignored.

    dtype: is a numpy dtype that you want read from the file

    shape: is the shape of the resulting array. If shape==None, the
           file will be read until EOF or until there is a read error.
           By default, if there are newlines in the file, a 2-d array
           will be returned, with the newline signifying a new row in
           the array.
    """

This is actually pretty straightforward. If it supports compound dtypes, 
then you can read a pretty complex CSV file, once you've determined the 
dtype for your "record" (row). It is also really simple to use for the 
simple cases.
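
For the simple cases, the semantics of the proposed fromtextfile() can
be pinned down with a slow pure-Python stand-in. This is only a sketch
under several assumptions that are not in the proposal: f must be an
iterable of lines (no filename handling), commas and whitespace are both
treated as separators, and only simple numeric dtypes are supported:

```python
import io

import numpy as np

def fromtextfile_sketch(f, dtype=np.float64, comment=None, shape=None):
    """Slow pure-Python stand-in for the proposed fromtextfile()."""
    rows = []
    for line in f:
        if comment is not None:
            # Drop everything from the comment signifier to end of line.
            line = line.split(comment, 1)[0]
        # Assumption: commas and whitespace both separate fields.
        fields = line.replace(",", " ").split()
        if fields:  # skip blank and comment-only lines
            rows.append([float(field) for field in fields])
    arr = np.array(rows, dtype=dtype)  # one row per non-empty line
    if shape is not None:
        arr = arr.reshape(shape)
    return arr

text = io.StringIO("# x y z\n1 2 3  # first row\n4,5,6\n")
result = fromtextfile_sketch(text, comment="#")
```

Here result is a 2x3 float64 array. A fast version would do the same
work in C rather than in a Python loop.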

But of course, the implementation could be a pain -- I've been thinking 
that you could get a lot of it by creating a mapping from numpy dtypes 
to fscanf() format strings, then simply use fscanf for the actual file 
reading. This would certainly be easy for the easy cases. (Maybe you'd 
want to use sscanf, so you could have the same code scan strings as well 
as files.)
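
A sketch of what such a mapping might look like, built in Python for
readability. The conversion specifiers below assume a typical platform
where C int is 32 bits and long long is 64 bits, and scanf_format() is a
hypothetical helper name; a real implementation would have to pick the
specifiers per platform:

```python
import numpy as np

# (dtype kind, item size in bytes) -> scanf conversion specifier.
# Assumption: 32-bit int and 64-bit long long on the target platform.
_SCANF_BY_KIND_SIZE = {
    ("i", 4): "%d",
    ("i", 8): "%lld",
    ("f", 4): "%f",
    ("f", 8): "%lf",
}

def scanf_format(dtype, delimiter=","):
    """Build a scanf-style format for a simple or compound dtype."""
    dtype = np.dtype(dtype)
    if dtype.names is None:
        return _SCANF_BY_KIND_SIZE[(dtype.kind, dtype.itemsize)]
    # Compound dtype: one specifier per field, joined by the delimiter.
    parts = []
    for name in dtype.names:
        field = dtype.fields[name][0]
        parts.append(_SCANF_BY_KIND_SIZE[(field.kind, field.itemsize)])
    return delimiter.join(parts)

record = np.dtype([("id", np.int32), ("x", np.float64), ("y", np.float64)])
record_format = scanf_format(record)
```

With the record dtype above, record_format is "%d,%lf,%lf", which could
then be handed to fscanf() or sscanf() once per row.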

Ideally, each dtype would know how to read itself from a string, but as 
I said above, the code for that is currently pretty ugly, so it may be 
easier to keep it separate.

Anyway, I'd be glad to help with this effort.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov