[Numpy-discussion] Possible roadmap addendum: building better text file readers

Wed Mar 7 10:49:47 EST 2012

On Tue, Mar 6, 2012 at 4:45 PM, Chris Barker <chris.barker at noaa.gov> wrote:

> On Thu, Mar 1, 2012 at 10:58 PM, Jay Bourque <jayvius at gmail.com> wrote:
>
> > 1. Loading text files using loadtxt/genfromtxt needs a significant
> > performance boost (I think at least an order of magnitude increase in
> > performance is very doable based on what I've seen with Erin's recfile
> > code)
>
> > 2. Improved memory usage. Memory used for reading in a text file
> > shouldn’t be more than the file itself, and less if only reading a
> > subset of the file.
>
> > 3. Keep existing interfaces for reading text files (loadtxt, genfromtxt,
> > etc). No new ones.
>
> > 4. Underlying code should keep IO iteration and transformation of data
> > separate (awaiting more thoughts from Travis on this).
>
> > 5. Be able to plug in different transformations of data at low level
> > (also awaiting more thoughts from Travis).
>
> > 6. memory mapping of text files?
>
> > 7. Eventually reduce memory usage even more by using same object for
> > duplicate values in array (depends on implementing enum dtype?)
>
> > Anything else?
>
> Yes -- I'd like to see the solution be able to do high-performance
> reads of a portion of a file -- not always the whole thing. I seem to
> have a number of custom text files that I need to read that are laid
> out in chunks: a bit of a header, then a block of numbers, another
> header, another block. I'm happy to read and parse the header sections
> with pure Python, but would love a way to read the blocks of numbers
> into a numpy array fast. This will probably come out of the box with
> any of the proposed solutions, as long as they start at the current
> position of a passed-in file object, and can be told how much to read,
> then leave the file pointer in the correct position.
>
>

If you are set up with Cython to build extension modules, and you don't mind
testing an unreleased and experimental reader, you can try the text reader
that I'm working on: https://github.com/WarrenWeckesser/textreader

You can read a file like this, where the first line gives the number of
rows of the following array, and that pattern repeats:

5
1.0, 2.0, 3.0
4.0, 5.0, 6.0
7.0, 8.0, 9.0
10.0, 11.0, 12.0
13.0, 14.0, 15.0
3
1.0, 1.5, 2.0, 2.5
3.0, 3.5, 4.0, 4.5
5.0, 5.5, 6.0, 6.5
1
1.0D2, 1.25D-1, 6.25D-2, 99

with code like this:

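import numpy as np
from textreader import readrows

filename = 'data/multi.dat'

with open(filename, 'r') as f:
    # Each block starts with a line giving its number of rows.
    line = f.readline()
    while len(line) > 0:
        nrows = int(line)
        # Read nrows rows starting at the current file position;
        # sci='D' matches the exponent character in values like 1.0D2.
        a = readrows(f, np.float32, numrows=nrows, sci='D', delimiter=',')
        print("a:")
        print(a)
        print()
        line = f.readline()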
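
For comparison, the same "row count, then block" loop can be sketched with
plain NumPy, without building the extension. This is only an illustration
and not how textreader works internally: the function name read_blocks and
the inline SAMPLE data are made up for the example, the 'D' exponents are
handled by a crude string replacement, and it will be much slower than a
compiled reader on large files.

```python
import io

import numpy as np

# Hypothetical inline copy of the sample file shown above.
SAMPLE = """5
1.0, 2.0, 3.0
4.0, 5.0, 6.0
7.0, 8.0, 9.0
10.0, 11.0, 12.0
13.0, 14.0, 15.0
3
1.0, 1.5, 2.0, 2.5
3.0, 3.5, 4.0, 4.5
5.0, 5.5, 6.0, 6.5
1
1.0D2, 1.25D-1, 6.25D-2, 99
"""


def read_blocks(f):
    """Return a list of float32 arrays, one per counted block in f.

    f is an open text-mode file object. Only readline() is used, so after
    each block the file position sits just past that block's last row.
    """
    blocks = []
    while True:
        line = f.readline()
        if not line.strip():
            # End of file (or a blank line) ends the loop.
            break
        nrows = int(line)
        # Pull exactly nrows lines, rewriting Fortran-style 'D' exponents
        # as 'E' so that the standard float parser accepts them.
        rows = [f.readline().replace('D', 'E') for _ in range(nrows)]
        blocks.append(
            np.loadtxt(rows, dtype=np.float32, delimiter=',', ndmin=2)
        )
    return blocks


blocks = read_blocks(io.StringIO(SAMPLE))
for a in blocks:
    print("a:")
    print(a)
    print()
```

Because the loop consumes lines one at a time from the passed-in file
object, header parsing in pure Python can be interleaved between blocks.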

Warren