On 7/8/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
On 7/8/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
Given that both your script and the mlab version preload the whole file before calling the numpy constructor, I'm curious how that compares in speed to using numpy's fromiter function on your data. Using fromiter should improve memory usage (~50%?).
The drawback is for string columns, where we no longer know the width of the largest item. I made it fall back to "object" in this case.
Attached is a fromiter version of your script. Possible speedups could come from trying different approaches to the "convert_row" function, for example using "zip" or "enumerate" instead of "izip".
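For anyone following along, here is a minimal sketch of the fromiter approach being discussed: feeding a generator of tuples into numpy.fromiter with a structured dtype, so rows are converted one at a time instead of preloading the whole file. The sample data and column layout here are made up for illustration, not taken from the attached script.

```python
import numpy as np

# Stand-in for the lines of an open delimited file (hypothetical data).
lines = ["1\t2.5", "3\t4.5"]

def rows(lines, delim="\t"):
    # Convert each text row to a typed tuple lazily, one row at a time.
    for line in lines:
        a, b = line.split(delim)
        yield int(a), float(b)

# fromiter consumes the generator directly; with a structured dtype it
# builds a record array without an intermediate list of rows.
arr = np.fromiter(rows(lines), dtype=[("a", int), ("b", float)])
```

Passing an explicit count= to fromiter (when the number of rows is known up front) lets numpy preallocate the array and avoids reallocation as it grows.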
I suspect that you'd do better here if you removed a bunch of layers from the conversion functions. Right now it looks like: imap->chain->convert_row->tuple->generator->izip. That's five levels deep and Python functions are reasonably expensive. I would try to be a lot less clever and do something like:
def data_iterator(row_iter, delim):
    row0 = row_iter.next().split(delim)
    converters = find_formats(row0)  # left as an exercise
    yield tuple(f(x) for f, x in zip(converters, row0))
    for row in row_iter:
        yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
That sounds sane. I've maybe picked up some bad habits here and gotten away with it since I'm very I/O-bound in these cases. My main objective has been reducing the memory footprint to reduce swapping.
That's just a sketch and I haven't timed it, but it cuts a few levels out of the call chain, so it has a reasonable chance of being faster. If you wanted to be really clever, you could use some exec magic after you figure out the conversion functions to compile a special function that generates the tuples directly without any use of tuple or zip. I don't have time to work through the details right now, but the code you would compile would end up looking like this:
for (x0, x1, x2) in row_iter:
    yield (int(x0), float(x1), float(x2))
Here we've assumed that find_formats determined that there are three fields: an int and two floats. Once you have this info you can build an appropriate function and exec it. This would cut another couple of levels out of the call chain. Again, I haven't timed it, or tried it, but it looks like it would be fun to try.
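One possible shape for the exec trick, sketched here with made-up names (build_iterator is hypothetical, and the column types are passed in by name rather than inferred by a find_formats step): build the source of a specialized generator as a string, exec it, and pull the compiled function out of the namespace.

```python
def build_iterator(type_names):
    # type_names like ["int", "float", "float"]; each must name a
    # callable reachable from the exec'd code (builtins work).
    args = ", ".join("x%d" % i for i in range(len(type_names)))
    convs = ", ".join("%s(x%d)" % (t, i) for i, t in enumerate(type_names))
    src = (
        "def data_iterator(row_iter, delim):\n"
        "    for row in row_iter:\n"
        "        %s = row.split(delim)\n"
        "        yield (%s)\n" % (args, convs)
    )
    namespace = {}
    exec(src, namespace)          # compile the specialized generator
    return namespace["data_iterator"]

it = build_iterator(["int", "float", "float"])
rows = list(it(iter(["1,2.5,3.5", "4,5.5,6.5"]), ","))
```

The generated body unpacks and converts in a single statement per row, so there is no per-row tuple/zip machinery left in the hot loop.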
-tim
Thank you for the lesson! Great tip. This opens up a variety of new coding options. I've made an attempt on the fun part. Attached is a version that generates the following generator code for Vincent's __name__ == '__main__' code:

def get_data_iterator(row_iter, delim):
    yield (int('1'), int('3'), datestr2num('1/97'), float('1.12'), float('2.11'), float('1.2'))
    for row in row_iter:
        x0, x1, x2, x3, x4, x5 = row.split(delim)
        yield (int(x0), int(x1), datestr2num(x2), float(x3), float(x4), float(x5))

Best Regards,
//Torgil