[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Sun Jul 8 13:08:16 EDT 2007

On 7/8/07, Torgil Svensson <torgil.svensson at gmail.com> wrote:
>
> Given that both your script and the mlab version preload the whole
> file before calling the numpy constructor, I'm curious how that compares
> in speed to using numpy's fromiter function on your data. Using fromiter
> should improve on memory usage (~50% ?).
>
> The drawback is for string columns, where we no longer know the
> width of the largest item. I made it fall back to "object" in this
> case.
>
> Attached is a fromiter version of your script. Possible speedups could
> be done by trying different approaches to the "convert_row" function,
> for example using "zip" or "enumerate" instead of "izip".

I suspect that you'd do better here if you removed a bunch of layers from
the conversion functions. Right now it looks like:
imap->chain->convert_row->tuple->generator->izip. That's five levels deep
and Python functions are reasonably expensive. I would try to be a lot less
clever and do something like:

    def data_iterator(row_iter, delim):
        # Infer one converter per column from the first row.
        row0 = next(row_iter).split(delim)
        converters = find_formats(row0) # left as an exercise
        yield tuple(f(x) for f, x in zip(converters, row0))
        # Convert each remaining row with the same converters.
        for row in row_iter:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
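
For reference, here is one way find_formats might look. This is only a
minimal sketch: the name comes from the snippet above, but the body is an
assumption, not anything from the attached scripts. It tries int, then
float, and falls back to str, and it ignores the date columns the original
program supports.

```python
def find_formats(fields):
    """Pick a converter for each string field: int, then float, else str.

    Hypothetical helper: the logic here is an assumption for illustration.
    """
    converters = []
    for text in fields:
        for candidate in (int, float):
            try:
                candidate(text)
            except ValueError:
                continue  # this type does not fit, try the next one
            converters.append(candidate)
            break
        else:
            # Neither int nor float accepted the text, so keep it as a string.
            converters.append(str)
    return converters
```

Guessing from the first row alone is fragile: a column that starts with "1"
and later contains "1.5" would be typed as int and fail on the later row, so
a real version would probably sample more rows.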

The data_iterator function is just a sketch and I haven't timed it, but it
cuts a few levels out of the call chain, so it has a reasonable chance of
being faster. If you wanted to be really clever, you could use some exec
magic after you figure out the conversion functions to compile a special
function that generates the tuples directly without any use of tuple or zip.
I don't have time to work through the details right now, but the code you
would compile would end up looking like this:

    for (x0, x1, x2) in row_iter:
        yield (int(x0), float(x1), float(x2))

Here we've assumed that find_formats determined that there are three fields,
an int and two floats. Once you have this info you can build an appropriate
function and exec it. This would cut another couple levels out of the call
chain. Again, I haven't timed it, or tried it, but it looks like it would be
fun to try.
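
A rough sketch of the exec approach might look like the following. The names
make_row_converter and convert_rows are made up for illustration, and I am
assuming each row arrives as a delimited string and that the converters are
plain callables such as int and float. It is untimed, like everything else
here.

```python
def make_row_converter(converters, delim=","):
    """Build and exec a generator function specialized for one row layout.

    Hypothetical helper: for [int, float, float] the generated loop body is
        x0, x1, x2, = row.split(delim)
        yield (f0(x0), f1(x1), f2(x2),)
    """
    n = len(converters)
    fields = ", ".join("x%d" % i for i in range(n))
    calls = ", ".join("f%d(x%d)" % (i, i) for i in range(n))
    source = (
        "def convert_rows(row_iter):\n"
        "    for row in row_iter:\n"
        "        %s, = row.split(delim)\n"
        "        yield (%s,)\n"
    ) % (fields, calls)
    # The generated code looks up delim and f0, f1, ... in this namespace.
    namespace = {"delim": delim}
    for i, func in enumerate(converters):
        namespace["f%d" % i] = func
    exec(source, namespace)
    return namespace["convert_rows"]


convert_rows = make_row_converter([int, float, float])
rows = list(convert_rows(iter(["1,2.5,3.0", "4,5.5,6.0"])))
```

The generated function unpacks each row straight into local names, so a row
with the wrong number of fields raises ValueError instead of being silently
truncated the way zip would.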

-tim

>
>
> On 7/8/07, Vincent Nijs <v-nijs at kellogg.northwestern.edu> wrote:
> > Thanks for the reference John! csv2rec is about 30% faster than my code
> > on the same data.
> >
> > If I read the code in csv2rec correctly it converts the data as it is
> > being read using the csv module. My setup reads the whole dataset into
> > an array of strings and then converts the columns as appropriate.
> >
> > Best,
> >
> > Vincent
> >
> >
> > On 7/6/07 8:53 PM, "John Hunter" <jdh2358 at gmail.com> wrote:
> >
> > > On 7/6/07, Vincent Nijs <v-nijs at kellogg.northwestern.edu> wrote:
> > >> I wrote the attached (small) program to read in a text/csv file with
> > >> different data types and convert it into a recarray without having to
> > >> pre-specify the dtypes or variable names. I am just too lazy to
> > >> type in stuff like that :) The supported types are int, float, dates,
> > >> and strings.
> > >>
> > >> It works pretty well but it is not (yet) as fast as I would like, so I
> > >> was wondering if any of the numpy experts on this list might have some
> > >> suggestions on how to speed it up. I need to read 500MB-1GB files, so
> > >> speed is important for me.
> > >
> > > In matplotlib.mlab svn, there is a function csv2rec that does the
> > > same.  You may want to compare implementations in case we can
> > > fruitfully cross pollinate them.  In the examples directory, there is
> > > an example script examples/loadrec.py.
> > > _______________________________________________
> > > Numpy-discussion mailing list
> > > Numpy-discussion at scipy.org
> > > http://projects.scipy.org/mailman/listinfo/numpy-discussion

-- 
.  __
.   |-\
.
.  tim.hochberg at ieee.org