Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

8 Jul 2007

Hi

I stumble on these types of problems from time to time so I'm
interested in efficient solutions myself.

Do you have a column which starts with something suitable for int on
the first row (without decimal separator) but has decimals further
down?

This will be a little tricky to support. One solution could be to raise
StopIteration, calculate new type-conversion functions and start over,
iterating over both the old data and the rest of the iterator.
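
A rough sketch of that "start over" idea, written for Python 3 (where the
built-in zip replaces izip). This is not the attached load_gen_iter.py; the
names CONVERTERS and convert_with_restart are made up for illustration, and
the promotion order int -> float -> str is an assumption. It keeps the raw
rows in memory so they can be reconverted, which gives up some of the memory
savings of a pure iterator approach.

```python
# Hypothetical promotion order: try the strictest converter first.
CONVERTERS = [int, float, str]


def convert_with_restart(rows):
    """Convert rows of strings, promoting a column's type when a value fails.

    Raw rows are remembered so that everything read so far can be
    reconverted after a promotion (the "start over" step).
    """
    seen = []      # raw rows read so far
    levels = None  # per-column index into CONVERTERS
    out = []
    for raw in rows:
        seen.append(raw)
        if levels is None:
            levels = [0] * len(raw)
        try:
            out.append(tuple(CONVERTERS[l](x) for l, x in zip(levels, raw)))
        except ValueError:
            # Promote every column that cannot parse its value in this row.
            # str() never raises ValueError, so the inner loop terminates.
            for i, x in enumerate(raw):
                while True:
                    try:
                        CONVERTERS[levels[i]](x)
                        break
                    except ValueError:
                        levels[i] += 1
            # Start over: reconvert all rows seen so far with the new types.
            out = [tuple(CONVERTERS[l](x) for l, x in zip(levels, r))
                   for r in seen]
    return out, [CONVERTERS[l] for l in (levels or [])]


rows = [["0", "a"], ["0", "b"], ["2174.875", "c"]]
converted, types = convert_with_restart(rows)
```

Here the first column starts out as int and is promoted to float when
'2174.875' shows up in the third row.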

It'd be great if you could try the load_gen_iter.py I've attached to
my response to Tim.

Best Regards,

//Torgil

On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
...
I am not (yet) very familiar with much of the functionality introduced in
your script Torgil (izip, imap, etc.), but I really appreciate you taking
the time to look at this!
The program stopped with the following error:
File "load_iter.py", line 48, in <genexpr>
    convert_row=lambda r: tuple(fn(x) for fn,x in
izip(conversion_functions,r))
ValueError: invalid literal for int() with base 10: '2174.875'
A lot of the data I use can have a column with a set of int's (e.g., 0's),
but then the rest of that same column could be floats. I guess finding the
right conversion function is the tricky part. I was thinking about sampling
each, say, 10th obs to test which function to use. Not sure how that would
work however.
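
One way the sampling idea might look, as a sketch only. The function name
guess_converter and the default step of 10 are illustrative assumptions. Note
that sampling can still pick int for a column whose only float falls between
sampled rows, so the full conversion can fail later with the same ValueError.

```python
from itertools import islice


def guess_converter(column, step=10):
    """Guess int, float or str for a column by testing every step-th value."""
    sample = list(islice(column, 0, None, step))
    for fn in (int, float):
        try:
            for x in sample:
                fn(x)
            return fn
        except ValueError:
            continue
    return str


# The float sits at index 10, so it is part of the sample.
col = ["0"] * 10 + ["2174.875"] + ["0"] * 5
guessed = guess_converter(col)

# The float sits at index 1, so the sample (index 0 only) misses it.
missed = guess_converter(["0", "2174.875", "0"])
```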
If I ignore the option of an int (i.e., everything is a float, date, or
string) then your script is about twice as fast as mine!!
Question: If you do ignore the int's initially, once the rec array is in
memory, would there be a quick way to check if the floats could pass as
int's? This may seem like a backwards approach but it might be 'safer' if
you really want to preserve the int's.
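
A possible sketch of that backwards check, assuming the data is already in a
structured array (or recarray) with float fields. The helper name
floats_to_ints_where_possible is made up, and casting to int64 is an
assumption; values outside the int64 range are not handled.

```python
import numpy as np


def floats_to_ints_where_possible(rec):
    """Return a copy where float fields holding only whole numbers are int64."""
    new_descr = []
    for name in rec.dtype.names:
        col = rec[name]
        dt = col.dtype
        # A float column can pass as int if it has no NaN/inf values
        # and every value equals its own floor.
        if (dt.kind == "f" and np.all(np.isfinite(col))
                and np.all(col == np.floor(col))):
            dt = np.dtype(np.int64)
        new_descr.append((name, dt))
    out = np.empty(rec.shape, dtype=new_descr)
    for name, dt in new_descr:
        out[name] = rec[name].astype(dt)
    return out


rec = np.array([(0.0, 1.5, "a"), (2.0, 2.0, "b")],
               dtype=[("x", "f8"), ("y", "f8"), ("s", "U1")])
checked = floats_to_ints_where_possible(rec)
```

In this example field "x" becomes an integer column, while "y" stays float
because of the 1.5.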
Thanks again!
Vincent
On 7/8/07 5:52 AM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:
...
Given that both your script and the mlab version preload the whole
file before calling the numpy constructor, I'm curious how that compares in
speed to using numpy's fromiter function on your data. Using fromiter
should improve on memory usage (~50% ?).
The drawback is for string columns, where we no longer know the
width of the largest item. I made it fall back to "object" in this
case.
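
For illustration, a minimal fromiter sketch in Python 3. This is not the
attached script: the column names, the converters, and the fixed string
width "U8" are assumptions chosen for the example. A fixed width has to be
picked up front here precisely because the widest item is not known while
streaming, and longer strings would be truncated to that width.

```python
import csv
import io

import numpy as np


def rows_as_tuples(lines, converters):
    """Yield one converted tuple per csv row without preloading the file."""
    for row in csv.reader(lines):
        yield tuple(fn(x) for fn, x in zip(converters, row))


# Stand-in for an open file; a real file object works the same way.
text = "1,2.5,abc\n2,3.5,de\n"

# Assumed layout: the dtype must be known before iteration starts.
dtype = np.dtype([("id", "i8"), ("value", "f8"), ("label", "U8")])
converters = (int, float, str)

arr = np.fromiter(rows_as_tuples(io.StringIO(text), converters), dtype=dtype)
```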
Attached is a fromiter version of your script. Possible speedups could
be done by trying different approaches to the "convert_row" function,
for example using "zip" or "enumerate" instead of "izip".
Best Regards,
//Torgil
On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
...
Thanks for the reference John! csv2rec is about 30% faster than my code on
the same data.
If I read the code in csv2rec correctly it converts the data as it is being
read using the csv module. My setup reads in the whole dataset into an
array of strings and then converts the columns as appropriate.
Best,
Vincent
On 7/6/07 8:53 PM, "John Hunter" <jdh2358@gmail.com> wrote:
...
On 7/6/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
...
I wrote the attached (small) program to read in a text/csv file with
different data types and convert it into a recarray without having to
pre-specify the dtypes or variable names. I am just too lazy to type in
stuff like that :) The supported types are int, float, dates, and strings.
It works pretty well but it is not (yet) as fast as I would like so I was
wondering if any of the numpy experts on this list might have some suggestions
on how to speed it up. I need to read 500MB-1GB files so speed is important
for me.
In matplotlib.mlab svn, there is a function csv2rec that does the
same.  You may want to compare implementations in case we can
fruitfully cross pollinate them.  In the examples directory, there is an
example script examples/loadrec.py
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
--
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: v-nijs@kellogg.northwestern.edu
Skype: vincentnijs