[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Sun Jul 8 19:11:58 EDT 2007

Thanks for looking into this Torgil! I agree that this is a much more
complicated setup. I'll check if there is anything I can do on the data end.
Otherwise I'll go with Timothy's suggestion and read in numbers as floats
and convert to int later as needed.
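
A minimal sketch of the "convert to int later" step, assuming the column has already been read as Python floats. The helper name `floats_to_ints_if_exact` is made up for illustration; it is not part of numpy or of Torgil's script.

```python
def floats_to_ints_if_exact(values):
    """Return the column as ints if every float is a whole number, else unchanged."""
    # float.is_integer() is True for 2.0 and False for 2174.875.
    # It is also False for nan and inf, so such columns stay float.
    if all(v.is_integer() for v in values):
        return [int(v) for v in values]
    return list(values)


print(floats_to_ints_if_exact([0.0, 2.0, 5.0]))       # [0, 2, 5]
print(floats_to_ints_if_exact([0.0, 2174.875, 5.0]))  # [0.0, 2174.875, 5.0]
```

A single fractional value anywhere in the column keeps the whole column as float, which is the conservative choice.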

Vincent

On 7/8/07 5:40 PM, "Torgil Svensson" <torgil.svensson at gmail.com> wrote:

>> Question: If you do ignore the int's initially, once the rec array is in
>> memory, would there be a quick way to check if the floats could pass as
>> int's? This may seem like a backwards approach but it might be 'safer' if
>> you really want to preserve the int's.
> 
> In your case the floats don't pass as ints since you have decimals.
> The attached file takes another approach (sorry for lack of comments).
> If the conversion fails, the current row is stored and the iterator
> exits (without setting a 'finished' parameter to true). The program
> then re-calculates the conversion-functions and checks for changes. If
> the changes are supported (=we have a conversion function for old data
> in the format_changes dictionary) it calls fromiter again with an
> iterator like this:
> 
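> def get_data_iterator(row_iter,delim,res):
>     # Re-yield the rows converted so far, changing column 0 to float.
>     for x0,x1,x2,x3,x4,x5 in res['data']:
>         x0=float(x0)
>         print (x0,x1,x2,x3,x4,x5)
>         yield (x0,x1,x2,x3,x4,x5)
>     # The stored row, converted with the updated functions.
>     yield (float('2.0'),int('2'),datestr2num('4/97'),
>            float('1.33'),float('2.26'),float('1.23'))
>     # Continue with the rows that have not been read yet.
>     for row in row_iter:
>         x0,x1,x2,x3,x4,x5=row.split(delim)
>         try:
>             yield (float(x0),int(x1),datestr2num(x2),
>                    float(x3),float(x4),float(x5))
>         except ValueError:
>             # Store the failing row and exit without setting 'finished'.
>             res['row']=row
>             return
>     res['finished']=True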
> 
> res['data'] is the previously converted data. This has the obvious
> disadvantage that if only the last row has fractions in a column,
> it'll cost double memory. Also if many columns change format at
> different places it has to re-convert every time.
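
The re-conversion idea described above can be sketched for a single column as follows. This is not the attached script: `convert_column` is a hypothetical helper, it handles only the int-to-float change, and it works on a plain list of strings rather than through fromiter.

```python
def convert_column(strings):
    """Convert strings to int, switching the whole column to float on first failure."""
    converted = []
    convert = int
    for s in strings:
        try:
            converted.append(convert(s))
        except ValueError:
            if convert is float:
                raise  # neither int nor float: not handled in this sketch
            # Change format: re-convert the values read so far, then go on as float.
            convert = float
            converted = [float(v) for v in converted]
            converted.append(float(s))
    return converted


print(convert_column(['0', '2', '2174.875', '3']))  # [0.0, 2.0, 2174.875, 3.0]
```

The list comprehension that re-converts the earlier values builds a second copy of the column, which is where the double memory cost comes from.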
> 
> I don't recommend this because of the drawbacks and extra complexity.
> I think it is best to convert your files (or file generation) so that
> float columns are represented with 0.0 instead of 0.
> 
> Best Regards,
> 
> //Torgil
> 
> On 7/8/07, Vincent Nijs <v-nijs at kellogg.northwestern.edu> wrote:
>> I am not (yet) very familiar with much of the functionality introduced in
>> your script Torgil (izip, imap, etc.), but I really appreciate you taking
>> the time to look at this!
>> 
>> The program stopped with the following error:
>> 
>>   File "load_iter.py", line 48, in <genexpr>
>>     convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
>> ValueError: invalid literal for int() with base 10: '2174.875'
>> 
>> A lot of the data I use can have a column with a set of int's (e.g., 0's),
>> but then the rest of that same column could be floats. I guess finding the
>> right conversion function is the tricky part. I was thinking about sampling
>> each, say, 10th obs to test which function to use. Not sure how that would
>> work however.
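
A rough sketch of the sampling idea, assuming a column is available as a list of strings. The names `guess_converter` and `step` are hypothetical and only cover the int, float, and string cases.

```python
def guess_converter(column, step=10):
    """Pick int, float, or str for a column by testing every `step`-th string."""
    sample = column[::step]
    for convert in (int, float):
        try:
            for item in sample:
                convert(item)
        except ValueError:
            continue  # this type failed on the sample, try the next one
        return convert
    return str


# The fractional value falls on a sampled row, so float is chosen.
print(guess_converter(['0'] * 10 + ['2174.875']) is float)  # True
# The fractional value falls between sampled rows, so int is (wrongly) chosen.
print(guess_converter(['0', '2174.875', '0']) is int)       # True
```

The second call shows the weakness: sampling can miss the one row that needs a float, so the chosen function still needs a fallback when it is applied to the full column.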
>> 
>> If I ignore the option of an int (i.e., everything is a float, date, or
>> string) then your script is about twice as fast as mine!!
>> 
>> Question: If you do ignore the int's initially, once the rec array is in
>> memory, would there be a quick way to check if the floats could pass as
>> int's? This may seem like a backwards approach but it might be 'safer' if
>> you really want to preserve the int's.
>> 
>> Thanks again!
>> 
>> Vincent
>> 
>> 
>> On 7/8/07 5:52 AM, "Torgil Svensson" <torgil.svensson at gmail.com> wrote:
>> 
>>> Given that both your script and the mlab version preload the whole
>>> file before calling numpy constructor I'm curious how that compares in
>>> speed to using numpy's fromiter function on your data. Using fromiter
>>> should improve on memory usage (~50% ?).
>>> 
>>> The drawback is for string columns where we no longer know the
>>> width of the largest item. I made it fall-back to "object" in this
>>> case.
>>> 
>>> Attached is a fromiter version of your script. Possible speedups could
>>> be done by trying different approaches to the "convert_row" function,
>>> for example using "zip" or "enumerate" instead of "izip".
>>> 
>>> Best Regards,
>>> 
>>> //Torgil
>>> 
>>> 
>>> On 7/8/07, Vincent Nijs <v-nijs at kellogg.northwestern.edu> wrote:
>>>> Thanks for the reference John! csv2rec is about 30% faster than my code on
>>>> the same data.
>>>> 
>>>> If I read the code in csv2rec correctly it converts the data as it is being
>>>> read using the csv module. My setup reads the whole dataset into an
>>>> array of strings and then converts the columns as appropriate.
>>>> 
>>>> Best,
>>>> 
>>>> Vincent
>>>> 
>>>> 
>>>> On 7/6/07 8:53 PM, "John Hunter" <jdh2358 at gmail.com> wrote:
>>>> 
>>>>> On 7/6/07, Vincent Nijs <v-nijs at kellogg.northwestern.edu> wrote:
>>>>>> I wrote the attached (small) program to read in a text/csv file with
>>>>>> different data types and convert it into a recarray without having to
>>>>>> pre-specify the dtypes or variable names. I am just too lazy to type in
>>>>>> stuff like that :) The supported types are int, float, dates, and
>>>>>> strings.
>>>>>> 
>>>>>> It works pretty well but it is not (yet) as fast as I would like, so I
>>>>>> was wondering if any of the numpy experts on this list might have some
>>>>>> suggestions on how to speed it up. I need to read 500MB-1GB files so
>>>>>> speed is important for me.
>>>>> 
>>>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>>>>> same.  You may want to compare implementations in case we can
>>>>> fruitfully cross pollinate them.  In the examples directory, there is an
>>>>> example script examples/loadrec.py
>>>>> _______________________________________________
>>>>> Numpy-discussion mailing list
>>>>> Numpy-discussion at scipy.org
>>>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> --
>> Vincent R. Nijs
>> Assistant Professor of Marketing
>> Kellogg School of Management, Northwestern University
>> 2001 Sheridan Road, Evanston, IL 60208-2001
>> Phone: +1-847-491-4574 Fax: +1-847-491-2498
>> E-mail: v-nijs at kellogg.northwestern.edu
>> Skype: vincentnijs
>> 
>> 
>> 
>> 
