Hi Torgil,

1. I got an email from Tim about this issue:

"I finally got around to doing some more quantitative comparisons between your code and the more complicated version that I proposed. The idea behind my code was to minimize memory usage -- I figured that keeping the memory usage low would make up for any inefficiencies in the conversion process since it's been my experience that memory bandwidth dominates a lot of numeric problems as problem sizes get reasonably large. I was mostly wrong. While it's true that for very large file sizes I can get my code to outperform yours, in most instances it lags behind. And the range where it does better is a fairly small range right before the machine dies with a memory error. So my conclusion is that the extra hoops my code goes through to avoid allocating extra memory isn't worth it for you to bother with."

The approach in my code is simple and robust to most data issues I could come up with. It will actually do an appropriate conversion if there are missing values or ints and floats in the same column. It will select an appropriate string length as well. It may not be the most memory-efficient setup, but given Tim's comments it is a pretty decent solution for the types of data I have access to.

2. Fixed the spelling error :)

3. I guess that is the same thing. I am not very familiar with zip, izip, map etc. just yet :) Thanks for the tip!

4. The function I generate using exec is named iter(). I need that function to transform the data using the types provided by the user.

Best,
Vincent

On 7/18/07 7:57 PM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:
Nice,
I haven't gone through all details. That's a nice new "missing" feature, maybe all instances where we can't find a conversion should be "nan". A few comments:
1. The "load_search" function contains all the memory/performance overhead that we wanted to avoid with the fromiter function. Does this mean that you no longer have large text files that change string representation in the columns (aka "0" floats)?

2. ident=" "*4
   This has the same spelling error as in my first compile try ... it was meant to be "indent".

3. types = list((i,j) for i, j in zip(varnm, types2))
   Isn't this the same as "types = zip(varnm, types2)"?

4. return N.fromiter(iter(reader),dtype = types)
   Isn't "reader" an iterator already? What does the "iter()" operator do in this case?
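
Points 3 and 4 can be checked with a few lines of standard-library Python. This is only a minimal sketch, not the attached program: the column names, type strings and CSV text are made up for illustration, and it assumes Python 3, where zip() returns a lazy iterator and needs list() around it before comparing (under Python 2, zip() returns a list directly).

```python
import csv
import io

# Hypothetical column names and type strings, for illustration only.
varnm = ["id", "price", "name"]
types2 = ["i4", "f8", "S10"]

# Point 3: the generator expression only re-packs the pairs zip() yields.
explicit = list((i, j) for i, j in zip(varnm, types2))
direct = list(zip(varnm, types2))
same_pairs = explicit == direct

# Point 4: csv.reader already returns an iterator, so the built-in iter()
# hands back the very same object and does no extra work.
reader = csv.reader(io.StringIO("1,2.5,abc\n2,3.5,def\n"))
same_object = iter(reader) is reader

# Rows come out as lists of strings; no type conversion has happened yet.
first_row = next(reader)
```

The identity check only holds for the built-in iter(). If the name iter is rebound to a user-defined function, the call does whatever that function does instead.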
Best regards,
//Torgil
On 7/18/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
I combined some of the very useful comments/code from Tim and Torgil and came up with the attached program to read csv files and convert the data into a recarray. I couldn't use all of their suggestions because, frankly, I didn't understand all of them :)
The program uses variable names if provided in the csv file and can auto-detect data types. However, I also wanted to make it easy to specify data types and/or variable names if so desired. Examples are at the bottom of the file. Comments are very welcome.
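
For readers without the attachment, here is a rough sketch of what auto-detecting column types from a csv file can look like. It is not the attached program: guess_type, the sample data and the type labels are all hypothetical, and the rule that a numeric column with missing entries becomes float (so gaps could later be stored as nan) is an assumption made for this example.

```python
import csv
import io

def guess_type(values):
    """Guess 'int', 'float' or 'str<N>' for one column of raw strings.

    Hypothetical helper: empty strings count as missing values.
    """
    present = [v for v in values if v.strip()]
    has_missing = len(present) < len(values)

    def all_parse(caster):
        # True if every non-missing entry converts without error.
        try:
            for v in present:
                caster(v)
        except ValueError:
            return False
        return True

    if present and not has_missing and all_parse(int):
        return "int"
    if all_parse(float):
        # Covers mixed ints/floats, numeric columns with gaps,
        # and columns that are entirely missing.
        return "float"
    # Fall back to a string sized to the longest entry.
    return "str%d" % max(len(v) for v in values)

# Made-up sample file: header row followed by data rows.
text = "id,price,name\n1,2.5,abc\n2,,defgh\n3,4,x\n"
rows = list(csv.reader(io.StringIO(text)))
header, data = rows[0], rows[1:]

# Transpose rows into columns, then pair each name with its guessed type.
columns = list(zip(*data))
types = [(name, guess_type(col)) for name, col in zip(header, columns)]
```

Note that this sketch reads the whole file into memory before guessing, so it trades memory for simplicity.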
Thanks,
Vincent
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
--
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574
Fax: +1-847-491-2498
E-mail: v-nijs@kellogg.northwestern.edu
Skype: vincentnijs