
Hi again,
On 7/19/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
If memory really is an issue, you have the nice "load_spec" version and can always convert the files once by iterating over the file twice like the attached script does.
I discovered that my script was broken and too complex. The attached script is much cleaner and has better error messages.
Best regards,
//Torgil
On 7/19/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
Hi,
1. Your code is fast because you convert whole columns at once in numpy. The first step with the lists is also very fast (Python implements lists as arrays). I like your version; I think it's as fast as it gets in pure Python, and it has to keep only two versions of the data in memory at once (since the string versions can be garbage collected).
If memory really is an issue, you have the nice "load_spec" version and can always convert the files once by iterating over the file twice like the attached script does.
4. Okay, that makes sense. I was confused by the fact that your generated function had the same name as the builtin iter() operator.
//Torgil
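The two-pass idea mentioned above (scan the file once to work out a spec, then read it again to convert) might look roughly like the sketch below. This is not the attached script, which is not part of this archive. The helper names `scan_spec` and `load_two_pass` and the sample data are assumptions made up for illustration, and the text is read from a string where real use would open the file twice.

```python
import csv
import io

import numpy as np


def scan_spec(text):
    """Pass 1: read every cell once to pick a type and width per column."""
    reader = csv.reader(io.StringIO(text))
    names = next(reader)                 # assumes the first row holds names
    kinds = ["int"] * len(names)         # start with the narrowest guess
    widths = [1] * len(names)
    for row in reader:
        for k, cell in enumerate(row):
            widths[k] = max(widths[k], len(cell))
            if kinds[k] == "int":
                try:
                    int(cell)
                    continue
                except ValueError:
                    kinds[k] = "float"   # widen, then re-check as float below
            if kinds[k] == "float":
                try:
                    float(cell)
                except ValueError:
                    kinds[k] = "str"     # fall back to a fixed-width string
    formats = {"int": np.int64, "float": np.float64}
    dtype = [(n, formats.get(kind, "U%d" % w))
             for n, kind, w in zip(names, kinds, widths)]
    converters = [{"int": int, "float": float, "str": str}[kind]
                  for kind in kinds]
    return dtype, converters


def load_two_pass(text):
    """Pass 2: stream converted rows straight into the final array."""
    dtype, converters = scan_spec(text)
    reader = csv.reader(io.StringIO(text))
    next(reader)                         # skip the header row
    rows = (tuple(conv(cell) for conv, cell in zip(converters, row))
            for row in reader)
    return np.fromiter(rows, dtype=dtype)


# The "price" column starts with "0" and only later turns out to be a float.
sample = "name,count,price\napple,3,0\npear,10,2.5\n"
data = load_two_pass(sample)
print(data.dtype)
```

Only one row of Python objects is alive at a time in the second pass, which is the memory saving discussed here; the price is that every cell is parsed twice.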
On 7/19/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
Hi Torgil,
1. I got an email from Tim about this issue:
"I finally got around to doing some more quantitative comparisons between your code and the more complicated version that I proposed. The idea behind my code was to minimize memory usage -- I figured that keeping the memory usage low would make up for any inefficiencies in the conversion process since it's been my experience that memory bandwidth dominates a lot of numeric problems as problem sizes get reasonably large. I was mostly wrong. While it's true that for very large file sizes I can get my code to outperform yours, in most instances it lags behind. And the range where it does better is a fairly small range right before the machine dies with a memory error. So my conclusion is that the extra hoops my code goes through to avoid allocating extra memory aren't worth it for you to bother with."
The approach in my code is simple and robust to most data issues I could come up with. It will actually do an appropriate conversion if there are missing values, or ints and floats in the same column. It will select an appropriate string length as well. It may not be the most memory-efficient setup, but given Tim's comments it is a pretty decent solution for the types of data I have access to.
2. Fixed the spelling error :)
3. I guess that is the same thing. I am not very familiar with zip, izip, map etc. just yet :) Thanks for the tip!
4. I gave the function generated with exec the name iter(). I need that function to transform the data using the types provided by the user.
Best,
Vincent
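A minimal sketch of the column-at-a-time conversion described above, assuming the rows have already been read into lists of strings. It is not the attached program. The helpers `convert_column` and `rows_to_recarray` are hypothetical names, and treating "" and "NA" as missing values is an assumption.

```python
import numpy as np

MISSING = ("", "NA")   # assumed markers for missing values


def convert_column(cells):
    """Convert one column of strings to int, float or string, whichever fits."""
    if not any(c in MISSING for c in cells):
        try:
            return np.array([int(c) for c in cells], dtype=np.int64)
        except ValueError:
            pass                          # not all ints, try floats next
    try:
        # Floats cover mixed int/float columns; missing cells become nan.
        return np.array([np.nan if c in MISSING else float(c) for c in cells],
                        dtype=np.float64)
    except ValueError:
        # numpy sizes the string dtype to fit the longest cell.
        return np.array(cells)


def rows_to_recarray(names, rows):
    """Transpose the rows into columns, convert each column, build a recarray."""
    columns = zip(*rows)
    arrays = [convert_column(col) for col in columns]
    return np.rec.fromarrays(arrays, names=names)


rows = [["a", "1", "1"], ["bcd", "", "2.5"]]
rec = rows_to_recarray(["s", "x", "y"], rows)
print(rec.dtype)
```

Both the string cells and the converted arrays exist in memory at the same moment, which is the overhead the discussion weighs against the simplicity of this approach.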
On 7/18/07 7:57 PM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:
Nice,
I haven't gone through all details. That's a nice new "missing" feature, maybe all instances where we can't find a conversion should be "nan". A few comments:
1. The "load_search" function contains all the memory/performance overhead that we wanted to avoid with the fromiter function. Does this mean that you no longer have large text files that change string representation in the columns (aka "0" floats)?
2. ident=" "*4 has the same spelling error as in my first compile try; it was meant to be "indent".
3. types = list((i,j) for i, j in zip(varnm, types2)) Isn't this the same as "types = zip(varnm, types2)" ?
4. return N.fromiter(iter(reader),dtype = types) Isn't "reader" an iterator already? What does the "iter()" operator do in this case?
Best regards,
//Torgil
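Points 3 and 4 can be checked with a short sketch. The values of `varnm` and `types2` and the sample data are assumptions for illustration. In current Python, zip() returns an iterator, so it is wrapped in list() here; in the Python 2 of this thread, zip() already returned a list.

```python
import csv
import io

import numpy as np

varnm = ["id", "price"]            # assumed example field names
types2 = [np.int64, np.float64]    # assumed example field types

# Point 3: the generator expression only re-wraps the pairs zip already yields.
verbose = list((i, j) for i, j in zip(varnm, types2))
direct = list(zip(varnm, types2))
same = verbose == direct

# Point 4: calling the builtin iter() on an iterator returns that same object.
reader = csv.reader(io.StringIO("1,0\n2,2.5\n"))
reader_is_own_iterator = iter(reader) is reader

# fromiter with a structured dtype expects each item to be a tuple of
# already-converted values, so the string cells are converted first.
rows = ((int(a), float(b)) for a, b in reader)
data = np.fromiter(rows, dtype=direct)
print(same, reader_is_own_iterator, data)
```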
On 7/18/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
I combined some of the very useful comments/code from Tim and Torgil and came up with the attached program to read csv files and convert the data into a recarray. I couldn't use all of their suggestions because, frankly, I didn't understand all of them :)
The program uses variable names if provided in the csv file and can auto-detect data types. However, I also wanted to make it easy to specify data types and/or variable names if so desired. Examples are at the bottom of the file. Comments are very welcome.
Thanks,
Vincent
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
--
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: v-nijs@kellogg.northwestern.edu
Skype: vincentnijs