segfault caused by incorrect Py_DECREF in ufunc
Below is the code around line 900 for ufuncobject.c (http://svn.scipy.org/svn/numpy/trunk/numpy/core/src/ufuncobject.c)
There is a decref labeled with ">>>" below that is incorrect. As per the python documentation (http://docs.python.org/api/dictObjects.html):
#PyObject* PyDict_GetItem( PyObject *p, PyObject *key)
#
#Return value: Borrowed reference.
#Return the object from dictionary p which has a key key. Return NULL if the key
#key is not present, but without setting an exception.
PyDict_GetItem returns a borrowed reference. Therefore this code does not own the contents to which the obj pointer points and should not decref on it. Simply removing the Py_DECREF(obj) line gets rid of the segfault.
I was wondering if someone could confirm that my interpretation is correct and remove the line. I don't have access to the svn or know how to change it.
Most people do not see this problem because it only affects user-defined types.
Tom
    if (userdef > 0) {
        PyObject *key, *obj;
        int ret;
        obj = NULL;
        key = PyInt_FromLong((long) userdef);
        if (key == NULL) return -1;
        obj = PyDict_GetItem(self->userloops, key);
        Py_DECREF(key);
        if (obj == NULL) {
            PyErr_SetString(PyExc_TypeError,
                            "user-defined type used in ufunc" \
                            " with no registered loops");
            return -1;
        }
        /* extract the correct function
           data and argtypes
        */
        ret = _find_matching_userloop(obj, arg_types, scalars,
                                      function, data, self->nargs,
                                      self->nin);
    >>> Py_DECREF(obj);
        return ret;
    }
You are right on with your analysis. Thank you for the test, check, and fix. I've changed it in SVN. Best regards, Travis
Thanks!
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
I wrote the attached (small) program to read in a text/csv file with different data types and convert it into a recarray without having to pre-specify the dtypes or variable names. I am just too lazy to type in stuff like that :) The supported types are int, float, dates, and strings.
It works pretty well, but it is not (yet) as fast as I would like, so I was wondering if any of the numpy experts on this list might have some suggestions on how to speed it up. I need to read 500MB-1GB files, so speed is important for me.
Thanks,
Vincent
In matplotlib.mlab svn, there is a function csv2rec that does the same. You may want to compare implementations in case we can fruitfully cross-pollinate them. In the examples directory, there is an example script, examples/loadrec.py.
Thanks for the reference John! csv2rec is about 30% faster than my code on
the same data.
If I read the code in csv2rec correctly, it converts the data as it is being
read using the csv module. My setup reads the whole dataset into an
array of strings and then converts the columns as appropriate.
Best,
Vincent
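The two strategies contrasted above can be sketched side by side. This is only an illustration, not the code of csv2rec or of the attached script: the function names, the sample text, and the assumption that the column types are already known are all hypothetical.

```python
import csv
import io

import numpy as np

# Hypothetical two-column sample; the real files are far larger.
SAMPLE = "a,b\n1,2.5\n3,4.5\n"


def read_converting(f, converters):
    """Convert each field while the csv module is reading the rows."""
    reader = csv.reader(f)
    header = next(reader)
    rows = [tuple(fn(x) for fn, x in zip(converters, row)) for row in reader]
    return header, rows


def read_then_convert(f, converters):
    """Load everything as strings first, then convert column by column."""
    reader = csv.reader(f)
    header = next(reader)
    raw = np.array(list(reader))  # 2-D array of strings
    cols = [raw[:, i].astype(fn) for i, fn in enumerate(converters)]
    return header, cols


header_a, rows = read_converting(io.StringIO(SAMPLE), [int, float])
header_b, cols = read_then_convert(io.StringIO(SAMPLE), [int, float])
```

The second variant holds the complete string array in memory before any conversion happens, which is the cost the rest of the thread tries to avoid.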
Given that both your script and the mlab version preload the whole
file before calling the numpy constructor, I'm curious how that compares in
speed to using numpy's fromiter function on your data. Using fromiter
should improve on memory usage (~50% ?).
The drawback is for string columns, where we no longer know the
width of the largest item. I made it fall back to "object" in this
case.
Attached is a fromiter version of your script. Possible speedups could
be achieved by trying different approaches to the "convert_row" function,
for example using "zip" or "enumerate" instead of "izip".
Best Regards,
//Torgil
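A minimal sketch of the fromiter idea, not the attached script: a generator yields one converted tuple per line and numpy.fromiter fills a structured array from it, so no intermediate list of rows is built. The field names, the sample lines, and the helper parse_rows are assumptions made for illustration, and only numeric columns are shown (the "object" fallback for strings is left out).

```python
import numpy as np


def parse_rows(lines, delim, converters):
    """Yield one tuple of converted values per input line."""
    for line in lines:
        fields = line.rstrip("\n").split(delim)
        yield tuple(fn(x) for fn, x in zip(converters, fields))


# Hypothetical data: one int column and two float columns.
lines = ["1,1.12,2.11", "2,1.21,3.12"]
dtype = np.dtype([("id", np.int64), ("x", np.float64), ("y", np.float64)])

# fromiter consumes the generator row by row into the structured array.
arr = np.fromiter(parse_rows(lines, ",", (int, float, float)), dtype=dtype)
```

If the number of rows is known in advance, passing it as the count argument lets fromiter allocate the output once instead of growing it.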
I suspect that you'd do better here if you removed a bunch of layers from the conversion functions. Right now it looks like: imap->chain->convert_row->tuple->generator->izip. That's five levels deep and Python functions are reasonably expensive. I would try to be a lot less clever and do something like:
    def data_iterator(row_iter, delim):
        row0 = next(row_iter).split(delim)
        converters = find_formats(row0)  # left as an exercise
        yield tuple(f(x) for f, x in zip(converters, row0))
        for row in row_iter:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
That's just a sketch and I haven't timed it, but it cuts a few levels out of the call chain, so has a reasonable chance of being faster. If you wanted to be really clever, you could use some exec magic after you figure out the conversion functions to compile a special function that generates the tuples directly without any use of tuple or zip. I don't have time to work through the details right now, but the code you would compile would end up looking like this:
    for (x0, x1, x2) in row_iter:
        yield (int(x0), float(x1), float(x2))
Here we've assumed that find_formats determined that there are three fields, an int and two floats. Once you have this info you can build an appropriate function and exec it. This would cut another couple levels out of the call chain. Again, I haven't timed it, or tried it, but it looks like it would be fun to try.
tim
tim.hochberg@ieee.org
That sounds sane. I've maybe been attracted to bad habits here and got away with it since I'm very I/O-bound in these cases. My main objective has been reducing memory footprint to reduce swapping.
Thank you for the lesson! Great tip. This opens up a variety of new coding options. I've made an attempt at the fun part. Attached is a version that generates the following generator code for Vincent's __name__ == '__main__' code:
    def get_data_iterator(row_iter,delim):
        yield (int('1'),int('3'),datestr2num('1/97'),float('1.12'),float('2.11'),float('1.2'))
        for row in row_iter:
            x0,x1,x2,x3,x4,x5=row.split(delim)
            yield (int(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))
Best Regards,
//Torgil
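For readers without the attachment, here is a rough sketch of how such a generator could be built with exec. It is not load_gen_iter.py: the function make_row_converter is hypothetical, and it assumes the converters are given by name and are builtins such as int or float.

```python
def make_row_converter(converter_names, delim):
    """Build and exec source for a generator specialised to one row layout.

    converter_names holds names of builtins, e.g. ["int", "float"].
    Only pass trusted names: the text ends up in exec'd source code.
    """
    n = len(converter_names)
    names = ["x%d" % i for i in range(n)]
    # A trailing comma keeps single-column unpacking valid: "x0, = ..."
    unpack = ", ".join(names) + ("," if n == 1 else "")
    calls = ", ".join("%s(%s)" % (fn, x) for fn, x in zip(converter_names, names))
    source = (
        "def data_iterator(row_iter):\n"
        "    for row in row_iter:\n"
        "        %s = row.split(%r)\n"
        "        yield (%s,)\n" % (unpack, delim, calls)
    )
    namespace = {}
    exec(source, namespace)  # defines data_iterator inside namespace
    return namespace["data_iterator"]


convert = make_row_converter(["int", "float", "float"], ",")
result = list(convert(["1,2.5,3.5", "4,5.5,6.5"]))
```

The generated function calls the converters directly on unpacked names, so each row costs no tuple() or zip() call, which is the saving described in the exchange above.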
I am not (yet) very familiar with much of the functionality introduced in
your script Torgil (izip, imap, etc.), but I really appreciate you taking
the time to look at this!
The program stopped with the following error:
  File "load_iter.py", line 48, in <genexpr>
    convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
ValueError: invalid literal for int() with base 10: '2174.875'
A lot of the data I use can have a column with a set of int's (e.g., 0's),
but then the rest of that same column could be floats. I guess finding the
right conversion function is the tricky part. I was thinking about sampling
every, say, 10th observation to test which function to use. Not sure how that
would work, however.
If I ignore the option of an int (i.e., everything is a float, date, or
string) then your script is about twice as fast as mine!!
Question: If you do ignore the int's initially, once the rec array is in
memory, would there be a quick way to check if the floats could pass as
int's? This may seem like a backwards approach but it might be 'safer' if
you really want to preserve the int's.
Thanks again!
Vincent
 Vincent R. Nijs Assistant Professor of Marketing Kellogg School of Management, Northwestern University 2001 Sheridan Road, Evanston, IL 602082001 Phone: +18474914574 Fax: +18474912498 Email: vnijs@kellogg.northwestern.edu Skype: vincentnijs
Hi
I stumble on these types of problems from time to time so I'm
interested in efficient solutions myself.
Do you have a column which starts with something suitable for int on
the first row (without decimal separator) but has decimals further
down?
This will be a little tricky to support. One solution could be to raise
StopIteration, calculate new type-conversion functions and start over,
iterating over both the old data and the rest of the iterator.
It'd be great if you could try the load_gen_iter.py I've attached to
my response to Tim.
Best Regards,
//Torgil
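The start-over idea can be sketched as follows. This is an assumption-laden illustration rather than anything from the attached scripts: the function name and the int, float, str fallback order are hypothetical, the rows are kept in a list so they can be re-read, and the last entry of the ladder must accept any string.

```python
def convert_with_restart(rows, ladder=(int, float, str)):
    """Convert rows of strings, widening a column's type when it fails.

    When a value does not convert, the failing column moves one step up
    the ladder and conversion starts over from the first row.
    """
    rows = list(rows)  # kept in memory so we can start over
    levels = [0] * len(rows[0])
    restart = True
    while restart:
        restart = False
        out = []
        for row in rows:
            try:
                out.append(tuple(ladder[lv](x) for lv, x in zip(levels, row)))
            except ValueError:
                # Widen every column of this row that does not convert.
                for i, x in enumerate(row):
                    while True:
                        try:
                            ladder[levels[i]](x)
                            break
                        except ValueError:
                            levels[i] += 1
                restart = True
                break
    return out, [ladder[lv] for lv in levels]


# Hypothetical rows: the last column looks like int at first, then turns float.
sample = [["1", "1/97", "0"], ["2", "3/97", "0"], ["2", "4/97", "1.23"]]
converted, converters = convert_with_restart(sample)
```

Every restart reconverts all earlier rows, so this only pays off when type changes are rare; a streaming version would need to buffer the rows already yielded.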
Torgil,
The function seems to work well and is slightly faster than your previous
version (about 1/6th faster).
Yes, I do have columns that start with, what looks like, int's and then turn
out to be floats. Something like below (col6).
data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
['1','3','1/97','1.12','2.11','0'],
['1','2','3/97','1.21','3.12','0'],
['2','1','2/97','1.12','2.11','0'],
['2','2','4/97','1.33','2.26','1.23'],
['2','2','5/97','1.73','2.42','1.26']]
I think what your function assumes is that the 1st element will be the
appropriate type. That may not hold if you have missing values or 'mixed
types'.
Best,
Vincent
Vincent,
Do you need to auto-detect the column types? Things get a lot simpler if you have some known schema for each file; then you can simply pass that to some reader function. It's also more robust, since there's no way in general to differentiate a column of integers from a column of floats with no decimal part.
If you do need to auto-detect, one approach would be to always read both int-like stuff and float-like stuff in as floats. Then, after you get the array, check over the various columns, and if any have no fractional parts, make a new array where those columns are integers.
tim
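A sketch of the read-as-float-then-downcast approach, under stated assumptions: the helper name downcast_integral_columns is hypothetical, int64 is chosen arbitrarily as the integer type, and columns containing NaN or infinity are left as floats.

```python
import numpy as np


def downcast_integral_columns(rec):
    """Return a copy in which float columns without fractional parts are ints."""
    new_descr = []
    for name in rec.dtype.names:
        col = rec[name]
        is_float = col.dtype.kind == "f"
        if is_float and np.all(np.isfinite(col)) and np.all(col == np.round(col)):
            new_descr.append((name, np.int64))
        else:
            new_descr.append((name, col.dtype))
    out = np.empty(rec.shape, dtype=new_descr)
    for name in rec.dtype.names:
        out[name] = rec[name]  # field-wise copy, casting where the type changed
    return out


# Hypothetical array: column "a" holds whole numbers, column "b" does not.
rec = np.array([(0.0, 1.5), (2.0, 2.5)], dtype=[("a", float), ("b", float)])
fixed = downcast_integral_columns(rec)
```

As noted above, this cannot tell a true integer column from a float column that happens to contain only whole numbers, so a known schema remains the more robust choice.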
On 7/8/07 3:31 PM, "Torgil Svensson"
wrote: Hi
I stumble on these types of problems from time to time so I'm interested in efficient solutions myself.
Do you have a column which starts with something suitable for int on the first row (without decimal separator) but has decimals further down?
This will be little tricky to support. One solution could be to yield StopIteration, calculate new typeconversionfunctions and start over iterating over both the old data and the rest of the iterator.
It'd be great if you could try the load_gen_iter.py I've attached to my response to Tim.
Best Regards,
//Torgil
I am not (yet) very familiar with much of the functionality introduced in your script Torgil (izip, imap, etc.), but I really appreciate you taking the time to look at this!
The program stopped with the following error:
File "load_iter.py", line 48, in <genexpr> convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r)) ValueError: invalid literal for int() with base 10: '2174.875'
A lot of the data I use can have a column with a set of int¹s (e.g., 0¹s), but then the rest of that same column could be floats. I guess finding
On 7/8/07, Vincent Nijs
wrote: the right conversion function is the tricky part. I was thinking about sampling each, say, 10th obs to test which function to use. Not sure how that would work however.
If I ignore the option of an int (i.e., everything is a float, date, or string) then your script is about twice as fast as mine!!
Question: If you do ignore the int's initially, once the rec array is in memory, would there be a quick way to check if the floats could pass as int's? This may seem like a backwards approach but it might be 'safer' if you really want to preserve the int's.
Thanks again!
Vincent
On 7/8/07 5:52 AM, "Torgil Svensson"
wrote: Given that both your script and the mlab version preloads the whole file before calling numpy constructor I'm curious how that compares in speed to using numpy's fromiter function on your data. Using fromiter should improve on memory usage (~50% ?).
The drawback is for string columns where we don't longer know the width of the largest item. I made it fallback to "object" in this case.
Attached is a fromiter version of your script. Possible speedups could be done by trying different approaches to the "convert_row" function, for example using "zip" or "enumerate" instead of "izip".
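For readers on Python 3, where `itertools.izip` no longer exists and the built-in `zip` is already lazy, the row conversion could be sketched roughly as below. `make_convert_row` is a hypothetical name and the converter list is made up for illustration; the attached script may well differ.

```python
def make_convert_row(conversion_functions):
    """Build a function that applies one converter per field of a row."""
    def convert_row(row):
        # zip pairs each converter with the field in the same position
        return tuple(fn(x) for fn, x in zip(conversion_functions, row))
    return convert_row


convert_row = make_convert_row([int, float, str])
print(convert_row(['1', '1.12', 'abc']))
```

Which variant is fastest would have to be measured on the real data; the sketch makes no claim about that.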
Best Regards,
//Torgil
On 7/8/07, Vincent Nijs wrote:
Thanks for the reference John! csv2rec is about 30% faster than my code on the same data.
If I read the code in csv2rec correctly, it converts the data as it is being read using the csv module. My setup reads the whole dataset into an array of strings and then converts the columns as appropriate.
Best,
Vincent
On 7/6/07 8:53 PM, "John Hunter" wrote:
On 7/6/07, Vincent Nijs wrote:
> I wrote the attached (small) program to read in a text/csv file with different data types and convert it into a recarray without having to prespecify the dtypes or variable names. I am just too lazy to type in stuff like that :) The supported types are int, float, dates, and strings.
>
> It works pretty well but it is not (yet) as fast as I would like, so I was wondering if any of the numpy experts on this list might have some suggestions on how to speed it up. I need to read 500MB-1GB files so speed is important for me.
In matplotlib.mlab svn, there is a function csv2rec that does the same. You may want to compare implementations in case we can fruitfully cross-pollinate them. In the examples directory, there is an example script examples/loadrec.py
_______________________________________________ Numpydiscussion mailing list Numpydiscussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpydiscussion
 Vincent R. Nijs Assistant Professor of Marketing Kellogg School of Management, Northwestern University 2001 Sheridan Road, Evanston, IL 602082001 Phone: +18474914574 Fax: +18474912498 Email: vnijs@kellogg.northwestern.edu Skype: vincentnijs
 . __ . \ . . tim.hochberg@ieee.org
Tim,
I do want to autodetect. Reading numbers in as floats is probably not a
huge penalty.
Is there an easy way to change the type of one column in a recarray that you
know?
I tried this:
ra.col1 = ra.col1.astype('i')
but that didn't seem to work. I assume that means you would have to create a
new array from the old one with an updated dtype list.
Thanks,
Vincent
On 7/8/07 4:51 PM, "Timothy Hochberg" wrote:
On 7/8/07, Vincent Nijs wrote:
Torgil,
The function seems to work well and is slightly faster than your previous version (about 1/6th faster).
Yes, I do have columns that start with what look like ints and then turn out to be floats. Something like below (col6).
data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
        ['1','3','1/97','1.12','2.11','0'],
        ['1','2','3/97',' 1.21','3.12','0'],
        ['2','1','2/97','1.12','2.11','0'],
        ['2','2','4/97','1.33','2.26',' 1.23'],
        ['2','2','5/97','1.73','2.42','1.26']]
I think what your function assumes is that the 1st element will be the appropriate type. That may not hold if you have missing values or 'mixed types'.
Vincent,
Do you need to auto detect the column types? Things get a lot simpler if you have some known schema for each file; then you can simply pass that to some reader function. It's also more robust since there's no way in general to differentiate a column of integers from a column of floats with no decimal part.
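As an illustration of the known-schema route, here is a minimal sketch using only the standard library. The schema, the column names and `read_with_schema` are all hypothetical, and the result is a list of dicts rather than a recarray, so it shows the idea and not a drop-in reader.

```python
import csv
import io

# Hypothetical schema: one (name, converter) pair per column, in file order.
SCHEMA = [('col1', int), ('col2', float), ('col3', str)]


def read_with_schema(fileobj, schema, delimiter=','):
    """Yield one dict per row, converting each field with its schema entry."""
    reader = csv.reader(fileobj, delimiter=delimiter)
    next(reader)  # skip the header row; names come from the schema
    for row in reader:
        yield {name: cvt(field) for (name, cvt), field in zip(schema, row)}


sample = io.StringIO("col1,col2,col3\n1,0,1/97\n2,1.23,4/97\n")
records = list(read_with_schema(sample, SCHEMA))
print(records[0])
```

Because the schema says col2 is a float, the leading "0" is read as 0.0 and the later "1.23" causes no surprise.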
If you do need to auto detect, one approach would be to always read both int-like stuff and float-like stuff in as floats. Then, after you get the array, check over the various columns and, if any have no fractional parts, make a new array where those columns are integers.
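A rough sketch of that second step, in plain Python lists instead of a recarray to keep it self-contained. `demote_whole_columns` is a hypothetical name, and the columns are assumed to have been read with float() already.

```python
def demote_whole_columns(columns):
    """Turn each column that holds only whole numbers into a list of ints."""
    result = []
    for col in columns:
        if all(v.is_integer() for v in col):
            result.append([int(v) for v in col])  # no fractional parts anywhere
        else:
            result.append(list(col))              # keep as floats
    return result


# Every numeric field was read with float() first.
columns = [
    [1.0, 1.0, 2.0],   # whole numbers only: becomes ints
    [0.0, 0.0, 1.23],  # has a fractional part: stays float
]
print(demote_whole_columns(columns))
```

Note that a column written as 2.0, 3.0 in the file would also come out as ints, which matches the point that the two cases cannot be told apart in general.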
tim
Best,
Vincent
On 7/8/07 3:31 PM, "Torgil Svensson" < torgil.svensson@gmail.com> wrote:
Hi
I stumble on these types of problems from time to time so I'm interested in efficient solutions myself.
Do you have a column which starts with something suitable for int on the first row (without decimal separator) but has decimals further down?
This will be a little tricky to support. One solution could be to yield StopIteration, calculate new type-conversion functions and start over iterating over both the old data and the rest of the iterator.
It'd be great if you could try the load_gen_iter.py I've attached to my response to Tim.
Best Regards,
//Torgil
FWIW
n,dt=descr[0]
new_dt=dt.replace('f','i')
descr[0]=(n,new_dt)
data=ra.col1.astype(new_dt)
ra.dtype=N.dtype(descr)
ra.col1=data
//Torgil
> Question: If you do ignore the int's initially, once the rec array is in memory, would there be a quick way to check if the floats could pass as int's? This may seem like a backwards approach but it might be 'safer' if you really want to preserve the int's.
In your case the floats don't pass as ints since you have decimals.
The attached file takes another approach (sorry for the lack of comments). If the conversion fails, the current row is stored and the iterator exits (without setting a 'finished' parameter to true). The program then recalculates the conversion functions and checks for changes. If the changes are supported (i.e., we have a conversion function for old data in the format_changes dictionary) it calls fromiter again with an iterator like this:
def get_data_iterator(row_iter,delim,res):
    for x0,x1,x2,x3,x4,x5 in res['data']:
        x0=float(x0)
        print (x0,x1,x2,x3,x4,x5)
        yield (x0,x1,x2,x3,x4,x5)
    yield (float('2.0'),int('2'),datestr2num('4/97'),float('1.33'),float('2.26'),float('1.23'))
    for row in row_iter:
        x0,x1,x2,x3,x4,x5=row.split(delim)
        try:
            yield (float(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))
        except:
            res['row']=row
            return
    res['finished']=True
res['data'] is the previously converted data. This has the obvious
disadvantage that if only the last row has fractions in a column,
it'll cost double memory. Also if many columns change format at
different places it has to reconvert every time.
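To make the restart idea concrete without fromiter or hard-coded columns, here is a small standard-library sketch. It is not the attached script: `convert_with_restart` is a hypothetical name, only the int-to-float change is supported, and the rows are assumed to be already split into strings.

```python
def convert_with_restart(rows, converters):
    """Convert string rows; when int() fails on a column, promote it to float.

    Rows converted earlier are reconverted after each promotion, which is
    the extra cost described above. Returns (converted_rows, converters).
    """
    converters = list(converters)
    done = []
    for row in rows:
        while True:
            try:
                done.append(tuple(f(x) for f, x in zip(converters, row)))
                break
            except ValueError:
                # work out which int columns rejected this row
                promoted = set()
                for i, (f, x) in enumerate(zip(converters, row)):
                    if f is int:
                        try:
                            int(x)
                        except ValueError:
                            converters[i] = float
                            promoted.add(i)
                if not promoted:
                    # no supported format change explains the failure
                    raise ValueError("cannot convert row %r" % (row,))
                # reconvert the affected columns of everything gathered so far
                done = [tuple(float(v) if i in promoted else v
                              for i, v in enumerate(old)) for old in done]
    return done, converters


rows = [['1', '0'], ['2', '0'], ['2', '1.23']]
data, final = convert_with_restart(rows, [int, int])
print(data, [f.__name__ for f in final])
```

If the fraction only shows up in the last row, every earlier row is rebuilt once, which is the worst case mentioned above.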
I don't recommend this because of the drawbacks and extra complexity.
I think it is best to convert your files (or file generation) so that
float columns are represented with 0.0 instead of 0.
Best Regards,
//Torgil
Thanks for looking into this Torgil! I agree that this is a much more
complicated setup. I'll check if there is anything I can do on the data end.
Otherwise I'll go with Timothy's suggestion and read in numbers as floats
and convert to int later as needed.
Vincent
On 7/8/07, Vincent Nijs wrote:
> Thanks for looking into this Torgil! I agree that this is a much more complicated setup. I'll check if there is anything I can do on the data end. Otherwise I'll go with Timothy's suggestion and read in numbers as floats and convert to int later as needed.
Here is a strategy that should allow auto detection without too much in the way of inefficiency. The basic idea is to convert till you run into a problem, store that data away, and continue the conversion with a new dtype. At the end you assemble all the chunks of data you've accumulated into one large array. It should be reasonably efficient in terms of both memory and speed.
The implementation is a little rough, but it should get the idea across.
tim.hochberg@ieee.org
========================================================================
def find_formats(items, last):
    formats = []
    for i, x in enumerate(items):
        dt, cvt = string_to_dt_cvt(x)
        if last is not None:
            # formats holds (dtype, converter) pairs, so unpack in that order
            last_dt, last_cvt = last[i]
            if last_cvt is float and cvt is int:
                # once a column has been seen as float, keep it float
                dt, cvt = last_dt, last_cvt
        formats.append((dt, cvt))
    return formats

class LoadInfo(object):
    def __init__(self, row0):
        self.done = False
        self.lastcols = None
        self.row0 = row0

def data_iterator(lines, converters, delim, info):
    yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
    try:
        for row in lines:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
    except:
        # conversion failed: remember the offending row so the caller can
        # work out new formats and resume from it
        info.row0 = row
    else:
        info.done = True

def load2(fname, delim=',', has_varnm=True, prn_report=True):
    """
    Loading data from a file using the csv module. Returns a recarray.
    """
    f = open(fname, 'rb')
    if has_varnm:
        varnames = [i.strip() for i in f.next().split(delim)]
    else:
        varnames = None
    info = LoadInfo(f.next())
    chunks = []
    while not info.done:
        row0 = info.row0.split(delim)
        formats = find_formats(row0, info.lastcols)
        info.lastcols = formats  # remember the formats for the next pass
        if varnames is None:
            varnames = ['col%s' % str(i+1) for i, _ in enumerate(formats)]
        descr = []
        conversion_functions = []
        for name, (dtype, cvt_fn) in zip(varnames, formats):
            descr.append((name, dtype))
            conversion_functions.append(cvt_fn)
        chunks.append(N.fromiter(data_iterator(f, conversion_functions, delim, info), descr))
    if len(chunks) > 1:
        # assemble all chunks into one array using the most recent dtype
        n = sum(len(x) for x in chunks)
        data = N.zeros([n], chunks[-1].dtype)
        offset = 0
        for x in chunks:
            delta = len(x)
            data[offset:offset+delta] = x
            offset += delta
    else:
        [data] = chunks
    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        print "\n##########################################\n"
    return data
Cool! Thanks Tim.
Vincent
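Tim's sketch above calls a helper, string_to_dt_cvt, that is not shown in his message. The following is only a guess at what such a helper might look like, written for Python 3 rather than the Python 2 of the thread; the dtype strings ("i8", "f8", "U<n>") and the try-int-then-float order are assumptions, not Tim's actual code.

```python
def string_to_dt_cvt(text):
    """Guess a (dtype string, converter) pair for one field (hypothetical helper)."""
    text = text.strip()
    # Try the narrowest numeric type first, then widen.
    for dt, cvt in (("i8", int), ("f8", float)):
        try:
            cvt(text)
        except ValueError:
            continue
        return dt, cvt
    # Fall back to a fixed-width string sized to this one sample.
    return "U%d" % max(len(text), 1), str


print(string_to_dt_cvt("42"))
print(string_to_dt_cvt("1.5"))
print(string_to_dt_cvt("abc"))
```

Note that the string width is taken from a single sample, so a later, longer value would be truncated unless the width is chosen from the whole column.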
I combined some of the very useful comments/code from Tim and Torgil and came up with the attached program to read csv files and convert the data into a recarray. I couldn't use all of their suggestions because, frankly, I didn't understand all of them :)

The program uses variable names if provided in the csv file and can autodetect data types. However, I also wanted to make it easy to specify data types and/or variable names if so desired. Examples are at the bottom of the file. Comments are very welcome.

Thanks,
Vincent
Nice,
I haven't gone through all details. That's a nice new "missing"
feature, maybe all instances where we can't find a conversion should
be "nan". A few comments:
1. The "load_search" function contains all the memory/performance
overhead that we wanted to avoid with the fromiter function. Does this
mean that you no longer have large text files that change string
representation in the columns (aka "0" floats)?
2. ident=" "*4
This has the same spelling error as in my first compile try .. it was
meant to be "indent"
3. types = list((i,j) for i, j in zip(varnm, types2))
Isn't this the same as "types = zip(varnm, types2)" ?
4. return N.fromiter(iter(reader),dtype = types)
Isn't "reader" an iterator already? What does the "iter()" operator do
in this case?
Best regards,
//Torgil
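Points 3 and 4 above can be checked in a few lines. This is a small sketch in Python 3 (where zip is lazy, so it is wrapped in list()); the variable names follow Torgil's message, but the sample values and the use of csv.reader are assumptions, since the attached program is not shown.

```python
import csv
import io

varnm = ["id", "price"]
types2 = ["i8", "f8"]

# Point 3: the generator expression only re-packs what zip already yields.
verbose = list((i, j) for i, j in zip(varnm, types2))
direct = list(zip(varnm, types2))
same_pairs = verbose == direct

# Point 4: calling the builtin iter() on an iterator returns that same object.
reader = csv.reader(io.StringIO("1,2.5\n2,3.5\n"))
same_object = iter(reader) is reader

print(same_pairs, same_object)
```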
_______________________________________________
Numpydiscussion mailing list
Numpydiscussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpydiscussion
Hi,

I am trying to write a couple of simple functions to (1) save recarrays to an sqlite database and (2) load a recarray from an sqlite database. I am stuck on 2 points and hope there are some people on this list who use sqlite for numpy stuff.

1. How to detect the variable names and types from the sqlite database? I am using:

    conn = sqlite3.connect(fname, detect_types=sqlite3.PARSE_DECLTYPES|sqlite3.PARSE_COLNAMES)

but then how do you access the variable names and types and convert them to numpy types?

2. In saving the recarray to sqlite I need to get data types from data.dtype.descr and transform the names to types that sqlite knows:

    string -> text
    int -> integer
    float -> real

I tried some things like:

    for i in data[0]:
        if type(i) == str

This didn't work because the elements are numpy.strings and I couldn't get the comparison to work. I'd rather use the dtype descriptions directly but couldn't figure out how to do that either.

Any suggestions are very welcome. Thanks!!

Vincent
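One possible way to approach both questions, sketched under assumptions: the helper name sqlite_type, the table name "data" and the sample columns are made up for illustration. For question 2 the dtype's kind code is used instead of comparing element types; for question 1 the column names come from cursor.description and the declared types from SQLite's "PRAGMA table_info".

```python
import sqlite3

import numpy as np


def sqlite_type(dtype):
    """Map a numpy dtype to an sqlite column type (hypothetical helper)."""
    kind = np.dtype(dtype).kind
    if kind in "biu":      # bool, signed int, unsigned int
        return "INTEGER"
    if kind == "f":        # floating point
        return "REAL"
    if kind in "SU":       # byte string, unicode string
        return "TEXT"
    raise TypeError("no sqlite type chosen for dtype %r" % (dtype,))


data = np.array([(1, 2.5, "a"), (2, 3.5, "bb")],
                dtype=[("id", "i8"), ("price", "f8"), ("label", "U2")])

# Question 2: build the column definitions straight from the dtype.
columns = ", ".join("%s %s" % (name, sqlite_type(data.dtype[name]))
                    for name in data.dtype.names)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (%s)" % columns)

# Question 1: column names from the cursor, declared types from the schema.
cur = conn.execute("SELECT * FROM data")
names = [d[0] for d in cur.description]
declared = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(data)")]

print(columns)
print(names)
print(declared)
```

Mapping the declared types back to numpy dtypes would be the reverse lookup; string widths are not stored by sqlite, so they would have to be inferred from the data.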
Hi Torgil,
1. I got an email from Tim about this issue:
"I finally got around to doing some more quantitative comparisons between
your code and the more complicated version that I proposed. The idea behind
my code was to minimize memory usage - I figured that keeping the memory
usage low would make up for any inefficiencies in the conversion process
since it's been my experience that memory bandwidth dominates a lot of
numeric problems as problem sizes get reasonably large. I was mostly wrong.
While it's true that for very large file sizes I can get my code to
outperform yours, in most instances it lags behind. And the range where it
does better is a fairly small range right before the machine dies with a
memory error. So my conclusion is that the extra hoops my code goes through
to avoid allocating extra memory aren't worth it for you to bother with."
The approach in my code is simple and robust to most data issues I could
come up with. It actually will do an appropriate conversion if there are
missing values or ints and floats in the same column. It will select an
appropriate string length as well. It may not be the most memory efficient
setup but given Tim's comments it is a pretty decent solution for the types
of data I have access to.
2. Fixed the spelling error :)
3. I guess that is the same thing. I am not very familiar with zip, izip,
map etc. just yet :) Thanks for the tip!
4. I gave the function that is generated using exec the name iter(). I need
that function to transform the data using the types provided by the user.
Best,
Vincent
 Vincent R. Nijs Assistant Professor of Marketing Kellogg School of Management, Northwestern University 2001 Sheridan Road, Evanston, IL 602082001 Phone: +18474914574 Fax: +18474912498 Email: vnijs@kellogg.northwestern.edu Skype: vincentnijs
Hi,
1. Your code is fast because you convert whole columns at once in
numpy. The first step with the lists is also very fast (python
implements lists as arrays). I like your version; I think it's as fast
as it gets in pure python, and it has to keep only two versions of the
data in memory at once (since the string versions can be garbage
collected).
If memory really is an issue, you have the nice "load_spec" version
and can always convert the files once by iterating over the file twice,
as the attached script does.
4. Okay, that makes sense. I was confused by the fact that your
generated function had the same name as the builtin iter() operator.
//Torgil
Hi again,
On 7/19/07, Torgil Svensson
If memory really is an issue, you have the nice "load_spec" version and can always convert the files once by iterating over the file twice like the attached script does.
I discovered that my script was broken and too complex. The attached
script is much cleaner and has better error messages.
Best regards,
//Torgil
I am interested in using sqlite (or pytables) to store data for scientific research. I wrote the attached test program to save and load a simulated 11x500,000 recarray. Average save and load times are given below (timeit with 20 repetitions). The save time for sqlite is not really fair because I have to delete the data table each time before I create the new one. It is still pretty slow in comparison. Loading the recarray from sqlite is significantly slower than pytables or cPickle. I am hoping there may be more efficient ways to save and load recarrays from/to sqlite than what I am now doing. Note that I infer the variable names and types from the data rather than specifying them manually.

I'd luv to hear from people using sqlite, pytables, and cPickle about their experiences.

saving recarray with cPickle: 1.448568 sec/pass
saving recarray with pytable: 3.437228 sec/pass
saving recarray with sqlite: 193.286204 sec/pass

loading recarray using cPickle: 0.471365 sec/pass
loading recarray with pytable: 0.692838 sec/pass
loading recarray with sqlite: 15.977018 sec/pass

Best,
Vincent
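The attached test program is not reproduced in this archive. For orientation, here is a minimal sketch of one way a structured array can be round-tripped through sqlite, assuming a small two-column table; the table name, column names and the single-transaction executemany call are illustrative choices, not Vincent's code, and no claim is made about how this would change the timings above.

```python
import sqlite3

import numpy as np

dt = [("id", "i8"), ("x", "f8")]
data = np.zeros(1000, dtype=dt)
data["id"] = np.arange(1000)
data["x"] = np.linspace(0.0, 1.0, 1000)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, x REAL)")

# Save: one transaction and one executemany call for all rows.
# tolist() turns the records into plain Python tuples.
with conn:
    conn.executemany("INSERT INTO data VALUES (?, ?)", data.tolist())

# Load: fetch the rows as tuples and let numpy build the record array.
rows = conn.execute("SELECT id, x FROM data ORDER BY id").fetchall()
loaded = np.array(rows, dtype=dt)

print(loaded[:3])
```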
On Thu, Jul 19, 2007 at 09:42:42PM 0500, Vincent Nijs wrote:
I'd luv to hear from people using sqlite, pytables, and cPickle about their experiences.
I was about to point you to this discussion: http://projects.scipy.org/pipermail/scipyuser/2007April/011724.html but I see that you participated in it.

I store data from each of my experimental runs with pytables. What I like about it is the hierarchical organization of the data, which allows me to save a complete description of the experiment, with strings, and extensible data structures. Another thing I like is that I can load this in Matlab (I can provide enhanced scripts for hdf5, if somebody wants them), and I think it is possible to read hdf5 in Origin. I don't use this software, but some colleagues do.

So I think the choice between pytables and cPickle boils down to whether or not you want to share the data with software other than Python.

Gaël
I want that Matlab script! I have colleagues with whom the least common denominator is currently .mat files. I'd be much happier if it was hdf5 files. Can you post it on the scipy wiki cookbook? (Or the pytables wiki?) Cheers! Andrew
On Fri, Jul 20, 2007 at 01:59:13AM 0700, Andrew Straw wrote:
I want that Matlab script!
I knew I really should put these things online. I have been wanting to iron them out a bit, but it has been almost two years since I touched them, so ...

http://scipy.org/Cookbook/hdf5_in_Matlab

Feel free to improve them, and to write similar scripts in Python.

Gaël
Gael Varoquaux (on 2007-07-20 at 11:24:34 +0200) said:
I knew I really should put these things online. I have been wanting to iron them out a bit, but it has been almost two years since I touched them, so ...
Wow, that looks really sweet and simple, useful code. Great!

Ivan Vilata i Balaguer
Cárabos Coop. V.
http://www.carabos.com/
"Enjoy Data"
Gael,
Sounds very interesting! Would you mind sharing an example (with code if
possible) of how you organize your experimental data in pytables. I have
been thinking about how I might organize my data in pytables and would luv
to hear how an experienced user does that.
Given the speed differences it looks like pytables is going to be a better
solution for my needs.
Still curious however ... does no one on this list use (and like) sqlite?
Could anyone suggest any other list where I might find users of python and
sqlite (and numpy)?
Thanks,
Vincent
On 7/20/07, Vincent Nijs wrote:
Could anyone suggest any other list where I might find users of python and sqlite (and numpy)?
You could try the db-sig. You can get to the archives, and I imagine subscribe to it, from: http://www.python.org/community/sigs/current/

I don't know if that'll be helpful for you, but I imagine that they know something about python + sqlite.

tim.hochberg@ieee.org
On Fri, Jul 20, 2007 at 08:35:51AM 0500, Vincent Nijs wrote:
Sounds very interesting! Would you mind sharing an example (with code if possible) of how you organize your experimental data in pytables. I have been thinking about how I might organize my data in pytables and would luv to hear how an experienced user does that.
I can show you the processing code. The experiment I have close to me is run by Matlab; the one that is fully controlled by Python is a continent away.

Actually, I am really lazy, so I am just going to brutally copy the IO module. Something that may be interesting is that the data is saved by the experiment control framework on a computer (called Krubcontrol). This data can then be retrieved using the "fetch_files" Python command, which puts it on the server and logs it into a database-like hash table. When we want to retrieve the data we have a special object, krubdata, which uses some fancy indexing to retrieve by data, or by specifying the keywords.

I am sorry I am not providing the code that writes the hdf5 files; it is an incredibly useless mess, trust me. I would be able to factor out the output code out of the 5K matlab lines. Hopefully you'll be able to get an idea of the structure of the hdf5 files by looking at the code that does the loading. I haven't worked with this data for a while, so I can't tell you

Some of the Python code might be useful to others, especially the hashing and retrieving part.

The reason why I didn't use a relational DB is that I simply don't trust them enough for my precious data.

Gaël
On Friday 20 July 2007 15:35, Vincent Nijs wrote:
Still curious however ... does no one on this list use (and like) sqlite?

Vincent,

First of all, while I'm not a heavy user of relational databases, I've used them as references for benchmarking purposes. Based on that benchmarking experience, I'd say that, for writing, relational databases take a lot of safety measures to ensure that all the data written to disk is safe and that the data relationships don't get broken, and that takes time (a lot of time, in fact). I'm not sure whether some of these safety measures can be relaxed, but even if some relational databases allow this, my feeling (beware, I can be wrong) is that you won't be able to reach cPickle/PyTables speed (cPickle/PyTables do not observe safety measures of that kind because they are not designed for such tasks).

In this sense, the best writing speed that I was able to achieve with Postgres (I don't know whether sqlite supports this) was by simulating that the data comes from a file stream and using the "cursor.copy_from()" method. Using this approach I was able to accelerate the insertion speed by about 10x (if I remember well), but even with this, PyTables can be another 10x faster. You can see an example of usage in the Postgres backend [1] used for the benchmarks comparing PyTables and Postgres speeds.

Regarding reading speed, my digging [2] seems to indicate that the bottleneck here is not related to safety, but to the need of the relational databases' Python APIs to wrap *every* element retrieved from the database in a Python container (int, float, string...). In contrast, PyTables creates an empty recarray as the container for all the retrieved data, and that's very fast compared with the former approach. To somewhat quantify this effect as a function of the size of the dataset retrieved, see figure 14 of [3] (as you can see, the larger the dataset retrieved, the larger the difference in speed).

Incidentally, and as is said there, I'm hoping that NumPy containers will eventually be discovered by the makers of relational database wrappers, so that these wrapping times would be removed completely, but I'm currently not aware of any package taking this approach.

[1] http://www.pytables.org/trac/browser/trunk/bench/postgres_backend.py
[2] http://thread.gmane.org/gmane.comp.python.numeric.general/9704
[3] http://www.carabos.com/docs/OPSIindexes.pdf

Cheers,

Francesc Altet
Cárabos Coop. V.
http://www.carabos.com/
"Enjoy Data"
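To make the reading-side point concrete, here is a small sketch of filling a pre-sized record array directly from an sqlite cursor with numpy.fromiter. The table and column names are invented for the example. It only avoids building an intermediate list of rows: the sqlite3 module still wraps every value in a Python object, so the per-element cost Francesc describes remains.

```python
import sqlite3

import numpy as np

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, x REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(5)])

dt = np.dtype([("id", "i8"), ("x", "f8")])

# Ask for the row count first so fromiter can allocate the array once.
(n_rows,) = conn.execute("SELECT COUNT(*) FROM data").fetchone()

# The cursor yields one tuple per row; fromiter copies each into the array.
cursor = conn.execute("SELECT id, x FROM data ORDER BY id")
table = np.fromiter(cursor, dtype=dt, count=n_rows)

print(table)
```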
Another small note: I'm pretty sure sqlite stores everything as strings. This just plain has to be slower than storing the raw binary representation (and may mean slight differences in fp values on the round trip). HDF is designed for this sort of thing, sqlite is not.

Chris

Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division, NOAA/NOS/OR&R
7600 Sand Point Way NE, Seattle, WA 98115
(206) 5266959 voice, (206) 5266329 fax, (206) 5266317 main reception
Chris.Barker@noaa.gov
On Friday 20 July 2007 20:16, Christopher Barker wrote:
Another small note:
I'm pretty sure sqlite stores everything as strings. This just plain has to be slower than storing the raw binary representation (and may mean for slight differences in fp values on the roundtrip). HDF is designed for this sort of thing, sqlite is not.
Yeah, that was the case with sqlite 2. However, starting with sqlite 3, developers provided the ability to store integer and real numbers in a more compact format [1]. Sqlite 3 is the version included in Python 2.5 (the python version that Vincent was benchmarking), so this shouldn't make a big difference compared with other relational databases. [1] http://www.sqlite.org/datatype3.html Cheers, 
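The storage behaviour of sqlite 3 mentioned above can be observed from Python with SQLite's typeof() function. A minimal sketch, assuming a throwaway in-memory table with an untyped column so that no type conversion is applied on insert:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A column with no declared type stores each value as it was bound.
conn.execute("CREATE TABLE t (v)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(1,), (2.5,), ("abc",), (None,)])

# typeof() reports the storage class sqlite used for each value.
kinds = [row[0] for row in
         conn.execute("SELECT typeof(v) FROM t ORDER BY rowid")]
print(kinds)
```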
FYI
I asked a question about the load and save speed of recarray's using pickle
vs pysqlite on the pysqlite list and got the response linked below. Doesn't
look like sqlite can do much better than what I found.
http://lists.initd.org/pipermail/pysqlite/2007July/001085.html
I also passed on Francesc's idea to use numpy containers in relational
database wrappers such as pysqlite. This is apparently not possible since
"in a relational database you don't know the type of the values in advance.
Some values might be NULL" and "you might even have different types for
the same column".
http://lists.initd.org/pipermail/pysqlite/2007July/001087.html
I would assume the NULL's could be treated as missing values (?) Don't know
about the different types in one column however.
Vincent
Vincent Nijs (on 2007-07-22 at 10:21:18 -0500) said:
[...] I would assume the NULL's could be treated as missing values (?) Don't know about the different types in one column however.
Maybe a masked array would do the trick, with NULL values masked out.

Ivan Vilata i Balaguer
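A minimal sketch of the masked-array idea, assuming a single REAL column with one NULL in it (table and column names are invented). The sqlite3 module returns NULL as None, so the mask can be built from that; the 0.0 placeholder under the mask is an arbitrary choice.

```python
import sqlite3

import numpy as np

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(1.0,), (None,), (3.0,)])

values = [row[0] for row in conn.execute("SELECT x FROM t ORDER BY rowid")]

# NULL arrives as None: mask those positions and store a placeholder there.
mask = [v is None for v in values]
filled = [0.0 if v is None else v for v in values]
column = np.ma.masked_array(filled, mask=mask)

# Reductions skip the masked (NULL) entry.
print(column, column.mean())
```

This handles missing values; it does not address the other objection, that one sqlite column can hold values of different types.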
On Friday 20 July 2007 04:42, Vincent Nijs wrote:
I am interested in using sqlite (or pytables) to store data for scientific research. I wrote the attached test program to save and load a simulated 11x500,000 recarray. Average save and load times are given below (timeit with 20 repetitions). The save time for sqlite is not really fair because I have to delete the data table each time before I create the new one. It is still pretty slow in comparison. Loading the recarray from sqlite is significantly slower than pytables or cPickle. I am hoping there may be more efficient ways to save and load recarrays from/to sqlite than what I am now doing. Note that I infer the variable names and types from the data rather than specifying them manually.
I'd love to hear from people using sqlite, pytables, and cPickle about their experiences.

saving recarray with cPickle: 1.448568 sec/pass
saving recarray with pytable: 3.437228 sec/pass
saving recarray with sqlite: 193.286204 sec/pass

loading recarray using cPickle: 0.471365 sec/pass
loading recarray with pytable: 0.692838 sec/pass
loading recarray with sqlite: 15.977018 sec/pass
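The attached test program is not reproduced in this archive. As a rough sketch of how "sec/pass" figures like these can be obtained, the following times an in-memory pickle round trip with timeit and 20 repetitions. The field names, the small row count, and the use of the Python 3 pickle module (instead of cPickle writing to a file) are all assumptions made for illustration.

```python
import pickle
import timeit

import numpy as np

# Simulated table with 11 float columns; the post used 500,000 rows,
# a much smaller size is used here so that the sketch runs quickly.
n_rows = 1000
rng = np.random.default_rng(0)
data = np.zeros(n_rows, dtype=[("x%d" % i, "f8") for i in range(11)])
for name in data.dtype.names:
    data[name] = rng.random(n_rows)


def save():
    # Serialize the whole array in one call.
    return pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)


blob = save()


def load():
    # Rebuild the array from the serialized bytes.
    return pickle.loads(blob)


repetitions = 20
save_time = timeit.timeit(save, number=repetitions) / repetitions
load_time = timeit.timeit(load, number=repetitions) / repetitions
print("saving recarray with pickle: %f sec/pass" % save_time)
print("loading recarray with pickle: %f sec/pass" % load_time)
```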
For a fairer comparison, and for large amounts of data, you should inform PyTables about the expected number of rows (see [1]) that you will end up feeding into the tables so that it can choose the best chunksize for I/O purposes.

I've redone the benchmarks (the new script is attached) with this 'optimization' on and here are my numbers:

======================================
PyTables version: 2.0
HDF5 version: 1.6.5
NumPy version: 1.0.3
Zlib version: 1.2.3
LZO version: 2.01 (Jun 27 2005)
Python version: 2.5 (r25:51908, Nov 3 2006, 12:01:01)
[GCC 4.0.2 20050901 (prerelease) (SUSE Linux)]
Platform: linux2x86_64
Byteordering: little
======================================
Test saving recarray using cPickle: 0.197113 sec/pass
Test saving recarray with pytables: 0.234442 sec/pass
Test saving recarray with pytables (with zlib): 1.973649 sec/pass
Test saving recarray with pytables (with lzo): 0.925558 sec/pass
Test loading recarray using cPickle: 0.151379 sec/pass
Test loading recarray with pytables: 0.165399 sec/pass
Test loading recarray with pytables (with zlib): 0.553251 sec/pass
Test loading recarray with pytables (with lzo): 0.264417 sec/pass

As you can see, the differences between raw cPickle and PyTables are much smaller than when PyTables is not informed about the total number of rows. In fact, an automatic optimization can easily be done in PyTables so that when the user passes a recarray, the total length of the recarray is compared with the default number of expected rows (currently 10000), and if the former is larger, the length of the recarray is chosen instead.

I have also added the times when using compression, just in case you are interested in using it. Here are the final file sizes:

$ ls -sh data
total 132M
24M datalzo.h5
43M dataNone.h5
43M data.pickle
25M datazlib.h5

Of course, this is using completely random data, but with real data the compression levels are expected to be higher than this.
[1] http://www.pytables.org/docs/manual/ch05.html#expectedRowsOptim

Cheers,

Francesc Altet
Cárabos Coop. V.
http://www.carabos.com/
Enjoy Data
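The expected-rows hint and the automatic choice described above might look roughly like this. It is a sketch, not the script attached to the message: `pick_expected_rows` and `save_table` are hypothetical helper names, and the snake_case calls (`tables.open_file`, `create_table` with `obj=` and `expectedrows=`) assume a current PyTables release rather than the 2.0 version benchmarked here.

```python
import numpy as np

# Default taken from the message above ("currently 10000").
DEFAULT_EXPECTED_ROWS = 10000


def pick_expected_rows(data, default=DEFAULT_EXPECTED_ROWS):
    """Use the array length as the hint when it exceeds the default."""
    return max(len(data), default)


def save_table(path, data, name="data"):
    """Write a structured array to an HDF5 table with a row-count hint."""
    import tables  # PyTables, imported lazily so the helper above stands alone

    with tables.open_file(path, mode="w") as h5file:
        h5file.create_table("/", name, obj=data,
                            expectedrows=pick_expected_rows(data))


# Example input shaped like the benchmark data: 11 float columns.
sample = np.zeros(500000, dtype=[("x%d" % i, "f8") for i in range(11)])
hint = pick_expected_rows(sample)
```

Calling `save_table("data.h5", sample)` would then pass the hint through, provided PyTables is installed.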
Thanks Francesc!
That does work much better:
======================================
PyTables version: 2.0
HDF5 version: 1.6.5
NumPy version: 1.0.4.dev3852
Zlib version: 1.2.3
BZIP2 version: 1.0.2 (30Dec2001)
Python version: 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)]
Platform: darwinPower Macintosh
Byteordering: big
======================================
Test saving recarray using cPickle: 1.620880 sec/pass
Test saving recarray with pytables: 2.074591 sec/pass
Test saving recarray with pytables (with zlib): 14.320498 sec/pass
Test loading recarray using cPickle: 1.023015 sec/pass
Test loading recarray with pytables: 0.882411 sec/pass
Test loading recarray with pytables (with zlib): 3.692698 sec/pass
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 602082001
Phone: +18474914574
Fax: +18474912498
Email: vnijs@kellogg.northwestern.edu
Skype: vincentnijs
Elegant solution. Very readable and takes care of row0 nicely.
I want to point out that this is much more efficient than my version
for random/late string representation changes throughout the
conversion, but it suffers from a 2*n memory footprint and large block
copying if the string rep change arrives very early on huge datasets.
I think we can't have the best of both, and Tim's solution is better in
the general case.
Maybe "use one_alt if rownumber < xxx else use other_alt" can
fine-tune performance for some cases. But even then, with many cols,
it's nearly impossible to know.
//Torgil
On 7/9/07, Timothy Hochberg wrote:
On 7/8/07, Vincent Nijs wrote:
Thanks for looking into this Torgil! I agree that this is a much more complicated setup. I'll check if there is anything I can do on the data end. Otherwise I'll go with Timothy's suggestion and read in numbers as floats and convert to int later as needed.
Here is a strategy that should allow auto detection without too much in the way of inefficiency. The basic idea is to convert till you run into a problem, store that data away, and continue the conversion with a new dtype. At the end you assemble all the chunks of data you've accumulated into one large array. It should be reasonably efficient in terms of both memory and speed.
The implementation is a little rough, but it should get the idea across.
tim.hochberg@ieee.org
========================================================================
import numpy as N

# string_to_dt_cvt() is not defined in this message; it has to map a sample
# string to a (dtype, converter) pair.

def find_formats(items, last):
    formats = []
    for i, x in enumerate(items):
        dt, cvt = string_to_dt_cvt(x)
        if last is not None:
            # entries are stored as (dtype, converter), see append() below
            last_dt, last_cvt = last[i]
            if last_cvt is float and cvt is int:
                # a column that was already float stays float
                dt, cvt = last_dt, float
        formats.append((dt, cvt))
    return formats

class LoadInfo(object):
    def __init__(self, row0):
        self.done = False
        self.lastcols = None
        self.row0 = row0

def data_iterator(lines, converters, delim, info):
    yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
    try:
        for row in lines:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
    except ValueError:
        # conversion failed: remember the offending row so that the next
        # chunk starts with it
        info.row0 = row
    else:
        info.done = True

def load2(fname, delim=',', has_varnm=True, prn_report=True):
    """
    Loading data from a delimited text file. Returns a recarray.
    """
    f = open(fname, 'rb')

    if has_varnm:
        varnames = [i.strip() for i in f.next().split(delim)]
    else:
        varnames = None

    info = LoadInfo(f.next())
    chunks = []

    while not info.done:
        row0 = info.row0.split(delim)
        formats = find_formats(row0, info.lastcols)
        info.lastcols = formats  # remember the formats for the next restart
        if varnames is None:
            varnames = ['col%s' % (i + 1) for i, _ in enumerate(formats)]
        descr = []
        conversion_functions = []
        for name, (dtype, cvt_fn) in zip(varnames, formats):
            descr.append((name, dtype))
            conversion_functions.append(cvt_fn)

        chunks.append(N.fromiter(data_iterator(f, conversion_functions,
                                               delim, info), descr))

    if len(chunks) > 1:
        n = sum(len(x) for x in chunks)
        # the last chunk carries the most general dtype
        data = N.zeros([n], chunks[-1].dtype)
        offset = 0
        for x in chunks:
            delta = len(x)
            data[offset:offset+delta] = x
            offset += delta
    else:
        [data] = chunks

    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        print "\n##########################################\n"

    return data
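For readers on a current Python and NumPy, the convert-until-failure strategy can be sketched in a self-contained form as follows. This is not Tim's code: the names `LADDER`, `level_for`, `convert_row`, `make_chunk` and `load_chunked` are invented for illustration, the int/float/str ladder stands in for the undefined `string_to_dt_cvt`, and the 32-character string width is an arbitrary assumption.

```python
import numpy as np

# Converter ladder, narrowest first. The "U32" width is an arbitrary choice.
LADDER = [(int, "i8"), (float, "f8"), (str, "U32")]


def level_for(text):
    """Return the index of the narrowest converter that accepts `text`."""
    for level, (convert, _) in enumerate(LADDER):
        try:
            convert(text)
            return level
        except ValueError:
            pass
    raise AssertionError("unreachable: str() accepts any text")


def convert_row(fields, levels):
    """Convert one row of strings with the current per-column converters."""
    return tuple(LADDER[lv][0](f) for lv, f in zip(levels, fields))


def make_chunk(rows, names, levels):
    """Pack the rows converted so far into one structured array."""
    dtype = [(name, LADDER[lv][1]) for name, lv in zip(names, levels)]
    return np.array(rows, dtype=dtype)


def load_chunked(lines, names, delim=","):
    levels, rows, chunks = None, [], []
    for line in lines:
        fields = line.strip().split(delim)
        if levels is None:
            # Guess the column types from the first data row.
            levels = [level_for(f) for f in fields]
        try:
            row = convert_row(fields, levels)
        except ValueError:
            # Format change: store what we have, then widen the columns.
            chunks.append(make_chunk(rows, names, levels))
            rows = []
            levels = [max(lv, level_for(f)) for lv, f in zip(levels, fields)]
            row = convert_row(fields, levels)
        rows.append(row)
    if levels is None:
        raise ValueError("no data rows")
    chunks.append(make_chunk(rows, names, levels))
    # The last chunk has the widest dtype; cast the earlier chunks to it.
    final = chunks[-1].dtype
    return np.concatenate([chunk.astype(final) for chunk in chunks])


data = load_chunked(["1,2", "3,4", "5.5,6", "7,8"], ["a", "b"])
```

In this example the third line forces column `a` from int to float, so the first two rows are stored as one chunk and cast to the final dtype during assembly. The temporary copy made at the end is the 2*n memory cost mentioned in the reply.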
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
On 7/9/07, Torgil Svensson wrote:
Elegant solution. Very readable and takes care of row0 nicely.

I want to point out that this is much more efficient than my version for random/late string representation changes throughout the conversion, but it suffers from a 2*n memory footprint and large block copying if the string rep change arrives very early on huge datasets.

Yep.

I think we can't have the best of both, and Tim's solution is better in the general case.

It probably would not be hard to do a hybrid version. One issue is that one doesn't, in general, know the size of the dataset in advance, so you'd have to use an absolute criterion (less than 100 lines) instead of a relative criterion (less than 20% done). I suppose you could stat the file or something, but that seems like overkill.

Maybe "use one_alt if rownumber < xxx else use other_alt" can fine-tune performance for some cases. But even then, with many cols, it's nearly impossible to know.

That sounds sensible. I have an interesting thought on how to do this that's a bit hard to describe. I'll try to throw it together and post another version today or tomorrow.

//Torgil
tim.hochberg@ieee.org
On 7/9/07, Timothy Hochberg wrote:
On 7/9/07, Torgil Svensson wrote:
Elegant solution. Very readable and takes care of row0 nicely.

I want to point out that this is much more efficient than my version for random/late string representation changes throughout the conversion, but it suffers from a 2*n memory footprint and large block copying if the string rep change arrives very early on huge datasets.

Yep.

I think we can't have the best of both, and Tim's solution is better in the general case.

It probably would not be hard to do a hybrid version. One issue is that one doesn't, in general, know the size of the dataset in advance, so you'd have to use an absolute criterion (less than 100 lines) instead of a relative criterion (less than 20% done). I suppose you could stat the file or something, but that seems like overkill.

Maybe "use one_alt if rownumber < xxx else use other_alt" can fine-tune performance for some cases. But even then, with many cols, it's nearly impossible to know.

That sounds sensible. I have an interesting thought on how to do this that's a bit hard to describe. I'll try to throw it together and post another version today or tomorrow.
OK, as promised, here's an approach that rebuilds the array if the format changes, as long as fewer than 'restart_length' lines have been processed. Otherwise, it uses the old strategy. Perhaps not the most efficient way, but it reuses what I'd already written with minimal changes. It's still pretty rough; once again I didn't bother to polish it.

import numpy as N

# string_to_dt_cvt() is not defined in this message; it has to map a sample
# string to a (dtype, converter) pair.

def find_formats(items, last):
    formats = []
    for i, x in enumerate(items):
        dt, cvt = string_to_dt_cvt(x)
        if last is not None:
            # entries are stored as (dtype, converter), see append() below
            last_dt, last_cvt = last[i]
            if last_cvt is float and cvt is int:
                # a column that was already float stays float
                dt, cvt = last_dt, float
        formats.append((dt, cvt))
    return formats

class LoadInfo(object):
    def __init__(self, row0):
        self.done = False
        self.lastcols = None
        self.row0 = row0
        self.predata = ()

def data_iterator(lines, converters, delim, info):
    # replay the rows of a short first chunk that is being rebuilt
    for x in info.predata:
        yield x
    info.predata = ()
    yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
    try:
        for row in lines:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
    except ValueError:
        # conversion failed: remember the offending row so that the next
        # chunk starts with it
        info.row0 = row
    else:
        info.done = True

def load2(fname, delim=',', has_varnm=True, prn_report=True,
          restart_length=20):
    """
    Loading data from a delimited text file. Returns a recarray.
    """
    f = open(fname, 'rb')

    if has_varnm:
        varnames = [i.strip() for i in f.next().split(delim)]
    else:
        varnames = None

    info = LoadInfo(f.next())
    chunks = []

    while not info.done:
        row0 = info.row0.split(delim)
        formats = find_formats(row0, info.lastcols)
        info.lastcols = formats  # remember the formats for the next restart
        if varnames is None:
            varnames = ['col%s' % (i + 1) for i, _ in enumerate(formats)]
        descr = []
        conversion_functions = []
        for name, (dtype, cvt_fn) in zip(varnames, formats):
            descr.append((name, dtype))
            conversion_functions.append(cvt_fn)

        if len(chunks) == 1 and len(chunks[0]) < restart_length:
            # only a few rows so far: convert them and start over
            info.predata = chunks[0].astype(descr)
            chunks = []

        chunks.append(N.fromiter(data_iterator(f, conversion_functions,
                                               delim, info), descr))

    if len(chunks) > 1:
        n = sum(len(x) for x in chunks)
        # the last chunk carries the most general dtype
        data = N.zeros([n], chunks[-1].dtype)
        offset = 0
        for x in chunks:
            delta = len(x)
            data[offset:offset+delta] = x
            offset += delta
    else:
        [data] = chunks

    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        print "\n##########################################\n"

    return data
participants (11)

Andrew Straw

Christopher Barker

Francesc Altet

Gael Varoquaux

Ivan Vilata i Balaguer

John Hunter

Timothy Hochberg

Tom Denniston

Torgil Svensson

Travis Oliphant

Vincent Nijs