segfault caused by incorrect Py_DECREF in ufunc
Below is the code from around line 900 of ufuncobject.c (http://svn.scipy.org/svn/numpy/trunk/numpy/core/src/ufuncobject.c). There is a decref, marked with ">>>" below, that is incorrect.

    if (userdef > 0) {
        PyObject *key, *obj;
        int ret;
        obj = NULL;
        key = PyInt_FromLong((long) userdef);
        if (key == NULL) return -1;
        obj = PyDict_GetItem(self->userloops, key);
        Py_DECREF(key);
        if (obj == NULL) {
            PyErr_SetString(PyExc_TypeError,
                            "user-defined type used in ufunc"
                            " with no registered loops");
            return -1;
        }
        /* extract the correct function data and argtypes */
        ret = _find_matching_userloop(obj, arg_types, scalars,
                                      function, data,
                                      self->nargs, self->nin);
    >>> Py_DECREF(obj);
        return ret;
    }

As per the Python documentation for PyDict_GetItem (http://docs.python.org/api/dictObjects.html):

    PyObject* PyDict_GetItem(PyObject *p, PyObject *key)
    Return value: Borrowed reference.
    Return the object from dictionary p which has a key key. Return NULL if the
    key key is not present, but without setting an exception.

PyDict_GetItem returns a borrowed reference. Therefore this code does not own the object that obj points to and should not decref it. Simply removing the Py_DECREF(obj) line gets rid of the segfault.

I was wondering if someone could confirm that my interpretation is correct and remove the line. I don't have access to the svn, nor do I know how to change it.

Most people do not see this problem because it only affects user-defined types.

--Tom
You are right on with your analysis. Thank you for the test, check, and fix. I've changed it in SVN. Best regards, -Travis
Thanks!
I wrote the attached (small) program to read in a text/csv file with different data types and convert it into a recarray without having to pre-specify the dtypes or variable names. I am just too lazy to type in stuff like that :) The supported types are int, float, dates, and strings. It works pretty well, but it is not (yet) as fast as I would like, so I was wondering if any of the numpy experts on this list might have some suggestions on how to speed it up. I need to read 500MB-1GB files, so speed is important for me. Thanks, Vincent
In matplotlib.mlab svn, there is a function csv2rec that does the same. You may want to compare implementations in case we can fruitfully cross-pollinate them. In the examples directory, there is an example script examples/loadrec.py
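For readers unfamiliar with it, a minimal csv2rec call might look like the sketch below; the file name is invented here, and the exact keyword arguments may differ between matplotlib versions.

    from matplotlib.mlab import csv2rec

    # csv2rec inspects the columns, picks a dtype for each, and returns a record array
    r = csv2rec('data.csv', delimiter=',')
    print(r.dtype.names)   # column names taken from the header row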
Thanks for the reference John! csv2rec is about 30% faster than my code on the same data. If I read the code in csv2rec correctly, it converts the data as it is being read, using the csv module. My setup reads the whole dataset into an array of strings and then converts the columns as appropriate. Best, Vincent
Given that both your script and the mlab version preload the whole file before calling the numpy constructor, I'm curious how that compares in speed to using numpy's fromiter function on your data. Using fromiter should improve on memory usage (~50%?). The drawback is for string columns, where we no longer know the width of the largest item. I made it fall back to "object" in this case. Attached is a fromiter version of your script. Possible speedups could be had by trying different approaches to the "convert_row" function, for example using "zip" or "enumerate" instead of "izip".

Best Regards,

//Torgil
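Torgil's attachment is not included in the archive. A minimal sketch of the fromiter idea (assuming, purely for illustration, that every column is a float and that the first line holds the column names) could look like this:

    import numpy as np

    def load_csv_fromiter(fname, delim=','):
        f = open(fname)
        names = f.readline().strip().split(delim)
        # illustrative assumption: every column is a float; real code would pick
        # a converter per column after inspecting some of the rows
        dtype = np.dtype([(name, np.float64) for name in names])
        rows = (tuple(float(x) for x in line.split(delim))
                for line in f if line.strip())
        # fromiter fills the record array incrementally, so the file is never
        # held in memory as one big list of strings
        return np.fromiter(rows, dtype=dtype)

The memory win comes from only one text row being alive at a time; the cost is that string columns need a declared width (or an object dtype), as noted above.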
I suspect that you'd do better here if you removed a bunch of layers from the conversion functions. Right now it looks like: imap->chain->convert_row->tuple->generator->izip. That's five levels deep, and Python functions are reasonably expensive. I would try to be a lot less clever and do something like:

    def data_iterator(row_iter, delim):
        row0 = row_iter.next().split(delim)
        converters = find_formats(row0)  # left as an exercise
        yield tuple(f(x) for f, x in zip(converters, row0))
        for row in row_iter:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))

That's just a sketch and I haven't timed it, but it cuts a few levels out of the call chain, so it has a reasonable chance of being faster. If you wanted to be really clever, you could use some exec magic after you figure out the conversion functions to compile a special function that generates the tuples directly, without any use of tuple or zip. I don't have time to work through the details right now, but the code you would compile would end up looking like this:

    for (x0, x1, x2) in row_iter:
        yield (int(x0), float(x1), float(x2))

Here we've assumed that find_formats determined that there are three fields, an int and two floats. Once you have this info you can build an appropriate function and exec it. This would cut another couple of levels out of the call chain. Again, I haven't timed it, or tried it, but it looks like it would be fun to try.

-tim
That sounds sane. I've maybe been attracted to bad habits here and gotten away with it since I'm very I/O-bound in these cases. My main objective has been reducing the memory footprint to reduce swapping.
Thank you for the lesson! Great tip. This opens up a variety of new coding options. I've made an attempt at the fun part. Attached is a version that generates the following generator code for Vincent's __name__ == '__main__' code:

    def get_data_iterator(row_iter,delim):
        yield (int('1'),int('3'),datestr2num('1/97'),float('1.12'),float('2.11'),float('1.2'))
        for row in row_iter:
            x0,x1,x2,x3,x4,x5=row.split(delim)
            yield (int(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))

Best Regards,

//Torgil
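The attached script itself is not part of this archive. A rough sketch of how such specialized source can be assembled and compiled (all names here are illustrative, not Torgil's actual code) might be:

    def make_row_converter(converters, delim=','):
        # converters is a list of callables, e.g. [int, float, float]
        # (the sketch assumes at least two columns)
        n = len(converters)
        args = ','.join('x%d' % i for i in range(n))
        calls = ','.join('c%d(x%d)' % (i, i) for i in range(n))
        src = ("def convert_rows(row_iter):\n"
               "    for row in row_iter:\n"
               "        %s = row.split(%r)\n"
               "        yield (%s,)\n" % (args, delim, calls))
        # bind each converter to the name c0, c1, ... used in the generated source
        namespace = dict(('c%d' % i, c) for i, c in enumerate(converters))
        exec(src, namespace)
        return namespace['convert_rows']

A converter built this way can be fed straight to numpy.fromiter together with the detected dtype, which removes the per-row tuple/zip overhead Tim described.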
I am not (yet) very familiar with much of the functionality introduced in your script, Torgil (izip, imap, etc.), but I really appreciate you taking the time to look at this!

The program stopped with the following error:

    File "load_iter.py", line 48, in <genexpr>
        convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
    ValueError: invalid literal for int() with base 10: '2174.875'

A lot of the data I use can have a column with a set of ints (e.g., 0's), but then the rest of that same column could be floats. I guess finding the right conversion function is the tricky part. I was thinking about sampling every, say, 10th observation to test which function to use. Not sure how that would work, however.

If I ignore the option of an int (i.e., everything is a float, date, or string), then your script is about twice as fast as mine!!

Question: If you do ignore the ints initially, once the rec array is in memory, would there be a quick way to check if the floats could pass as ints? This may seem like a backwards approach, but it might be 'safer' if you really want to preserve the ints.

Thanks again!

Vincent
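Vincent's sampling idea might look roughly like the sketch below (purely illustrative; dates and missing values are ignored). Note that any sampling scheme can still miss a late format change, which is the case Torgil addresses later in the thread.

    def guess_converter(samples):
        # try the strictest converter first; fall back to leaving the field as text
        for convert in (int, float):
            try:
                for value in samples:
                    convert(value)
            except ValueError:
                continue
            return convert
        return str

    column = ['0', '0', '1.23', '1.26', '0']
    print(guess_converter(column[::2]))   # every other value is sampled; float here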
Hi,

I stumble on these types of problems from time to time, so I'm interested in efficient solutions myself. Do you have a column which starts with something suitable for int on the first row (without a decimal separator) but has decimals further down? That will be a little tricky to support. One solution could be to yield StopIteration, calculate new type-conversion functions, and start over, iterating over both the old data and the rest of the iterator.

It'd be great if you could try the load_gen_iter.py I've attached to my response to Tim.

Best Regards,

//Torgil
Torgil,

The function seems to work well and is slightly faster than your previous version (about 1/6th faster).

Yes, I do have columns that start with what looks like ints and then turn out to be floats. Something like below (col6):

    data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
            ['1','3','1/97','1.12','2.11','0'],
            ['1','2','3/97','1.21','3.12','0'],
            ['2','1','2/97','1.12','2.11','0'],
            ['2','2','4/97','1.33','2.26','1.23'],
            ['2','2','5/97','1.73','2.42','1.26']]

I think what your function assumes is that the 1st element will be the appropriate type. That may not hold if you have missing values or 'mixed types'.

Best,

Vincent
Vincent,

Do you need to auto-detect the column types? Things get a lot simpler if you have some known schema for each file; then you can simply pass that to some reader function. It's also more robust, since there's no way in general to differentiate a column of integers from a column of floats with no decimal part.

If you do need to auto-detect, one approach would be to always read both int-like stuff and float-like stuff in as floats. Then, after you get the array, check over the various columns, and if any have no fractional parts, make a new array where those columns are integers.

-tim
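A rough sketch of that post-hoc check (the function and column handling here are an illustration, not code from the thread):

    import numpy as np

    def with_int_columns(ra):
        # find float columns whose values all have zero fractional part
        int_cols = [name for name in ra.dtype.names
                    if ra[name].dtype.kind == 'f'
                    and np.all(ra[name] == np.floor(ra[name]))]
        # build a new record array where those columns become 64-bit integers
        descr = [(name, 'i8' if name in int_cols else ra.dtype[name].str)
                 for name in ra.dtype.names]
        out = np.empty(ra.shape, dtype=descr)
        for name in ra.dtype.names:
            out[name] = ra[name]   # per-field assignment casts float -> int where needed
        return out

This trades a second pass over the data (and a temporary copy) for not having to guess integer columns up front.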
Tim,

I do want to auto-detect. Reading numbers in as floats is probably not a huge penalty.

Is there an easy way you know of to change the type of one column in a recarray? I tried this:

    ra.col1 = ra.col1.astype('i')

but that didn't seem to work. I assume that means you would have to create a new array from the old one with an updated dtype list.

Thanks,

Vincent
FWIW, something along these lines (descr here is the dtype description list, e.g. descr = list(ra.dtype.descr), and N is numpy):

    n, dt = descr[0]               # name and dtype string of the column, e.g. ('col1', '<f8')
    new_dt = dt.replace('f', 'i')  # '<f8' -> '<i8': same itemsize, so the buffer can be reused
    descr[0] = (n, new_dt)
    data = ra.col1.astype(new_dt)  # convert the column values before reinterpreting
    ra.dtype = N.dtype(descr)      # reinterpret the record buffer with the new dtype
    ra.col1 = data                 # write the converted values back into the column
//Torgil
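For concreteness, a tiny self-contained run of that idea (the array and column names are invented here, not taken from the thread):

    import numpy as N

    ra = N.array([(0.0, 1.5), (2.0, 3.5)],
                 dtype=[('col1', '<f8'), ('col2', '<f8')]).view(N.recarray)
    descr = list(ra.dtype.descr)
    n, dt = descr[0]
    descr[0] = (n, dt.replace('f', 'i'))   # '<f8' -> '<i8', same 8-byte itemsize
    data = ra.col1.astype(descr[0][1])     # integer copy of the column values
    ra.dtype = N.dtype(descr)              # in-place reinterpretation of the records
    ra.col1 = data                         # overwrite the reinterpreted bytes with real ints
    print(ra.col1)                         # col1 is now an integer column

The in-place dtype swap only works because the float and integer columns have the same itemsize; otherwise the safer route is to build a fresh array with the new dtype, as in the earlier sketch.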
> Question: If you do ignore the ints initially, once the rec array is in memory, would there be a quick way to check if the floats could pass as ints? This may seem like a backwards approach, but it might be 'safer' if you really want to preserve the ints.
In your case the floats don't pass as ints, since you have decimals. The attached file takes another approach (sorry for the lack of comments). If the conversion fails, the current row is stored and the iterator exits (without setting a 'finished' parameter to true). The program then re-calculates the conversion functions and checks for changes. If the changes are supported (i.e., we have a conversion function for the old data in the format_changes dictionary), it calls fromiter again with an iterator like this:

    def get_data_iterator(row_iter,delim,res):
        for x0,x1,x2,x3,x4,x5 in res['data']:
            x0=float(x0)
            print (x0,x1,x2,x3,x4,x5)
            yield (x0,x1,x2,x3,x4,x5)
        yield (float('2.0'),int('2'),datestr2num('4/97'),float('1.33'),float('2.26'),float('1.23'))
        for row in row_iter:
            x0,x1,x2,x3,x4,x5=row.split(delim)
            try:
                yield (float(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))
            except:
                res['row']=row
                return
        res['finished']=True

res['data'] is the previously converted data. This has the obvious disadvantage that if only the last row has fractions in a column, it'll cost double the memory. Also, if many columns change format at different places, it has to re-convert every time.

I don't recommend this because of the drawbacks and extra complexity. I think it is best to convert your files (or the file generation) so that float columns are represented with 0.0 instead of 0.

Best Regards,

//Torgil
Thanks for looking into this Torgil! I agree that this is a much more complicated setup. I'll check if there is anything I can do on the data end. Otherwise I'll go with Timothy's suggestion and read in numbers as floats and convert to int later as needed.

Vincent
Here is a strategy that should allow auto detection without too much in the way of inefficiency. The basic idea is to convert till you run into a problem, store that data away, and continue the conversion with a new dtype. At the end you assemble all the chunks of data you've accumulated into one large array. It should be reasonably efficient in terms of both memory and speed.

The implementation is a little rough, but it should get the idea across.

-- tim.hochberg@ieee.org

========================================================================

def find_formats(items, last):
    formats = []
    for i, x in enumerate(items):
        dt, cvt = string_to_dt_cvt(x)
        if last is not None:
            last_cvt, last_dt = last[i]
            if last_cvt is float and cvt is int:
                cvt = float
        formats.append((dt, cvt))
    return formats

class LoadInfo(object):
    def __init__(self, row0):
        self.done = False
        self.lastcols = None
        self.row0 = row0

def data_iterator(lines, converters, delim, info):
    yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
    try:
        for row in lines:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
    except:
        info.row0 = row
    else:
        info.done = True

def load2(fname, delim=',', has_varnm=True, prn_report=True):
    """
    Loading data from a file using the csv module. Returns a recarray.
    """
    f = open(fname, 'rb')

    if has_varnm:
        varnames = [i.strip() for i in f.next().split(delim)]
    else:
        varnames = None

    info = LoadInfo(f.next())
    chunks = []

    while not info.done:
        row0 = info.row0.split(delim)
        formats = find_formats(row0, info.lastcols)
        if varnames is None:
            varnames = ['col%s' % str(i+1) for i, _ in enumerate(formats)]
        descr = []
        conversion_functions = []
        for name, (dtype, cvt_fn) in zip(varnames, formats):
            descr.append((name, dtype))
            conversion_functions.append(cvt_fn)

        chunks.append(N.fromiter(data_iterator(f, conversion_functions, delim, info), descr))

    if len(chunks) > 1:
        n = sum(len(x) for x in chunks)
        data = N.zeros([n], chunks[-1].dtype)
        offset = 0
        for x in chunks:
            delta = len(x)
            data[offset:offset+delta] = x
            offset += delta
    else:
        [data] = chunks

    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        print "\n##########################################\n"

    return data
Cool! Thanks Tim.

Vincent
I combined some of the very useful comments/code from Tim and Torgil and came up with the attached program to read csv files and convert the data into a recarray. I couldn't use all of their suggestions because, frankly, I didn't understand all of them :)

The program uses variable names if provided in the csv file and can auto-detect data types. However, I also wanted to make it easy to specify data types and/or variable names if so desired. Examples are at the bottom of the file. Comments are very welcome.

Thanks,

Vincent
Nice,

I haven't gone through all details. That's a nice new "missing" feature; maybe all instances where we can't find a conversion should be "nan". A few comments:

1. The "load_search" function contains all the memory/performance overhead that we wanted to avoid with the fromiter function. Does this mean that you no longer have large text files that change string representation in the columns (aka "0" floats)?

2. ident=" "*4
This has the same spelling error as in my first compile try .. it was meant to be "indent".

3. types = list((i,j) for i, j in zip(varnm, types2))
Isn't this the same as "types = zip(varnm, types2)"?

4. return N.fromiter(iter(reader), dtype=types)
Isn't "reader" an iterator already? What does the "iter()" operator do in this case?

Best regards,

//Torgil
Hi,

I am trying to write a couple of simple functions to (1) save recarrays to an sqlite database and (2) load a recarray from an sqlite database. I am stuck on 2 points and hope there are some people on this list that use sqlite for numpy stuff.

1. How to detect the variable names and types from the sqlite database? I am using:

conn = sqlite3.connect(fname, detect_types=sqlite3.PARSE_DECLTYPES|sqlite3.PARSE_COLNAMES)

but then how do you access the variable names and types and convert them to numpy types?

2. In saving the recarray to sqlite I need to get the data types from data.dtype.descr and transform the names to types that sqlite knows:

string --> text
int --> integer
float --> real

I tried some things like:

for i in data[0]:
    if type(i) == str

This didn't work because the elements are numpy strings and I couldn't get the comparison to work. I'd rather use the dtype descriptions directly but couldn't figure out how to do that either.

Any suggestions are very welcome.

Thanks!!

Vincent
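For what it's worth, a rough sketch of how both points could be approached with the standard sqlite3 module (table and column names are invented, and the dtype-kind mapping is only one possible choice):

import sqlite3
import numpy as N

# point 2: map numpy dtype kinds to sqlite column types
def sqlite_type(numpy_type):
    kind = N.dtype(numpy_type).kind
    if kind in 'iu':
        return 'integer'
    if kind == 'f':
        return 'real'
    return 'text'                      # strings and everything else

# e.g. data.dtype.descr == [('price', '<f8'), ('label', '|S10')]
# col_types = [sqlite_type(t) for name, t in data.dtype.descr]

# point 1: read column names and declared types back from an existing table
conn = sqlite3.connect('data.db')
info = conn.execute("PRAGMA table_info(mytable)").fetchall()
names = [row[1] for row in info]       # column names
decl = [row[2] for row in info]        # declared types, e.g. 'real', 'integer'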
Hi Torgil,

1. I got an email from Tim about this issue:

"I finally got around to doing some more quantitative comparisons between your code and the more complicated version that I proposed. The idea behind my code was to minimize memory usage -- I figured that keeping the memory usage low would make up for any inefficiencies in the conversion process since it's been my experience that memory bandwidth dominates a lot of numeric problems as problem sizes get reasonably large. I was mostly wrong. While it's true that for very large file sizes I can get my code to outperform yours, in most instances it lags behind. And the range where it does better is a fairly small range right before the machine dies with a memory error. So my conclusion is that the extra hoops my code goes through to avoid allocating extra memory isn't worth it for you to bother with."

The approach in my code is simple and robust to most data issues I could come up with. It actually will do an appropriate conversion if there are missing values or ints and floats in the same column. It will select an appropriate string length as well. It may not be the most memory efficient setup, but given Tim's comments it is a pretty decent solution for the types of data I have access to.

2. Fixed the spelling error :)

3. I guess that is the same thing. I am not very familiar with zip, izip, map etc. just yet :) Thanks for the tip!

4. I called the function generated using exec, iter(). I need that function to transform the data using the types provided by the user.

Best,

Vincent
Hi,

1. Your code is fast because you convert whole columns at once in numpy. The first step with the lists is also very fast (python implements lists as arrays). I like your version; I think it's as fast as it gets in pure python, and it has to keep only two versions of the data in memory at once (since the string versions can be garbage collected). If memory really is an issue, you have the nice "load_spec" version and can always convert the files once by iterating over the file twice, like the attached script does.

4. Okay, that makes sense. I was confused by the fact that your generated function had the same name as the builtin iter() operator.

//Torgil
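Torgil's attached script isn't included in the archive; purely as an illustrative sketch of a two-pass approach along those lines (int/float columns only, invented names, Python 2 style, not his actual code):

import numpy as N

def widen(old, value):
    # keep a column typed as int until a float shows up
    try:
        int(value)
        return old or int
    except ValueError:
        float(value)                 # a genuinely bad value will still raise
        return float

def two_pass_load(fname, delim=','):
    # pass 1: scan the file once just to settle on int vs float per column
    f = open(fname, 'rb')
    names = [c.strip() for c in f.next().split(delim)]
    types = [None] * len(names)
    for line in f:
        types = [widen(t, x) for t, x in zip(types, line.rstrip('\r\n').split(delim))]
    # pass 2: convert with the now-fixed set of converters
    f = open(fname, 'rb')
    f.next()
    rows = (tuple(t(x) for t, x in zip(types, line.rstrip('\r\n').split(delim))) for line in f)
    return N.fromiter(rows, dtype=zip(names, types))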
Hi again, On 7/19/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
If memory really is an issue, you have the nice "load_spec" version and can always convert the files once by iterating over the file twice like the attached script does.
I discovered that my script was broken and too complex. The attached script is much cleaner and has better error messages.

Best regards,

//Torgil
I am interested in using sqlite (or pytables) to store data for scientific research. I wrote the attached test program to save and load a simulated 11x500,000 recarray. Average save and load times are given below (timeit with 20 repetitions).

The save time for sqlite is not really fair because I have to delete the data table each time before I create the new one. It is still pretty slow in comparison. Loading the recarray from sqlite is significantly slower than pytables or cPickle. I am hoping there may be more efficient ways to save and load recarrays from/to sqlite than what I am now doing. Note that I infer the variable names and types from the data rather than specifying them manually.

I'd luv to hear from people using sqlite, pytables, and cPickle about their experiences.

saving recarray with cPickle:  1.448568 sec/pass
saving recarray with pytable:  3.437228 sec/pass
saving recarray with sqlite: 193.286204 sec/pass

loading recarray using cPickle: 0.471365 sec/pass
loading recarray with pytable:  0.692838 sec/pass
loading recarray with sqlite:  15.977018 sec/pass

Best,

Vincent
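Vincent's attached test program isn't in the archive; as a rough sketch of what the cPickle and pytables parts of such a benchmark could look like (PyTables 2.x-era API from memory, file and node names invented):

import cPickle
import tables
import numpy as N

# stand-in for the simulated 11-column recarray
data = N.zeros(500000, dtype=[('col%d' % i, float) for i in range(11)])

# cPickle: dump/load the whole array
cPickle.dump(data, open('data.pickle', 'wb'), 2)
back = cPickle.load(open('data.pickle', 'rb'))

# pytables: a structured array passed to createTable defines and fills the table
h5 = tables.openFile('data.h5', 'w')
h5.createTable(h5.root, 'mydata', data)
h5.close()

h5 = tables.openFile('data.h5', 'r')
back = h5.root.mydata.read()      # comes back as a numpy structured array
h5.close()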
On Thu, Jul 19, 2007 at 09:42:42PM -0500, Vincent Nijs wrote:
I'd luv to hear from people using sqlite, pytables, and cPickle about their experiences.
I was about to point you to this discussion:

http://projects.scipy.org/pipermail/scipy-user/2007-April/011724.html

but I see that you participated in it.

I store data from each of my experimental runs with pytables. What I like about it is the hierarchical organization of the data, which allows me to save a complete description of the experiment, with strings, and extensible data structures. Another thing I like is that I can load this in Matlab (I can provide enhanced scripts for hdf5, if somebody wants them), and I think it is possible to read hdf5 in Origin. I don't use these software packages, but some colleagues do.

So I think the choice between pytables and cPickle boils down to whether you want to share the data with other software than Python or not.

Gaël
I want that Matlab script! I have colleagues with whom the least common denominator is currently .mat files. I'd be much happier if it was hdf5 files. Can you post it on the scipy wiki cookbook? (Or the pytables wiki?) Cheers! Andrew
On Fri, Jul 20, 2007 at 01:59:13AM -0700, Andrew Straw wrote:
I want that Matlab script!
I knew I really should put these things online; I have just been wanting to iron them a bit, but it has been almost two years since I have touched these, so ...

http://scipy.org/Cookbook/hdf5_in_Matlab

Feel free to improve them, and to write similar scripts in Python.

Gaël
Gael Varoquaux (on 2007-07-20 at 11:24:34 +0200) wrote:
I knew I really should put these things online; I have just been wanting to iron them a bit, but it has been almost two years since I have touched these, so ...
Wow, that looks really sweet and simple, useful code. Great!

--
Ivan Vilata i Balaguer, Cárabos Coop. V. -- http://www.carabos.com/
Gael,

Sounds very interesting! Would you mind sharing an example (with code if possible) of how you organize your experimental data in pytables? I have been thinking about how I might organize my data in pytables and would luv to hear how an experienced user does that.

Given the speed differences, it looks like pytables is going to be a better solution for my needs. Still curious however ... does no one on this list use (and like) sqlite? Could anyone suggest any other list where I might find users of python and sqlite (and numpy)?

Thanks,

Vincent
You could try the db-sig. You can get to the archives, and I imagine subscribe to it, from:

http://www.python.org/community/sigs/current/

I don't know if that'll be helpful for you, but I imagine that they know something about python + sqlite.

-- tim.hochberg@ieee.org
On Fri, Jul 20, 2007 at 08:35:51AM -0500, Vincent Nijs wrote:
Sounds very interesting! Would you mind sharing an example (with code if possible) of how you organize your experimental data in pytables. I have been thinking about how I might organize my data in pytables and would luv to hear how an experienced user does that.
I can show you the processing code. The experiment I have close to me is run by Matlab; the one that is fully controlled by Python is a continent away. Actually, I am really lazy, so I am just going to brutally copy the IO module.

Something that may be interesting is that the data is saved by the experiment control framework on a computer (called Krubcontrol). This data can then be retrieved using the "fetch_files" Python command, which puts it on the server and logs it into a database-like hash table. When we want to retrieve the data we have a special object, krubdata, which uses some fancy indexing to retrieve by date, or by specifying keywords.

I am sorry I am not providing the code that writes the hdf5 files; it is an incredibly useless mess, trust me. I wouldn't be able to factor the output code out of the 5K lines of Matlab. Hopefully you'll be able to get an idea of the structure of the hdf5 files by looking at the code that does the loading. I haven't worked with this data for a while, so I can't tell you much more.

Some of the Python code might be useful to others, especially the hashing and retrieving part. The reason why I didn't use a relational DB is that I simply don't trust them enough for my precious data.

Gaël
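Gael's actual layout isn't shown; purely as an illustration of the kind of hierarchical organization pytables allows (group, array, and attribute names invented, 2.x-era API):

import tables
import numpy as N

h5 = tables.openFile('experiments.h5', 'w')

# one group per experimental run, with free-form metadata stored as attributes
run = h5.createGroup(h5.root, 'run_2007_07_20_a', title='test run')
run._v_attrs.description = 'sweep of parameter X, operator: GV'
run._v_attrs.temperature = 4.2

# raw traces and derived results live side by side under the run
h5.createArray(run, 'raw_trace', N.random.rand(1000), title='raw signal')
h5.createArray(run, 'fit_params', N.array([1.0, 0.5]), title='fit results')

h5.close()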
Vincent,

On Friday, 20 July 2007 15:35, Vincent Nijs wrote:
Still curious however ... does no one on this list use (and like) sqlite?
First of all, while I'm not a heavy user of relational databases, I've used them as references for benchmarking purposes. Hence, based on my own benchmarking experience, I'd say that, for writing, relational databases take a lot of safety measures to ensure that all the data written to disk is safe and that the data relationships don't get broken, and that takes time (a lot of time, in fact). I'm not sure whether some of these safety measures can be relaxed, but even if some relational databases allowed this, my feeling (beware, I can be wrong) is that you won't be able to reach cPickle/PyTables speed (cPickle/PyTables don't observe safety measures in that regard because they are not intended for these tasks).

In this sense, the best writing speed that I was able to achieve with Postgres (I don't know whether sqlite supports this) is by simulating that your data comes from a file stream and using the "cursor.copy_from()" method. Using this approach I was able to accelerate the injection speed by about 10x (if I remember well), but even with this, PyTables can be another 10x faster. You can see an example of its usage in the Postgres backend [1] used for the benchmarks comparing PyTables and Postgres speeds.

Regarding reading speed, my diggings [2] seem to indicate that the bottleneck here is not related to safety, but to the need of the relational databases' pythonic APIs to wrap *every* element retrieved out of the database in a Python container (int, float, string...). On the contrary, PyTables takes advantage of creating an empty recarray as the container to keep all the retrieved data, and that's very fast compared with the former approach. To somewhat quantify this effect as a function of the size of the dataset retrieved, you can see figure 14 of [3] (as you can see, the larger the dataset retrieved, the larger the difference in terms of speed). Incidentally, and as is said there, I'm hoping that NumPy containers will eventually be discovered by relational database wrapper makers, so these wrapping times would be removed completely, but I'm currently not aware of any package taking this approach.

[1] http://www.pytables.org/trac/browser/trunk/bench/postgres_backend.py
[2] http://thread.gmane.org/gmane.comp.python.numeric.general/9704
[3] http://www.carabos.com/docs/OPSI-indexes.pdf

Cheers,
--
Francesc Altet, Cárabos Coop. V. -- http://www.carabos.com/
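A minimal sketch of the copy_from idea with psycopg2 (connection string, table, and columns invented; Francesc's actual benchmark code is the backend linked as [1] above):

import psycopg2
from cStringIO import StringIO
import numpy as N

data = N.zeros(1000, dtype=[('a', int), ('b', float)])   # stand-in data

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()
cur.execute("create table bench (a integer, b real)")

# stream the rows as tab-separated text; copy_from avoids per-row INSERTs
buf = StringIO()
for row in data:
    buf.write("%d\t%r\n" % (row['a'], row['b']))
buf.seek(0)
cur.copy_from(buf, 'bench', columns=('a', 'b'))
conn.commit()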
Another small note:

I'm pretty sure sqlite stores everything as strings. This just plain has to be slower than storing the raw binary representation (and may mean slight differences in fp values on the round-trip). HDF is designed for this sort of thing; sqlite is not.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception
Chris.Barker@noaa.gov
On Friday, 20 July 2007 20:16, Christopher Barker wrote:
Another small note:
I'm pretty sure sqlite stores everything as strings. This just plain has to be slower than storing the raw binary representation (and may mean slight differences in fp values on the round-trip). HDF is designed for this sort of thing; sqlite is not.
Yeah, that was the case with sqlite 2. However, starting with sqlite 3, the developers provided the ability to store integer and real numbers in a more compact format [1]. Sqlite 3 is the version included in Python 2.5 (the Python version that Vincent was benchmarking), so this shouldn't make a big difference compared with other relational databases.

[1] http://www.sqlite.org/datatype3.html

Cheers,
FYI

I asked a question about the load and save speed of recarrays using pickle vs pysqlite on the pysqlite list and got the response linked below. Doesn't look like sqlite can do much better than what I found.

http://lists.initd.org/pipermail/pysqlite/2007-July/001085.html

I also passed on Francesc's idea to use numpy containers in relational database wrappers such as pysqlite. This is apparently not possible since in a "relational database you don't know the type of the values in advance. Some values might be NULL" and "you might even have different types for the same column".

http://lists.initd.org/pipermail/pysqlite/2007-July/001087.html

I would assume the NULLs could be treated as missing values (?) Don't know about the different types in one column however.

Vincent
Vincent Nijs (on 2007-07-22 at 10:21:18 -0500) wrote:
[...] I would assume the NULL's could be treated as missing values (?) Don't know about the different types in one column however.
Maybe a masked array would do the trick, with NULL values masked out.

--
Ivan Vilata i Balaguer, Cárabos Coop. V. -- http://www.carabos.com/
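A small sketch of that idea with numpy's masked arrays (the column values are invented and nothing here comes from pysqlite itself):

import numpy.ma as ma

# values fetched from a hypothetical sqlite column, with NULLs arriving as None
raw = [1.5, None, 2.0, None, 3.25]

col = ma.array([0.0 if v is None else v for v in raw],
               mask=[v is None for v in raw])

avg = col.mean()       # masked entries are ignored: (1.5 + 2.0 + 3.25) / 3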
On Friday, 20 July 2007 04:42, Vincent Nijs wrote:
I am interested in using sqlite (or pytables) to store data for scientific research. I wrote the attached test program to save and load a simulated 11x500,000 recarray. Average save and load times are given below (timeit with 20 repetitions). The save time for sqlite is not really fair because I have to delete the data table each time before I create the new one. It is still pretty slow in comparison. Loading the recarray from sqlite is significantly slower than pytables or cPickle. I am hoping there may be more efficient ways to save and load recarrays from/to sqlite than what I am now doing. Note that I infer the variable names and types from the data rather than specifying them manually.

I'd luv to hear from people using sqlite, pytables, and cPickle about their experiences.

saving recarray with cPickle:  1.448568 sec/pass
saving recarray with pytable:  3.437228 sec/pass
saving recarray with sqlite: 193.286204 sec/pass

loading recarray using cPickle: 0.471365 sec/pass
loading recarray with pytable:  0.692838 sec/pass
loading recarray with sqlite:  15.977018 sec/pass
For a fairer comparison, and for large amounts of data, you should inform PyTables about the expected number of rows (see [1]) that you will end up feeding into the tables, so that it can choose the best chunksize for I/O purposes. I've redone the benchmarks (the new script is attached) with this 'optimization' on, and here are my numbers:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version: 2.0
HDF5 version: 1.6.5
NumPy version: 1.0.3
Zlib version: 1.2.3
LZO version: 2.01 (Jun 27 2005)
Python version: 2.5 (r25:51908, Nov 3 2006, 12:01:01)
[GCC 4.0.2 20050901 (prerelease) (SUSE Linux)]
Platform: linux2-x86_64
Byte-ordering: little
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Test saving recarray using cPickle:              0.197113 sec/pass
Test saving recarray with pytables:              0.234442 sec/pass
Test saving recarray with pytables (with zlib):  1.973649 sec/pass
Test saving recarray with pytables (with lzo):   0.925558 sec/pass

Test loading recarray using cPickle:             0.151379 sec/pass
Test loading recarray with pytables:             0.165399 sec/pass
Test loading recarray with pytables (with zlib): 0.553251 sec/pass
Test loading recarray with pytables (with lzo):  0.264417 sec/pass

As you can see, the differences between raw cPickle and PyTables are much smaller than when not informing about the total number of rows. In fact, an automatic optimization could easily be done in PyTables so that when the user passes a recarray, the total length of the recarray would be compared with the default number of expected rows (currently 10000), and if the former is larger, the length of the recarray would be chosen instead.

I also have added the times when using compression, just in case you are interested in using it. Here are the final file sizes:

$ ls -sh data
total 132M
 24M data-lzo.h5
 43M data-None.h5
 43M data.pickle
 25M data-zlib.h5

Of course, this is using completely random data, but with real data the compression levels are expected to be higher than this.

[1] http://www.pytables.org/docs/manual/ch05.html#expectedRowsOptim

Cheers,
--
Francesc Altet, Cárabos Coop. V. -- http://www.carabos.com/
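A minimal sketch of the expectedrows hint (and of the compression filters behind the zlib/lzo rows), with invented names and the 2.x-era API; this is not the attached benchmark script:

import tables
import numpy as N

data = N.zeros(500000, dtype=[('a', float), ('b', int)])

# telling createTable how many rows to expect lets it pick a good chunksize
h5 = tables.openFile('data-None.h5', 'w')
h5.createTable(h5.root, 'mydata', data, expectedrows=len(data))
h5.close()

# same thing with zlib compression enabled
h5 = tables.openFile('data-zlib.h5', 'w')
h5.createTable(h5.root, 'mydata', data, expectedrows=len(data),
               filters=tables.Filters(complevel=5, complib='zlib'))
h5.close()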
Thanks Francesc! That does work much better:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version: 2.0
HDF5 version: 1.6.5
NumPy version: 1.0.4.dev3852
Zlib version: 1.2.3
BZIP2 version: 1.0.2 (30-Dec-2001)
Python version: 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)]
Platform: darwin-Power Macintosh
Byte-ordering: big
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Test saving recarray using cPickle:               1.620880 sec/pass
Test saving recarray with pytables:               2.074591 sec/pass
Test saving recarray with pytables (with zlib):  14.320498 sec/pass

Test loading recarray using cPickle:              1.023015 sec/pass
Test loading recarray with pytables:              0.882411 sec/pass
Test loading recarray with pytables (with zlib):  3.692698 sec/pass
Elegant solution. Very readable, and it takes care of row0 nicely.

I want to point out that this is much more efficient than my version for random/late string-representation changes throughout the conversion, but it suffers from a 2*n memory footprint and large block copying if the string-rep change arrives very early on huge datasets. I think we can't have the best of both, and Tim's solution is better in the general case.

Maybe "use one_alt if rownumber < xxx else use other_alt" can fine-tune performance for some cases, but even then, with many cols, it's nearly impossible to know.

//Torgil
On 7/9/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:

Elegant solution. Very readable and takes care of row0 nicely.

I want to point out that this is much more efficient than my version for random/late string representation changes throughout the conversion but it suffers from 2*n memory footprint and large block copying if the string rep changes arrive very early on huge datasets.

Yep.

I think we can't have best of both and Tim's solution is better in the general case.

It probably would not be hard to do a hybrid version. One issue is that one doesn't, in general, know the size of the dataset in advance, so you'd have to use an absolute criterion (less than 100 lines) instead of a relative criterion (less than 20% done). I suppose you could stat the file or something, but that seems like overkill.

Maybe "use one_alt if rownumber < xxx else use other_alt" can fine-tune performance for some cases, but even then, with many cols, it's nearly impossible to know.

That sounds sensible. I have an interesting thought on how to do this that's a bit hard to describe. I'll try to throw it together and post another version today or tomorrow.
On 7/9/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
Thanks for looking into this Torgil! I agree that this is a much more complicated setup. I'll check if there is anything I can do on the
end.
Otherwise I'll go with Timothy's suggestion and read in numbers as floats and convert to int later as needed.
Here is a strategy that should allow auto detection without too much in
way of inefficiency. The basic idea is to convert till you run into a problem, store that data away, and continue the conversion with a new
data the dtype.
At the end you assemble all the chunks of data you've accumulated into one large array. It should be reasonably efficient in terms of both memory and speed.
The implementation is a little rough, but it should get the idea across.
-- . __ . |-\ . . tim.hochberg@ieee.org
========================================================================
def find_formats(items, last): formats = [] for i, x in enumerate(items): dt, cvt = string_to_dt_cvt(x) if last is not None: last_cvt, last_dt = last[i] if last_cvt is float and cvt is int: cvt = float formats.append((dt, cvt)) return formats
class LoadInfo(object): def __init__(self, row0): self.done = False self.lastcols = None self.row0 = row0
def data_iterator(lines, converters, delim, info): yield tuple(f(x) for f, x in zip(converters, info.row0.split (delim))) try: for row in lines: yield tuple(f(x) for f, x in zip(converters, row.split (delim))) except: info.row0 = row else: info.done = True
def load2(fname,delim = ',', has_varnm = True, prn_report = True): """ Loading data from a file using the csv module. Returns a recarray. """ f=open(fname,'rb')
if has_varnm: varnames = [i.strip() for i in f.next().split(delim)] else: varnames = None
info = LoadInfo(f.next()) chunks = []
while not info.done: row0 = info.row0.split(delim) formats = find_formats(row0, info.lastcols ) if varnames is None: varnames = varnm = ['col%s' % str(i+1) for i, _ in enumerate(formate)] descr=[] conversion_functions=[] for name, (dtype, cvt_fn) in zip(varnames, formats): descr.append((name,dtype)) conversion_functions.append(cvt_fn)
        chunks.append(N.fromiter(data_iterator(f, conversion_functions, delim, info), descr))
    if len(chunks) > 1:
        n = sum(len(x) for x in chunks)
        data = N.zeros([n], chunks[-1].dtype)
        offset = 0
        for x in chunks:
            delta = len(x)
            data[offset:offset+delta] = x
            offset += delta
    else:
        [data] = chunks
    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        print "\n##########################################\n"
    return data
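(Editorial aside, not part of the message:) a hypothetical end-to-end run of the load2 above. The file name, its contents, and the string_to_dt_cvt sketch shown earlier are all assumptions, not part of the thread:

    # build a tiny file whose 'value' column switches from int-looking
    # to float-looking part way through
    f = open('example.csv', 'wb')
    f.write('id,value\n')
    f.write('1,10\n')
    f.write('2,20\n')
    f.write('3,2.5\n')
    f.close()

    data = load2('example.csv')
    print data.dtype        # 'value' should come out as float
    print data['value']     # roughly: [ 10.   20.    2.5]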
On 7/9/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
On 7/9/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
Elegant solution. Very readable and takes care of row0 nicely.
I want to point out that this is much more efficient than my version for random/late string representation changes throughout the conversion, but it suffers from a 2*n memory footprint and large block copies if the string representation change arrives very early on huge datasets.
Yep.
I think we can't have the best of both, and Tim's solution is better in the general case.
It probably would not be hard to do a hybrid version. One issue is that one doesn't, in general, know the size of the dataset in advance, so you'd have to use an absolute criterion (less than 100 lines) instead of a relative one (less than 20% done). I suppose you could stat the file or something, but that seems like overkill.
Maybe "use one_alt if rownumber < xxx else use other_alt" can fine-tune performance for some cases. But even then, with many columns, it's nearly impossible to know.
That sounds sensible. I have an interesting thought on how to do this that's a bit hard to describe. I'll try to throw it together and post another version today or tomorrow.
OK, as promised, here's an approach that rebuilds the array if the format changes, as long as fewer than 'restart_length' lines have been processed. Otherwise, it uses the old strategy. Perhaps not the most efficient way, but it reuses what I'd already written with minimal changes. It's still pretty rough -- once again I didn't bother to polish it.

def find_formats(items, last):
    formats = []
    for i, x in enumerate(items):
        dt, cvt = string_to_dt_cvt(x)
        if last is not None:
            last_cvt, last_dt = last[i]
            if last_cvt is float and cvt is int:
                cvt = float
        formats.append((dt, cvt))
    return formats

class LoadInfo(object):
    def __init__(self, row0):
        self.done = False
        self.lastcols = None
        self.row0 = row0
        self.predata = ()

def data_iterator(lines, converters, delim, info):
    for x in info.predata:
        yield x
    info.predata = ()
    yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
    try:
        for row in lines:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
    except:
        info.row0 = row
    else:
        info.done = True

def load2(fname, delim=',', has_varnm=True, prn_report=True, restart_length=20):
    """
    Load data from a delimited text file. Returns a recarray.
    """
    f = open(fname, 'rb')

    if has_varnm:
        varnames = [i.strip() for i in f.next().split(delim)]
    else:
        varnames = None

    info = LoadInfo(f.next())
    chunks = []

    while not info.done:
        row0 = info.row0.split(delim)
        formats = find_formats(row0, info.lastcols)
        if varnames is None:
            varnames = ['col%s' % str(i+1) for i, _ in enumerate(formats)]
        descr = []
        conversion_functions = []
        for name, (dtype, cvt_fn) in zip(varnames, formats):
            descr.append((name, dtype))
            conversion_functions.append(cvt_fn)

        if len(chunks) == 1 and len(chunks[0]) < restart_length:
            info.predata = chunks[0].astype(descr)
            chunks = []

        chunks.append(N.fromiter(data_iterator(f, conversion_functions, delim, info), descr))

    if len(chunks) > 1:
        n = sum(len(x) for x in chunks)
        data = N.zeros([n], chunks[-1].dtype)
        offset = 0
        for x in chunks:
            delta = len(x)
            data[offset:offset+delta] = x
            offset += delta
    else:
        [data] = chunks

    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        print "\n##########################################\n"

    return data
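(Editorial aside, not part of the message:) the key new piece is the restart: when the format changes within the first restart_length rows, the rows already parsed are re-cast to the widened dtype (chunks[0].astype(descr)) and replayed through data_iterator via info.predata, so the early rows get folded into the new chunk rather than merged at the end. A small stand-alone illustration of that cast, with made-up field names and assuming numpy is imported as N:

    import numpy as N

    old = N.array([(1, 10), (2, 20)], dtype=[('id', int), ('value', int)])
    widened = old.astype([('id', int), ('value', float)])   # cast field by field
    print widened['value']   # roughly: [ 10.  20.]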
participants (11)
- Andrew Straw
- Christopher Barker
- Francesc Altet
- Gael Varoquaux
- Ivan Vilata i Balaguer
- John Hunter
- Timothy Hochberg
- Tom Denniston
- Torgil Svensson
- Travis Oliphant
- Vincent Nijs