On 7/8/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
On 7/8/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
Given that both your script and the mlab version preload the whole file before calling the numpy constructor, I'm curious how that compares in speed to using numpy's fromiter function on your data. Using fromiter should improve memory usage (~50%?).
The drawback is for string columns, where we no longer know the width of the largest item. I made it fall back to "object" in this case.
Attached is a fromiter version of your script. Possible speedups could come from trying different approaches to the "convert_row" function, for example using "zip" or "enumerate" instead of "izip".
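For anyone following along, here is a minimal sketch of the fromiter approach being discussed: feeding a generator of tuples into numpy.fromiter with a structured dtype, so rows are converted one at a time instead of preloading the whole file. The sample data and column layout here are made up for illustration, not taken from the attached script.

```python
import numpy as np

# Stand-in for the lines of an open delimited file (hypothetical data).
lines = ["1\t2.5", "3\t4.5"]

def rows(lines, delim="\t"):
    # Convert each text row to a typed tuple lazily, one row at a time.
    for line in lines:
        a, b = line.split(delim)
        yield int(a), float(b)

# fromiter consumes the generator directly; with a structured dtype it
# builds a record array without an intermediate list of rows.
arr = np.fromiter(rows(lines), dtype=[("a", int), ("b", float)])
```

Passing an explicit count= to fromiter (when the number of rows is known up front) lets numpy preallocate the array and avoids reallocation as it grows.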
I suspect that you'd do better here if you removed a bunch of layers from the conversion functions. Right now it looks like: imap->chain->convert_row->tuple->generator->izip. That's five levels deep and Python functions are reasonably expensive. I would try to be a lot less clever and do something like:
def data_iterator(row_iter, delim):
    row0 = row_iter.next().split(delim)
    converters = find_formats(row0)  # left as an exercise
    yield tuple(f(x) for f, x in zip(converters, row0))
    for row in row_iter:
        yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
That sounds sane. I've maybe picked up some bad habits here and gotten away with it since I'm very I/O-bound in these cases. My main objective has been reducing the memory footprint to reduce swapping.
That's just a sketch and I haven't timed it, but it cuts a few levels out of the call chain, so it has a reasonable chance of being faster. If you wanted to be really clever, you could use some exec magic after you figure out the conversion functions to compile a special function that generates the tuples directly without any use of tuple or zip. I don't have time to work through the details right now, but the code you would compile would end up looking like this:
for (x0, x1, x2) in row_iter:
    yield (int(x0), float(x1), float(x2))
Here we've assumed that find_formats determined that there are three fields: an int and two floats. Once you have this info you can build an appropriate function and exec it. This would cut another couple of levels out of the call chain. Again, I haven't timed it, or tried it, but it looks like it would be fun to try.
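One possible shape for the exec trick, sketched here with made-up names (build_iterator is hypothetical, and the column types are passed in by name rather than inferred by a find_formats step): build the source of a specialized generator as a string, exec it, and pull the compiled function out of the namespace.

```python
def build_iterator(type_names):
    # type_names like ["int", "float", "float"]; each must name a
    # callable reachable from the exec'd code (builtins work).
    args = ", ".join("x%d" % i for i in range(len(type_names)))
    convs = ", ".join("%s(x%d)" % (t, i) for i, t in enumerate(type_names))
    src = (
        "def data_iterator(row_iter, delim):\n"
        "    for row in row_iter:\n"
        "        %s = row.split(delim)\n"
        "        yield (%s)\n" % (args, convs)
    )
    namespace = {}
    exec(src, namespace)          # compile the specialized generator
    return namespace["data_iterator"]

it = build_iterator(["int", "float", "float"])
rows = list(it(iter(["1,2.5,3.5", "4,5.5,6.5"]), ","))
```

The generated body unpacks and converts in a single statement per row, so there is no per-row tuple/zip machinery left in the hot loop.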
-tim
Thank you for the lesson! Great tip. This opens up a variety of new coding options. I've made an attempt on the fun part. Attached is a version that generates the following generator code for Vincent's __name__ == '__main__' code:

def get_data_iterator(row_iter, delim):
    yield (int('1'), int('3'), datestr2num('1/97'), float('1.12'), float('2.11'), float('1.2'))
    for row in row_iter:
        x0, x1, x2, x3, x4, x5 = row.split(delim)
        yield (int(x0), int(x1), datestr2num(x2), float(x3), float(x4), float(x5))

Best Regards,
//Torgil