speed question, reading csv using takewhile() and dropwhile()

Fri Feb 19 16:13:09 EST 2010

On Fri, Feb 19, 2010 at 10:22 AM, Vincent Davis <vincent at vincentdavis.net> wrote:

> I have some (~50) text files that have about 250,000 rows each. I am
> reading them in using the following, which gets me what I want. But it is not
> fast. Is there something I am missing that would help? This is mostly a
> question to help me learn more about Python. It takes about 4 min right now.
>
> def read_data_file(filename):
>     reader = csv.reader(open(filename, "U"),delimiter='\t')
>     read = list(reader)
>

You're slurping the entire file into memory here with list(reader) when
it's not necessary. The reader can be iterated over directly.

>     data_rows = takewhile(lambda trow: '[MASKS]' not in trow,
>                           [x for x in read])
>

[x for x in read] is basically a copy of the entire list. This isn't
necessary: takewhile() accepts any iterable, so you can pass read directly.

>     data = [x for x in data_rows][1:]
>
>

Again, copying here is unnecessary.
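
As a rough sketch of the lazy version (Python 3 here; the sample text is a
made-up stand-in for one of your files), takewhile() can read straight from
the csv reader, and islice() skips the header row without building a
throwaway list first:

```python
import csv
import io
from itertools import islice, takewhile

# Hypothetical miniature of one input file, held in memory for the demo.
sample = io.StringIO("header\n1\t2\n3\t4\n[MASKS]\nm1\n")
reader = csv.reader(sample, delimiter="\t")

# takewhile() pulls rows from the reader one at a time; nothing is copied.
data_rows = takewhile(lambda row: "[MASKS]" not in row, reader)

# islice(..., 1, None) drops the first row lazily, replacing [...][1:].
data = list(islice(data_rows, 1, None))
```

One catch: takewhile() consumes the '[MASKS]' row itself while testing it,
so whatever reads from the reader next starts on the row after the marker.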

[x for x in y] isn't an idiom in Python. If you really need a copy of a
list, x = y[:] (or x = list(y)) is the idiom.
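
For reference, a small sketch of the copy idioms (the names are just for
illustration). Both forms make a shallow copy: a new list holding the same
row objects.

```python
y = [["a", "1"], ["b", "2"]]  # hypothetical rows

x = y[:]        # slice copy
z = list(y)     # same result, arguably easier to read

# The copies are equal to the original but are distinct list objects...
assert x == y and x is not y
# ...while the rows inside are shared, not duplicated.
assert x[0] is y[0]
```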

>     mask_rows = takewhile(
>         lambda trow: '[OUTLIERS]' not in trow,
>         list(dropwhile(lambda drow: '[MASKS]' not in drow, read)))
>

>     mask = [row for row in mask_rows if row][3:]
>

Here's another unnecessary list copy.

>
>     outlier_rows = dropwhile(lambda drows: '[OUTLIERS]' not in drows, read)
>     outlier = [row for row in outlier_rows if row][3:]
>

And another.

Just because you're using Python doesn't mean you get to be silly in how you
move data around. Avoid copies as much as possible, and try to avoid
slurping in large files all at once. Line-by-line processing is best.

I think you should rewrite this as a single for loop over the reader that
keeps track of which section it is in. Most people find that easier to
reason about than chained iterators. It also helps you avoid duplicating
data when it's unnecessary.
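
Something along these lines, as a sketch rather than a drop-in replacement.
It assumes Python 3 and the file layout implied by your code: data rows
first, then a '[MASKS]' section, then an '[OUTLIERS]' section. The skip
counts (1 and 3) simply mirror your [1:] and [3:] slices, and blank rows
are dropped only in the mask and outlier sections, as in your version.

```python
import csv


def read_data_file(filename):
    """Split one tab-delimited file into data, mask and outlier rows in one pass."""
    sections = {"data": [], "mask": [], "outlier": []}
    current = "data"
    # newline="" is what the csv module docs recommend for files given to csv.reader.
    with open(filename, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            # A marker row switches sections and is stored in the new one,
            # just as dropwhile() yields the marker row in the original.
            if "[MASKS]" in row:
                current = "mask"
            elif "[OUTLIERS]" in row:
                current = "outlier"
            # The original filtered out blank rows only for masks and outliers.
            if row or current == "data":
                sections[current].append(row)
    # Drop the leading header rows of each section (same offsets as the original).
    return sections["data"][1:], sections["mask"][3:], sections["outlier"][3:]
```

Each row is read once and stored once. The three slices at the end still
copy lists of references, which is cheap next to re-reading 250,000 rows
several times; if that matters, del lst[:n] would trim the lists in place.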

-- 
Jonathan Gardner
jgardner at jonathangardner.net