speed question, reading csv using takewhile() and dropwhile()

Vincent Davis vincent at vincentdavis.net
Fri Feb 19 22:58:41 CET 2010


In reference to the several comments that "[x for x in read] is basically a
copy of the entire list. This isn't necessary." (and likewise list(read)): I had
thought there was a problem with passing iterators to takewhile(). I
thought I had tested it and it didn't work. It seems I was wrong; it clearly
works. I'll make this change and see if it is any better.
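
For reference, a minimal sketch of passing the csv reader straight to
takewhile(), with no intermediate list. It assumes Python 3, and the sample
text is a hypothetical stand-in for one of the real files. One thing to watch:
takewhile() consumes the first row that fails the test (here the '[MASKS]'
row), so a later dropwhile() on the same reader will never see that marker.

```python
import csv
import io
from itertools import takewhile

# Hypothetical sample, not the real file layout: a header row, two data
# rows, then the '[MASKS]' marker and one mask row.
SAMPLE = "name\tvalue\n1\t2\n3\t4\n[MASKS]\nm1\tm2\n"

reader = csv.reader(io.StringIO(SAMPLE), delimiter="\t")

# Skip the header row up front; this stands in for the [1:] slice.
next(reader)

# The reader is an iterator, so takewhile() pulls rows lazily from it.
data = list(takewhile(lambda row: "[MASKS]" not in row, reader))

# takewhile() has already consumed the '[MASKS]' row itself, so the reader
# resumes at the first row after the marker.
rest = list(reader)
```
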

I actually don't plan to read them all in at once, only as needed, but I do
need each whole file in an array to perform some mathematics on it and
compare different files. So my interest was in making it faster to open them
as needed. The files are about 5 MB each, so part of the time may simply be
disk speed.
Thanks
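
On the disk-speed guess above, one way to check is to time the raw read and
the csv parsing separately. This is only a sketch, assuming Python 3; the
function name time_read_and_parse and the temporary file are made up for
illustration, and the numbers will depend on the machine and on OS caching.

```python
import csv
import os
import tempfile
import time


def time_read_and_parse(path):
    """Return (read seconds, parse seconds, row count) for one file."""
    start = time.perf_counter()
    with open(path, newline="") as f:
        text = f.read()  # raw read, no parsing yet
    read_seconds = time.perf_counter() - start

    start = time.perf_counter()
    rows = list(csv.reader(text.splitlines(), delimiter="\t"))
    parse_seconds = time.perf_counter() - start
    return read_seconds, parse_seconds, len(rows)


# Hypothetical stand-in for one of the real files: 1000 tab-separated rows.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, newline=""
) as tmp:
    for i in range(1000):
        tmp.write(f"{i}\t{i * 2}\n")
    sample_path = tmp.name

read_seconds, parse_seconds, row_count = time_read_and_parse(sample_path)
os.remove(sample_path)
```

If the parse time dominates, the disk is probably not the bottleneck.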

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Fri, Feb 19, 2010 at 2:13 PM, Jonathan Gardner <
jgardner at jonathangardner.net> wrote:

> On Fri, Feb 19, 2010 at 10:22 AM, Vincent Davis <vincent at vincentdavis.net> wrote:
>
>> I have some (~50) text files that have about 250,000 rows each. I am
>> reading them in using the following, which gets me what I want, but it is not
>> fast. Is there something I am missing that would help? This is mostly a
>> question to help me learn more about Python. It takes about 4 min right now.
>>
>> def read_data_file(filename):
>>     reader = csv.reader(open(filename, "U"),delimiter='\t')
>>     read = list(reader)
>>
>
> You're slurping the entire file here when it's not necessary.
>
>
>>     data_rows = takewhile(lambda trow: '[MASKS]' not in trow, [x for x in
>> read])
>>
>
> [x for x in read] is basically a copy of the entire list. This isn't
> necessary.
>
>
>>      data = [x for x in data_rows][1:]
>>
>>
>
> Again, copying here is unnecessary.
>
> [x for x in y] isn't the idiom for copying in Python. If you really need a
> copy of a list, x = y[:] is the idiom.
>
>
>>     mask_rows = takewhile(lambda trow: '[OUTLIERS]' not in trow,
>> list(dropwhile(lambda drow: '[MASKS]' not in drow, read)))
>>
>
>>     mask = [row for row in mask_rows if row][3:]
>>
>
> Here's another unnecessary array copy.
>
>
>>
>>     outlier_rows = dropwhile(lambda drows: '[OUTLIERS]' not in drows,
>> read)
>>     outlier = [row for row in outlier_rows if row][3:]
>>
>
>
> And another.
>
> Just because you're using Python doesn't mean you get to be silly in how
> you move data around. Avoid copies as much as possible, and try to avoid
> slurping in large files all at once. Line-by-line processing is best.
>
> I think you should invert this operation into a for loop. Most people tend
> to think of things better that way than chained iterators. It also helps you
> to not duplicate data when it's unnecessary.
>
> --
> Jonathan Gardner
> jgardner at jonathangardner.net
>
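
The for-loop version suggested above could look roughly like the sketch below.
It assumes Python 3 and makes one pass over the rows without holding a copy of
the whole file. The skip counts (1, 3, 3) come from the slices in the original
post, and empty rows are dropped only in the mask and outlier sections, as the
original code did. The SKIP table and the sample text are hypothetical.

```python
import csv
import io

# Leading rows to drop per section, taken from the [1:] and [3:] slices in
# the original post. For mask and outlier the count includes the marker row.
SKIP = {"data": 1, "mask": 3, "outlier": 3}


def read_data_file(lines):
    """Split tab-separated rows into (data, mask, outlier) in one pass."""
    sections = {"data": [], "mask": [], "outlier": []}
    current = "data"
    for row in csv.reader(lines, delimiter="\t"):
        # A marker row switches the section and is stored as its first row.
        if "[MASKS]" in row:
            current = "mask"
        elif "[OUTLIERS]" in row:
            current = "outlier"
        # The original kept empty rows only in the data section.
        if row or current == "data":
            sections[current].append(row)
    for name, count in SKIP.items():
        del sections[name][:count]  # drop leading rows in place
    return sections["data"], sections["mask"], sections["outlier"]


# Hypothetical sample in the layout the thread describes.
SAMPLE = (
    "hdr\tx\n1\t2\n3\t4\n"
    "[MASKS]\nmh1\nmh2\n\nm\t1\n"
    "[OUTLIERS]\noh1\noh2\no\t1\n"
)
data, mask, outlier = read_data_file(io.StringIO(SAMPLE))
```

With a real file, pass the open file object instead of the StringIO, for
example: with open(filename, newline="") as f: read_data_file(f).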


More information about the Python-list mailing list