speed question, reading csv using takewhile() and dropwhile()

Vincent Davis vincent at vincentdavis.net
Sat Feb 20 20:32:28 EST 2010


Thanks again for the comments. I am not sure I will implement all of it, but I
will separate out the "if not row" check. The files have some extraneous blank
rows in the middle that I need to be sure not to import as blank rows.
I am actually having trouble with this filling my system memory. I posted a
separate question about it, with a subject along the lines of "Why is this
filling my sys memory".
It might be that my 1-year-old son has been trying to help for the last hour.
It is very distracting.
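
For the blank-row handling mentioned above, here is a minimal sketch of one way the check could be separated out, assuming Python 3. The helper name `non_blank_rows` and the sample text are hypothetical, not taken from the real files. Note that it also drops rows made up only of empty fields, which is stricter than a plain `if not row` test.

```python
import csv
import io


def non_blank_rows(reader):
    """Yield only the rows that have at least one non-empty field."""
    for row in reader:
        # csv.reader gives [] for an empty line. A line holding only tabs
        # gives a list of empty strings, which is treated as blank here too
        # (an assumption; a plain "if not row" would keep such a row).
        if any(field.strip() for field in row):
            yield row


# Hypothetical sample standing in for one of the tab-delimited files.
sample = "a\t1\n\nb\t2\n\t\n[MASKS]\nm\t3\n"
rows = list(non_blank_rows(csv.reader(io.StringIO(sample), delimiter="\t")))
```

Because the helper is a generator, it passes rows along one at a time and can be wrapped around the reader before any of the section logic runs.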

Vincent Davis
720-301-3003
vincent at vincentdavis.net
my blog <http://vincentdavis.net> | LinkedIn <http://www.linkedin.com/in/vincentdavis>


On Sat, Feb 20, 2010 at 6:18 PM, Jonathan Gardner <
jgardner at jonathangardner.net> wrote:

> On Sat, Feb 20, 2010 at 4:21 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>
>> Thanks for the help, this is considerably faster and easier to read (see
>> below). I changed it to avoid the "break", which I think makes it easier to
>> understand. Checking the conditions on every row slows it down, but that is
>> worth it to me at this time.
>>
>>
> It seems you are beginning to understand that programmer time is more
> valuable than machine time. Congratulations.
>
>
>
>> import csv
>>
>> def read_data_file(filename):
>>     reader = csv.reader(open(filename, "U"), delimiter='\t')
>>
>>     data = []
>>     mask = []
>>     outliers = []
>>     modified = []
>>
>>     data_append = data.append
>>     mask_append = mask.append
>>     outliers_append = outliers.append
>>     modified_append = modified.append
>>
>>
>
> I know some people do this to speed things up. Really, I don't think it's
> necessary or wise to do so.
>
>
>>     maskcount = 0
>>     outliercount = 0
>>     modifiedcount = 0
>>
>>     for row in reader:
>>         if '[MASKS]' in row:
>>             maskcount += 1
>>         if '[OUTLIERS]' in row:
>>             outliercount += 1
>>         if '[MODIFIED]' in row:
>>             modifiedcount += 1
>>         if not any((maskcount, outliercount, modifiedcount, not row)):
>>             data_append(row)
>>         elif not any((outliercount, modifiedcount, not row)):
>>             mask_append(row)
>>         elif not any((modifiedcount, not row)):
>>             outliers_append(row)
>>         else:
>>             if row: modified_append(row)
>>
>>     return data, mask, outliers, modified
>>
>>
>
> Just playing with the logic here:
>
> 1. Notice that if "not row" is True, nothing happens? Pull it out
> explicitly.
>
> 2. Notice how it switches from mode to mode? Program it more explicitly.
>
> Here's my suggestion:
>
> def parse_masks(reader):
>     for row in reader:
>         if not row: continue
>         elif '[OUTLIERS]' in row: parse_outliers(reader)
>         elif '[MODIFIED]' in row: parse_modified(reader)
>         else: masks.append(row)
>
> def parse_outliers(reader):
>     for row in reader:
>         if not row: continue
>         elif '[MODIFIED]' in row: parse_modified(reader)
>         else: outliers.append(row)
>
> def parse_modified(reader):
>     for row in reader:
>         if not row: continue
>         modified.append(row)
>
> for row in reader:
>     if not row: continue
>     elif '[MASKS]' in row: parse_masks(reader)
>     elif '[OUTLIERS]' in row: parse_outliers(reader)
>     elif '[MODIFIED]' in row: parse_modified(reader)
>     else: data.append(row)
>
> Since there is global state involved, you may want to save yourself some
> trouble in the future and put the above in a class where separate parsers
> can be kept separate.
>
> It looks like your program is turning into a regular old parser. Any format
> that is a little more than trivial to parse will need a real parser like the
> above.
>
> --
> Jonathan Gardner
> jgardner at jonathangardner.net
>
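
The suggestion above to keep the parser state in a class could look roughly like the following. This is only a sketch under stated assumptions: it is written for Python 3, the names `SectionParser`, `feed`, and `parse` are made up for illustration, and the sample text is hypothetical. Unlike the nested functions in the quoted message, it lets any marker switch sections in any order, and it does not store the marker rows themselves.

```python
import csv
import io


class SectionParser:
    """Sort rows into data / masks / outliers / modified sections."""

    # Marker field -> section name. The marker strings come from the thread;
    # the section names are illustrative.
    MARKERS = {
        "[MASKS]": "masks",
        "[OUTLIERS]": "outliers",
        "[MODIFIED]": "modified",
    }

    def __init__(self):
        self.sections = {"data": [], "masks": [], "outliers": [], "modified": []}
        self.current = "data"  # rows before any marker are plain data

    def feed(self, row):
        if not row:
            return  # skip blank rows
        for marker, name in self.MARKERS.items():
            if marker in row:
                self.current = name  # switch mode; the marker row is dropped
                return
        self.sections[self.current].append(row)

    def parse(self, reader):
        for row in reader:
            self.feed(row)
        return self.sections


# Hypothetical sample standing in for a real file.
sample = "x\t1\n\n[MASKS]\nm\t2\n[OUTLIERS]\no\t3\n\n[MODIFIED]\nd\t4\n"
sections = SectionParser().parse(csv.reader(io.StringIO(sample), delimiter="\t"))
```

Each `SectionParser` instance owns its own lists, so several files can be parsed side by side without sharing global state.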