speed question, reading csv using takewhile() and dropwhile()
Vincent Davis
vincent at vincentdavis.net
Sat Feb 20 20:32:28 EST 2010
Thanks again for the comments. I am not sure I will implement all of them, but
I will separate out the "if not row" check. The files have some extraneous
blank rows in the middle, and I need to be sure not to import them as blank
rows.
I am actually having trouble with this filling my system memory. I posted a
separate question about it, with a subject along the lines of "Why is this
filling my sys memory".
It might be that my 1-year-old son has been trying to help for the last hour.
It is very distracting.
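
For reference, pulling the blank-row check out of the main loop could look roughly like the sketch below. The helper name non_blank_rows and the sample text are made up for illustration, and treating rows that hold only empty or whitespace fields as blank is an assumption that goes slightly beyond a plain "if not row". It is written for Python 3.

```python
import csv
import io


def non_blank_rows(reader):
    """Yield only rows that contain at least one non-empty field."""
    for row in reader:
        # csv.reader returns [] for an empty line; a line holding only
        # delimiters returns a list of empty strings, so check the fields.
        if any(field.strip() for field in row):
            yield row


# Hypothetical sample standing in for one of the tab-delimited files.
sample = "a\tb\n\n1\t2\n\t\n[MASKS]\n3\t4\n"
rows = list(non_blank_rows(csv.reader(io.StringIO(sample), delimiter="\t")))
```

Because the helper is a generator, it filters rows one at a time and does not build an extra copy of the file in memory; only the final list() call here collects them.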
On Sat, Feb 20, 2010 at 6:18 PM, Jonathan Gardner <jgardner at jonathangardner.net> wrote:
> On Sat, Feb 20, 2010 at 4:21 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>
>> Thanks for the help, this is considerably faster and easier to read (see
>> below). I changed it to avoid the "break", which I think makes it easier to
>> understand. Checking the conditions on every row slows it down, but that is
>> worth it to me at this time.
>>
>>
> It seems you are beginning to understand that programmer time is more
> valuable than machine time. Congratulations.
>
>
>
>> def read_data_file(filename):
>>     reader = csv.reader(open(filename, "U"),delimiter='\t')
>>
>>     data = []
>>     mask = []
>>     outliers = []
>>     modified = []
>>
>>     data_append = data.append
>>     mask_append = mask.append
>>     outliers_append = outliers.append
>>     modified_append = modified.append
>>
>>
>
> I know some people do this to speed things up. Really, I don't think it's
> necessary or wise to do so.
>
>
>>     maskcount = 0
>>     outliercount = 0
>>     modifiedcount = 0
>>
>>     for row in reader:
>>         if '[MASKS]' in row:
>>             maskcount += 1
>>         if '[OUTLIERS]' in row:
>>             outliercount += 1
>>         if '[MODIFIED]' in row:
>>             modifiedcount += 1
>>         if not any((maskcount, outliercount, modifiedcount, not row)):
>>             data_append(row)
>>         elif not any((outliercount, modifiedcount, not row)):
>>             mask_append(row)
>>         elif not any((modifiedcount, not row)):
>>             outliers_append(row)
>>         else:
>>             if row: modified_append(row)
>>
>>
>
> Just playing with the logic here:
>
> 1. Notice that if "not row" is True, nothing happens? Pull it out
> explicitly.
>
> 2. Notice how it switches from mode to mode? Program it more explicitly.
>
> Here's my suggestion:
>
> # Assumes module-level lists: data, masks, outliers, modified.
> # Each parser consumes the rest of the reader, handing off to the
> # next parser when it meets that section's marker row.
>
> def parse_masks(reader):
>     for row in reader:
>         if not row: continue
>         elif '[OUTLIERS]' in row: parse_outliers(reader)
>         elif '[MODIFIED]' in row: parse_modified(reader)
>         else: masks.append(row)
>
> def parse_outliers(reader):
>     for row in reader:
>         if not row: continue
>         elif '[MODIFIED]' in row: parse_modified(reader)
>         else: outliers.append(row)
>
> def parse_modified(reader):
>     for row in reader:
>         if not row: continue
>         modified.append(row)
>
> # Rows before the first marker are plain data.
> for row in reader:
>     if not row: continue
>     elif '[MASKS]' in row: parse_masks(reader)
>     elif '[OUTLIERS]' in row: parse_outliers(reader)
>     elif '[MODIFIED]' in row: parse_modified(reader)
>     else: data.append(row)
>
> Since there is global state involved, you may want to save yourself some
> trouble in the future and put the above in a class where separate parsers
> can be kept separate.
>
> It looks like your program is turning into a regular old parser. Any format
> that is a little more than trivial to parse will need a real parser like the
> above.
>
> --
> Jonathan Gardner
> jgardner at jonathangardner.net
>
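
As a rough illustration of the suggestion above to keep the parser state in a class instead of in globals, here is a minimal Python 3 sketch. It is not the code from the thread: the class name SectionedFileParser, the feed method, and the sample text are hypothetical, and it uses a single "current section" variable in place of nested parser functions, so a marker row switches sections in whatever order the markers appear. It also assumes the marker rows themselves should not be stored.

```python
import csv
import io


class SectionedFileParser:
    """Collects rows into per-section lists, switching on marker rows."""

    # Marker text -> section name; the marker strings come from the thread.
    MARKERS = {
        "[MASKS]": "masks",
        "[OUTLIERS]": "outliers",
        "[MODIFIED]": "modified",
    }

    def __init__(self):
        self.sections = {"data": [], "masks": [], "outliers": [], "modified": []}
        self._current = "data"  # rows before the first marker are data

    def feed(self, reader):
        for row in reader:
            if not row:
                continue  # skip blank rows in every section
            marker = next((m for m in self.MARKERS if m in row), None)
            if marker is not None:
                self._current = self.MARKERS[marker]
                continue  # assumption: the marker row itself is not stored
            self.sections[self._current].append(row)
        return self.sections


# Hypothetical sample standing in for one tab-delimited input file.
sample = "1\t2\n\n[MASKS]\n3\t4\n[OUTLIERS]\n\n5\t6\n[MODIFIED]\n7\t8\n"
parser = SectionedFileParser()
result = parser.feed(csv.reader(io.StringIO(sample), delimiter="\t"))
```

Each instance owns its own lists, so parsing several files means creating one parser per file, with no shared state to reset between them.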