speed question, reading csv using takewhile() and dropwhile()

Sat Feb 20 20:18:24 EST 2010

On Sat, Feb 20, 2010 at 4:21 PM, Vincent Davis <vincent at vincentdavis.net> wrote:

> Thanks for the help, this is considerably faster and easier to read (see
> below). I changed it to avoid the "break", and I think that makes it easier to
> understand. Checking the conditions on every row slows it down, but that is
> worth it to me at this time.
>
>
It seems you are beginning to understand that programmer time is more
valuable than machine time. Congratulations.

> def read_data_file(filename):
>     reader = csv.reader(open(filename, "U"),delimiter='\t')
>
>     data = []
>     mask = []
>     outliers = []
>     modified = []
>
>     data_append = data.append
>     mask_append = mask.append
>     outliers_append = outliers.append
>     modified_append = modified.append
>
>

I know some people cache bound methods like this to speed things up. Really,
I don't think it's necessary or wise to do so.
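
If you want to check whether the cached-method trick matters for your data,
you can time both versions. This is only a minimal sketch: plain_append and
cached_append are names made up for illustration, and the timings will vary
with the machine and the Python version.

```python
import timeit


def plain_append(n=100000):
    # Look up data.append on every iteration.
    data = []
    for i in range(n):
        data.append(i)
    return data


def cached_append(n=100000):
    # Look up the bound method once and reuse it.
    data = []
    data_append = data.append
    for i in range(n):
        data_append(i)
    return data


if __name__ == '__main__':
    for func in (plain_append, cached_append):
        seconds = timeit.timeit(func, number=20)
        print('%s: %.3f s' % (func.__name__, seconds))
```

Both functions build the same list, so any difference is purely in lookup
overhead, which you can then weigh against the extra names to keep track of.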

>     maskcount = 0
>     outliercount = 0
>     modifiedcount = 0
>
>     for row in reader:
>         if '[MASKS]' in row:
>             maskcount += 1
>         if '[OUTLIERS]' in row:
>             outliercount += 1
>         if '[MODIFIED]' in row:
>             modifiedcount += 1
>         if not any((maskcount, outliercount, modifiedcount, not row)):
>             data_append(row)
>         elif not any((outliercount, modifiedcount, not row)):
>             mask_append(row)
>         elif not any((modifiedcount, not row)):
>             outliers_append(row)
>         else:
>             if row: modified_append(row)
>
>

Just playing with the logic here:

1. Notice that if "not row" is True, nothing happens? Pull it out
explicitly.

2. Notice how it switches from mode to mode? Program it more explicitly.

Here's my suggestion:

def parse_masks(reader):
    # Collect mask rows until a later section header hands off to its parser.
    for row in reader:
        if not row: continue
        elif '[OUTLIERS]' in row: parse_outliers(reader)
        elif '[MODIFIED]' in row: parse_modified(reader)
        else: masks.append(row)

def parse_outliers(reader):
    # Collect outlier rows until [MODIFIED] hands off to its parser.
    for row in reader:
        if not row: continue
        elif '[MODIFIED]' in row: parse_modified(reader)
        else: outliers.append(row)

def parse_modified(reader):
    # Last section: every remaining non-empty row is a modified row.
    for row in reader:
        if not row: continue
        else: modified.append(row)

for row in reader:
    if not row: continue
    elif '[MASKS]' in row: parse_masks(reader)
    elif '[OUTLIERS]' in row: parse_outliers(reader)
    elif '[MODIFIED]' in row: parse_modified(reader)
    else: data.append(row)

Since there is global state involved (the data, masks, outliers and modified
lists), you may want to save yourself some trouble in the future and put the
above in a class, so that each parser instance keeps its own state.
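
A rough sketch of what such a class could look like follows. The names
SectionedFileParser, SECTION_HEADERS and the sample text are made up for
illustration, and it swaps the nested functions for a single "current
section" variable, so unlike the functions above it accepts the sections in
any order. Treat it as a starting point, not a drop-in replacement.

```python
import csv
import io

# Hypothetical mapping from header marker to the section it starts.
SECTION_HEADERS = {
    '[MASKS]': 'masks',
    '[OUTLIERS]': 'outliers',
    '[MODIFIED]': 'modified',
}


class SectionedFileParser(object):
    """Each instance owns its lists, so parsers never share state."""

    def __init__(self):
        self.sections = {'data': [], 'masks': [], 'outliers': [], 'modified': []}

    def parse(self, reader):
        current = 'data'  # rows before any header are plain data
        for row in reader:
            if not row:
                continue  # skip blank lines
            header = next((h for h in SECTION_HEADERS if h in row), None)
            if header is not None:
                current = SECTION_HEADERS[header]  # switch mode
            else:
                self.sections[current].append(row)
        return self.sections


# Usage with a small in-memory, tab-delimited sample.
sample = "a\tb\n\n[MASKS]\nm1\tm2\n[OUTLIERS]\no1\n[MODIFIED]\nx1\tx2\n"
parser = SectionedFileParser()
result = parser.parse(csv.reader(io.StringIO(sample), delimiter='\t'))
```

Parsing a second file is then just a matter of creating a second
SectionedFileParser, with no lists to reset in between.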

It looks like your program is turning into a regular old parser. Any format
that is a little more than trivial to parse will need a real parser like the
above.

-- 
Jonathan Gardner
jgardner at jonathangardner.net