looking for speed-up ideas

Bengt Richter bokr at oz.net
Sun Feb 9 00:22:16 EST 2003


On Wed, 05 Feb 2003 04:41:28 GMT, Ram Bhamidipaty <ramb at sonic.net> wrote:

>> ====< ramb.py >=====================================================
>> # ramb.py
>> import sys
>> lines = file(sys.argv[1]).readlines()
>> tups = []; tapp = tups.append; i = -1
>> for line in lines:
>>     i += 1
>>     if line.startswith('F'): tapp((int(line.split('/')[1]), i))
>> tups.sort()
>> dict200 = dict([(i, size) for size, i in tups[-200:]])
>> 
>> path = [tuple(lines[0].split()[1:])]
>> i = -1
>> for line in lines:
>>     i += 1
>>     if line.startswith('S'):
>>         name, parent, thisnum = line[2:].split('/')
>>         while path and path[-1][1] != parent: path.pop()
>>         path.append((name, thisnum.strip()))
>>     if not dict200.has_key(i): continue
>>     dict200[i] = (dict200[i], path[:], lines[i]) # size, path, fname
>> tups = dict200.values()
>> tups.sort()
>> fmt = '%12s  %s' 
>> print fmt % ('size', 'path')
>> print fmt % ('-'*12, '-'*50)
>> for size, path, fname in tups:
>>     fname = fname.split('/')[-1].strip()
>>     path = '/'.join([name for name, num in path]+[fname])
>>     print fmt % (size, path)    
>> ====================================================================
>> running this on your test data
>
>> I'd be curious how long it would run on your machine. I assume your
>> memory is large enough to hold the line list.
>
>Thank you for the reply.
>
>Your script ran in:
>
>espring> python /remote/espring/ramb/tools/lib/python2.2/profile.py script4.py /tmp/foo /tmp/foo_4
>         3 function calls in 36.720 CPU seconds
>
>You may want to take a look at some of the other postings in this thread. I
>suspect that this algorithm could also be tweaked to get into the same 16-second
>range as the other scripts.
>
Well, it wasn't much of a shot at an optimum: practically no work gets done in the
shadow of all that I/O, since everything waits for the whole file to be read into
memory before any processing starts. A single streaming pass, something like the
sketch below, could overlap the two.
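
For what it's worth, here is a rough sketch of how the first, size-scanning pass
could run while the file is still being read, instead of after readlines() has
pulled everything in. It keeps only the 200 biggest F-lines as it goes; the
path-building pass isn't shown, and the 200 cutoff and the F-line layout are
just copied from the script above.

import sys, bisect

top = []                        # sorted list of (size, lineno), smallest first
f = file(sys.argv[1])
i = -1
for line in f:                  # iterate the file object -- no big line list
    i += 1
    if not line.startswith('F'):
        continue
    size = int(line.split('/')[1])
    if len(top) < 200:
        bisect.insort(top, (size, i))
    elif size > top[0][0]:      # bigger than the smallest of the current 200
        bisect.insort(top, (size, i))
        del top[0]
f.close()
dict200 = dict([(i, size) for size, i in top])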

Also, getting the paths built correctly proved more complex than it might have been
with a different encoding of the input data. And a fixed-width size field in the
F-lines would have wasted some space, but it would have let the lines sort correctly
without any int conversion. Do you have any control over the encoding of the input?
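
Something like this is what I had in mind; the 12-digit width is made up, but
the point is that zero-padded size strings already sort in numeric order, so
the int() call could go away:

# zero-padded, fixed-width sizes compare correctly as plain strings
padded = ['000000004096', '000000000512', '000001048576']
padded.sort()                   # lexicographic order == numeric order here
print padded                    # ['000000000512', '000000004096', '000001048576']

# the current variable-width encoding needs a conversion first
raw = ['4096', '512', '1048576']
nums = map(int, raw)
nums.sort()
print nums                      # [512, 4096, 1048576]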

>It would be impressive if there were a _pure_ python script that could
>deliver the performance of the grep + sort + tail command pipe line.
>
With the help of Psyco, maybe.
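
Untested, but the Psyco version would be something like the following; process()
is just a stand-in name for the scanning loop, and the script degrades to plain
Python if Psyco isn't installed:

def process(filename):
    # ... the line-scanning and sorting work would go here ...
    pass

try:
    import psyco
    psyco.bind(process)         # or psyco.full() to compile everything
except ImportError:
    pass                        # no Psyco -- run as ordinary Python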

Regards,
Bengt Richter



