looking for speed-up ideas

Bengt Richter bokr at oz.net
Wed Feb 5 00:00:48 CET 2003


On Tue, 04 Feb 2003 00:39:09 GMT, Ram Bhamidipaty <ramb at sonic.net> wrote:

>I have some python code that processes a large file. I want to see how
>much faster this code can get. Mind you, I don't _need_ the code to go
>faster - but it sure would be nice if it were faster...
>
>Here is a specification of the input:
> 1. Lines start with T, S or F
> 2. The first line of the file starts with
>    T, all other lines start with S or F.
> 3. F lines look like "F/<number>/string"
> 4. S lines look like "S/string/<number>/<number>"
>
>Here is a sample:
>
>T /remote 0
>S/name/0/1
>S/joe/1/2
>S/bob/1/3
>F/3150900/big_file.tar.gz
>S/testing/3/4
>F/414/.envrc
>F/276/BUILD_FLAGS
>F/36505/make.incl
>F/3861/build_envrc
>
>In case you are curious the file is a dump of a file system. F lines
>specify a file name and file size. S lines specify a directory. The
>numbers on an S line represent a directory number and a directory
>parent number. All the F lines under an S line are files in a
>particular directory.
>
>My script reads the file and prints out the 200 largest files.
>
Thank you for taking the care to prepare a clear post ;-)

I think the first thing I would try is to minimize the total processing
by doing only what's needed to identify the 200 files you are interested
in. To do that, I would form a list of (size, line number) 2-tuples from
the F-lines, then sort it; the last 200 tuples give the line numbers and
sizes of the biggest files.
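
Since the thread is about speed-up ideas: if sorting the full list of
F-line tuples ever shows up as the bottleneck, one alternative is to
keep only a bounded, sorted list of the 200 biggest as you scan, using
the standard bisect module. A minimal sketch of the idea (not timed,
so no promises it actually wins):

import bisect

def top_n(pairs, n=200):
    # pairs is any sequence of (size, lineno) tuples
    best = []                        # kept sorted, ascending by size
    for pair in pairs:
        if len(best) < n:
            bisect.insort(best, pair)
        elif pair > best[0]:         # bigger than current smallest?
            bisect.insort(best, pair)
            del best[0]              # drop the smallest to stay at n
    return best

For n=200 the insort shifts are cheap and you avoid sorting the whole
tuple list, though Python's sort is fast enough that this may well
lose in practice.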

Then I'd make a dictionary of that info, with line numbers as keys and
sizes as initial values. Then I'd scan the entire lines array again,
maintaining a state representing the current full directory path at the
current line number (without building a tree -- note that I am assuming
the file represents an orderly visiting of subdirectories). In this
second pass I would check each line number for presence as a key in the
dictionary of 200, and if present, replace the size value with a tuple
of the size, the current path info, and the F-line text, making
complete file spec info. This constructs full file info for all 200
largest files in the dict. I would then get the value list and sort it
(it sorts on size, since that is the first item in the tuples), and
then print suitably formatted output.
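
To make the path-stack idea concrete, here is just that state machine
run against the S-lines of your sample (a sketch; the full script is
below):

path = [('/remote', '0')]   # seeded from the T line: (name, dirnum)
for line in ['S/name/0/1', 'S/joe/1/2', 'S/bob/1/3', 'S/testing/3/4']:
    name, parent, thisnum = line[2:].split('/')
    while path and path[-1][1] != parent: path.pop()  # climb back up
    path.append((name, thisnum))
    print '/'.join([nm for nm, num in path])

which prints /remote/name, /remote/name/joe, /remote/name/bob, and
/remote/name/bob/testing -- no tree needed, as long as the dump
visits subdirectories in an orderly way.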

E.g., (Only tested as shown below)
(BTW, your code was much better commented ;-)

====< ramb.py >=====================================================
# ramb.py
import sys

lines = file(sys.argv[1]).readlines()

# First pass: collect (size, line number) tuples for all F lines.
tups = []; tapp = tups.append; i = -1
for line in lines:
    i += 1
    if line.startswith('F'): tapp((int(line.split('/')[1]), i))
tups.sort()
# The last 200 tuples are the biggest files; key the dict on line number.
dict200 = dict([(i, size) for size, i in tups[-200:]])

# Second pass: track the current directory path as a stack of
# (name, directory number) tuples, seeded from the T line.
path = [tuple(lines[0].split()[1:])]
i = -1
for line in lines:
    i += 1
    if line.startswith('S'):
        name, parent, thisnum = line[2:].split('/')
        # Pop back up until the top of the stack is this dir's parent.
        while path and path[-1][1] != parent: path.pop()
        path.append((name, thisnum.strip()))
    if not dict200.has_key(i): continue
    dict200[i] = (dict200[i], path[:], lines[i]) # size, path, fname

# Sort the surviving (size, path, fname) tuples by size and print.
tups = dict200.values()
tups.sort()
fmt = '%12s  %s'
print fmt % ('size', 'path')
print fmt % ('-'*12, '-'*50)
for size, path, fname in tups:
    fname = fname.split('/')[-1].strip()
    path = '/'.join([name for name, num in path]+[fname])
    print fmt % (size, path)
====================================================================
running this on your test data
====< ramb.txt >====================================================
T /remote 0
S/name/0/1
S/joe/1/2
S/bob/1/3
F/3150900/big_file.tar.gz
S/testing/3/4
F/414/.envrc
F/276/BUILD_FLAGS
F/36505/make.incl
F/3861/build_envrc
====================================================================
gives this output:

[15:01] C:\pywk\clp>ramb.py ramb.txt
        size  path
------------  --------------------------------------------------
         276  /remote/name/bob/testing/BUILD_FLAGS
         414  /remote/name/bob/testing/.envrc
        3861  /remote/name/bob/testing/build_envrc
       36505  /remote/name/bob/testing/make.incl
     3150900  /remote/name/bob/big_file.tar.gz

I'd be curious how long it takes to run on your machine. I assume your
memory is large enough to hold the line list.
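
If you do try it, a rough wall-clock number is easy to get -- a
minimal sketch, just bracketing the work (time.clock() has finer
resolution on Windows):

import sys, time
t0 = time.time()
# ... the two passes and the printing go here ...
sys.stderr.write('elapsed: %.2f seconds\n' % (time.time() - t0))

Writing to stderr keeps the timing out of the real output. And if the
line list is too big for memory, the file could be read twice instead
(seek(0) between passes), carrying the F-line text along in the
first-pass tuples at the cost of a second pass over the disk.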

Regards,
Bengt Richter



